PDFs are a journalist’s worst nightmare. But we can beat them!

Find all of the command-line commands over here.

Tabula: Convert table-based PDFs into spreadsheets

http://tabula.technology/

Installation

Download from their site

  • Convert table-based data into spreadsheets
  • Only works when the PDF contains real, selectable text; it can’t read scanned images
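Once Tabula has exported a table, the CSV is easy to pick up in Python. This is a minimal sketch, assuming you saved the export as a CSV with a header row. The column names and sample values are made up for illustration, and load_tabula_export is a hypothetical helper name.

```python
import csv
import io


def read_table(csv_file):
    """Read an open CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(csv_file))


def load_tabula_export(path):
    """Open a CSV saved from Tabula (path is whatever name you gave it)."""
    # newline="" is what the csv module expects when opening files
    with open(path, newline="", encoding="utf-8") as f:
        return read_table(f)


# Tiny in-memory stand-in for an export, so the sketch runs without a file
sample = io.StringIO("agency,requests\nParks,12\nPolice,48\n")
rows = read_table(sample)
print(rows[1]["agency"], rows[1]["requests"])
```

Every value comes back as a string, so numbers need an int() or float() before you do math on them.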

PDFMiner: Python PDF Parser

https://github.com/pdfminer/pdfminer.six (the original pdfminer only supports Python 2; pdfminer.six is the fork that works with Python 3)

Installation

pip install pdfminer.six
  • Open PDF files in Python
  • Also installs the pdf2txt.py tool for the command line
  • …which probably won’t work on OS X until you convert it with dos2unix
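Opening a PDF in Python can be as short as one function call. This is a minimal sketch, assuming a reasonably recent pdfminer.six that ships the pdfminer.high_level module. The file name document.pdf and the non_empty_lines helper are placeholders for illustration.

```python
def pdf_to_text(pdf_path):
    """Return the text of a PDF as one string, using pdfminer.six."""
    # Imported inside the function so the rest of the file still loads
    # on a machine where pdfminer.six is not installed yet.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def non_empty_lines(text):
    """Strip whitespace and drop blank lines, which extracted text is full of."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Usage (document.pdf is a placeholder name):
#   for line in non_empty_lines(pdf_to_text("document.pdf")):
#       print(line)
print(non_empty_lines("Page 1\n\n   Total: 42  \n\n"))
```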

OS X troubleshooting

When you try to run it you’ll probably get an error about not finding Python, with a weird \r stuck on the end of the name. The script was saved with “DOS- (or Windows-) style line endings,” which is a tragedy: the first line ends in an extra carriage-return character, so the system goes looking for an interpreter called python\r. Use the following commands to install dos2unix, a conversion utility, and then use it to fix the script.

brew install dos2unix
dos2unix /usr/local/bin/pdf2txt.py
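If you’re curious what that conversion actually does, it boils down to swapping every \r\n for a plain \n. Here is a rough sketch of the idea in Python. The names crlf_to_lf and convert_file are made up for illustration, and the real dos2unix handles more edge cases than this.

```python
def crlf_to_lf(data):
    """Convert Windows-style CRLF line endings in a bytes object to Unix-style LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path):
    """Rewrite a file in place with LF line endings."""
    with open(path, "rb") as f:
        original = f.read()
    with open(path, "wb") as f:
        f.write(crlf_to_lf(original))


# A two-line script the way it looks with DOS line endings
script = b"#!/usr/bin/env python\r\nprint('hello')\r\n"
fixed = crlf_to_lf(script)
print(fixed.splitlines()[0])
```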

If the default python on your system isn’t Python 3, you’ll also get a “module not found” error. You can run the following to force the python command on your system to be python3. This prrrobably won’t break anything, but make sure you cut and paste it instead of trying to type it.

sudo chown -R `whoami` /usr/local
ln -sf `which python3` /usr/local/bin/python
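Before changing anything, it can help to check which version the python command points at. This is a small sketch of one way to do that. The helper names are hypothetical, and it assumes the command answers to --version the way standard Python builds do.

```python
import re
import shutil
import subprocess


def parse_major_version(version_output):
    """Pull the major version out of text like 'Python 3.11.4' (None if no match)."""
    match = re.search(r"Python (\d+)\.", version_output)
    return int(match.group(1)) if match else None


def default_python_major(command="python"):
    """Report which major version a command runs, or None if it isn't on the PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, "--version"], capture_output=True, text=True)
    # Python 2 prints its version to stderr, Python 3 to stdout, so check both
    return parse_major_version(result.stdout + result.stderr)


print(parse_major_version("Python 2.7.16"))
print(parse_major_version("Python 3.11.4"))
```

If default_python_major() comes back as 3, you can skip the symlink step entirely.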

PDFQuery: XPath for PDFs in Python

https://github.com/jcushman/pdfquery

Installation

pip install pdfquery
  • Search PDFs with XPath in Python, kiiiind of
  • Selectors: in_bbox(..), overlaps_bbox(..), contains(..)
  • Coordinates are measured from the LOWER LEFT of the page
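The lower-left origin trips people up, because most image tools measure from the top left. This sketch converts a top-left box into PDF coordinates and builds a selector string from it. It assumes a page 792 points tall (US Letter) and the LTTextLineHorizontal element type. The helper names are made up for illustration, and text_in_bbox needs pdfquery installed.

```python
def top_left_to_bbox(left, top, width, height, page_height):
    """Convert a box measured from the top-left corner into PDF-style
    (x0, y0, x1, y1) coordinates measured from the lower left."""
    x0 = left
    x1 = left + width
    y1 = page_height - top           # top edge of the box
    y0 = page_height - top - height  # bottom edge of the box
    return (x0, y0, x1, y1)


def bbox_selector(bbox, tag="LTTextLineHorizontal", mode="in_bbox"):
    """Build a selector string; mode is "in_bbox" or "overlaps_bbox"."""
    return '%s:%s("%s, %s, %s, %s")' % ((tag, mode) + tuple(bbox))


def text_in_bbox(pdf_path, bbox):
    """Load a PDF with PDFQuery and return the text inside the box."""
    import pdfquery  # imported here so the helpers above work without it
    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()
    return pdf.pq(bbox_selector(bbox)).text()


# A box 100 wide and 30 tall, 20 down from the top of a 792-point-tall page
bbox = top_left_to_bbox(10, 20, 100, 30, page_height=792)
print(bbox_selector(bbox))
```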

Tesseract: Converts images to text

https://github.com/tesseract-ocr/tesseract

  • Free, open-source OCR (optical character recognition) software
  • Convert images to text

Installation (OS X)

brew install tesseract

Installation (Windows)

Download the 3.05 version from here. If you want non-English language support, pay attention during installation!

If you want to run it from the command line without typing out the entire path, it needs to be added to the PATH. Or you can cheat and run this command:

cmd /c mklink C:\Windows\System32\tesseract.exe "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
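To check whether tesseract really is on the PATH, and to run it without leaving Python, something like the following can work. It is a sketch, not the one true way: it assumes the tesseract command accepts stdout as the output name and -l for the language. The file name scan.png is a placeholder and the helper names are made up.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for: tesseract IMAGE stdout -l LANG"""
    # "stdout" as the output name tells tesseract to print the text
    # instead of writing it to a .txt file
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on the PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(tesseract_command("scan.png"))
```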

Pytesseract: Tesseract bindings for Python

https://github.com/madmaze/pytesseract

Summary: Allows you to use tesseract from Python

Installation

pip install pytesseract
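Here is roughly what using it looks like. This is a minimal sketch, assuming Pillow is installed alongside pytesseract (it supplies the Image class). The tidy helper and the tesseract_path argument are made up for illustration. On Windows you could pass the full path to tesseract.exe there instead of touching the PATH.

```python
def image_to_text(image_path, lang="eng", tesseract_path=None):
    """OCR one image file with pytesseract and return the text."""
    # Imported here so the file still loads before the packages are installed
    import pytesseract
    from PIL import Image

    if tesseract_path:
        # Point pytesseract at the executable when it isn't on the PATH
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def tidy(text):
    """Collapse the stray line breaks and runs of spaces OCR tends to produce."""
    return " ".join(text.split())


print(tidy("CITY  OF\nSPRINGFIELD \n\n Budget   Report"))
```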

Kull: Image/PDF region selection tool

https://jsoma.github.io/kull/

Summary: Visual tool for creating PDFQuery bounding boxes/statements. You can also generate tesseract OCR regions by exporting UZN files.

tesseract-uzn: Simplify using UZN files with tesseract

https://github.com/jsoma/tesseract-uzn

Summary: Simplified workflow for zone files with tesseract

  • UZN files with tesseract are kind of a pain
  • Might be useful if you’re extracting the same regions on many documents
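If you’d rather write zone files by hand, they’re just text. The sketch below assumes the usual UZN layout of one zone per line: left, top, width, height, and a one-word label, in pixels measured from the top left of the image. Treat that layout as an assumption and compare against a file exported from Kull before relying on it. The zone values and helper names are made up.

```python
def uzn_line(left, top, width, height, label):
    """Format one zone as a line of a UZN file (assumed layout, see above)."""
    return "%d %d %d %d %s" % (left, top, width, height, label)


def uzn_text(zones):
    """Join several (left, top, width, height, label) zones into file contents."""
    return "".join(uzn_line(*zone) + "\n" for zone in zones)


def write_uzn(path, zones):
    """Save the zones to a .uzn file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(uzn_text(zones))


# Two made-up regions, reusable for every document with the same layout
zones = [(50, 40, 400, 60, "headline"), (50, 120, 400, 300, "body")]
print(uzn_text(zones))
```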

Imagemagick: Edit images from the command line

https://www.imagemagick.org/script/index.php

Summary: Command-line image editing and conversion tool

  • Convert between different image formats
  • Useful for converting PDFs to PNGs for tesseract
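A PDF-to-PNG conversion for OCR might look something like this. It is a sketch that assumes the convert command from ImageMagick 6 (newer installs may call it magick instead) and a density of 300, which is a common choice for OCR rather than a requirement. The file names and helper names are placeholders.

```python
import subprocess


def pdf_to_png_command(pdf_path, png_pattern="page-%03d.png", density=300):
    """Build the argument list for: convert -density N input.pdf page-%03d.png"""
    # -density goes before the input so it controls how the PDF is rasterized;
    # %03d in the output name numbers the pages (page-000.png, page-001.png, ...)
    return ["convert", "-density", str(density), str(pdf_path), png_pattern]


def pdf_to_png(pdf_path, **options):
    """Run the conversion; raises an error if ImageMagick reports a failure."""
    subprocess.run(pdf_to_png_command(pdf_path, **options), check=True)


print(pdf_to_png_command("report.pdf"))
```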

Installation

OS X: brew install imagemagick and then brew install ghostscript 👻

Windows: Download the installer

Muckrock: Open FOIA requests

https://www.muckrock.com

Summary: Open FOIA/FOIL requests

Searching for Completed requests is a great way to find “real” documents to work on

DocumentCloud

https://www.documentcloud.org/

Summary: The gold standard for journalists doing things with documents

Cometdocs

https://www.cometdocs.com/

Summary: Convert PDFs to Excel/CSV

Not a tool on your computer, but sometimes you’re too lazy to run Tabula and you know Cometdocs will work

Other tools (may cost $$$)

Command-line commands

I moved this! Find all of the command-line commands over here.