Tools and tips for dealing with PDFs

PDFs are a journalist’s worst nightmare. But we can beat them!

Find all of the command-line commands over here.

Tabula: Convert table-based PDFs into spreadsheets

http://tabula.technology/

Installation

Download from their site

Convert table-based data into spreadsheets
Only works on PDFs with real, selectable text, though, not scanned images

PDFMiner: Python PDF Parser

https://github.com/pdfminer/pdfminer.six (the default pdfminer is for Python 2; this is the Python 3 version)

Installation

pip install pdfminer.six

Open PDF files in Python
Also installs the pdf2txt.py tool for the command line
…which probably won’t work on OS X until you use dos2unix to convert it
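Here is a minimal sketch of opening a PDF from Python, assuming a reasonably recent pdfminer.six that provides pdfminer.high_level.extract_text. The function names split_pages and pdf_pages are made up for this example, and the file name in the usage note is hypothetical. pdfminer ends each page of text output with a form feed character, so the sketch splits on that to get one string per page.

```python
def split_pages(text):
    """Split pdfminer text output into a list of pages.

    pdfminer's text output ends each page with a form feed ("\f"),
    so splitting on it gives one chunk per page. Empty chunks are dropped.
    """
    return [page.strip() for page in text.split("\f") if page.strip()]


def pdf_pages(path):
    """Return the text of each page in the PDF at `path` (hypothetical helper)."""
    # Imported here so the rest of the file still loads without pdfminer.six.
    from pdfminer.high_level import extract_text

    return split_pages(extract_text(path))
```

Something like pdf_pages("report.pdf") should then give you a list with one string of text per page, as long as the PDF contains real text and not scanned images.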

OS X troubleshooting

When you try to run it you’ll probably get an error about not finding Python, with a weird backslash \r in it. This comes from something called “DOS- (or Windows-) style line endings” and is a tragedy. Use the following commands to install dos2unix, which is a conversion utility, and then use it to convert the script.

brew install dos2unix
dos2unix /usr/local/bin/pdf2txt.py
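If you are curious what is actually going wrong, here is a small Python sketch of the problem and the fix. It is only an illustration of the idea, not how dos2unix itself is implemented, and the function names and sample script are made up for this example. Windows-style lines end in \r\n, so the first line of the script (the one that says which Python to use) ends up with a stray \r stuck on the end.

```python
def first_line(data):
    """Return the first line of a file's bytes, without the trailing newline."""
    return data.split(b"\n", 1)[0]


def to_unix_line_endings(data):
    """Replace Windows-style line endings (\\r\\n) with Unix-style ones (\\n)."""
    return data.replace(b"\r\n", b"\n")


# A made-up script saved with Windows-style line endings.
script = b"#!/usr/bin/env python\r\nprint('hello')\r\n"

# Before: the system goes looking for a program called "python\r".
print(first_line(script))

# After: the stray \r is gone and the line names plain "python".
print(first_line(to_unix_line_endings(script)))
```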

If the default python on your system isn’t Python 3, you’ll also get a “module not found” error. You can run the following to force the python command on your system to be python3. This prrrobably won’t break anything, but make sure you copy and paste it rather than trying to type it.

sudo chown -R `whoami` /usr/local
ln -sf `which python3` /usr/local/bin/python

PDFQuery: XPath for PDFs in Python

https://github.com/jcushman/pdfquery

Installation

pip install pdfquery

Search PDFs with XPath in Python, kiiiind of
in_bbox(..), overlaps_bbox(..), contains(..)
Coordinates are measured from the LOWER LEFT of the page
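The lower-left thing trips everyone up, so here is a rough sketch of building an in_bbox selector from a box you measured from the top of the page. The helper names and the 792-point page height (US Letter) are assumptions for this example. The pdfquery calls in text_in_box follow the usage shown in the pdfquery README, and LTTextLineHorizontal is one of the layout element names it exposes.

```python
def box_from_top_left(left, top, right, bottom, page_height):
    """Convert a box measured from the TOP left of the page into
    (x0, y0, x1, y1) measured from the LOWER left, which is what PDFs use."""
    return (left, page_height - bottom, right, page_height - top)


def bbox_selector(x0, y0, x1, y1, tag="LTTextLineHorizontal"):
    """Build a selector matching elements that sit entirely inside the box."""
    return f'{tag}:in_bbox("{x0},{y0},{x1},{y1}")'


def text_in_box(path, x0, y0, x1, y1):
    """Return the text found inside the box (hypothetical helper)."""
    # Imported here so the helpers above work without pdfquery installed.
    import pdfquery

    pdf = pdfquery.PDFQuery(path)
    pdf.load()
    return pdf.pq(bbox_selector(x0, y0, x1, y1)).text()


# The top 92 points of the left side of a US Letter page (792 points tall).
box = box_from_top_left(0, 0, 300, 92, page_height=792)
print(bbox_selector(*box))
```

Swap in_bbox for overlaps_bbox in the selector if you also want elements that only partly fall inside the box.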

Tesseract: Converts images to text

https://github.com/tesseract-ocr/tesseract

Free, open-source OCR (optical character recognition) software
Convert images to text

Installation (OS X)

brew install tesseract

Installation (Windows)

Download the 3.05 version from here. If you want it to read languages other than English, pay attention to the language options during installation!

If you want to run it from the command line without typing out the entire path, it needs to be added to the PATH. Or you can cheat and run this command:

cmd /c mklink C:\Windows\System32\tesseract.exe "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
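Once tesseract runs from the command line, you can also call it from Python. This is a minimal sketch that assumes the tesseract command is on your PATH. The function names and the image file name in the usage note are made up for this example. It relies on two things from tesseract's command line: passing stdout in place of an output file name prints the text instead of saving it, and -l picks the language.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command that OCRs an image and prints the text.

    "stdout" goes where the output file name normally would, and
    -l selects the language (it has to be installed, see above).
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def image_to_text(image_path, lang="eng"):
    """Run tesseract on an image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,  # collect the printed text instead of showing it
        text=True,            # give us a str, not bytes
        check=True,           # raise an error if tesseract fails
    )
    return result.stdout
```

For example, image_to_text("scan.png") would return whatever text tesseract can find in that image.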