Tools and tips for dealing with PDFs
PDFs are a journalist’s work nightmare. But we can beat them!
Find all of the command-line commands over here.
Tabula: Convert table-based PDF into spreadsheets
Installation
- Convert table-based data into spreadsheets
- Only works on PDFs with selectable text; it can’t read scanned images
PDFMiner: Python PDF Parser
https://github.com/pdfminer/pdfminer.six (the default version is Python 2, this is the Python 3 version)
Installation
pip install pdfminer.six
- Open PDF files in Python
- Also installs the pdf2txt.py tool for the command line
- …which probably won’t work on OS X; you’ll need to use dos2unix to convert it
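As a rough sketch of the "open PDF files in Python" part: recent pdfminer.six releases include a pdfminer.high_level.extract_text helper that returns a PDF's text as one string. The nonblank_lines helper and the report.pdf file name below are made up for illustration.

```python
def pdf_to_text(path):
    """Return all the text pdfminer.six can find in the PDF at `path`."""
    # Imported here so the helper below still works without pdfminer.six installed
    from pdfminer.high_level import extract_text
    return extract_text(path)


def nonblank_lines(text):
    """Split extracted text into stripped lines, dropping the empty ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example (hypothetical file name):
# for line in nonblank_lines(pdf_to_text("report.pdf")):
#     print(line)
```

Extracted text tends to be full of blank lines and stray spacing, which is why the sketch tidies it before you do anything else with it.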
OS X troubleshooting
When you try to run it you’ll probably get an error about not finding Python, with a weird backslash \r. This is something called “DOS- (or Windows-) style line endings” and is a tragedy. Use the following commands to install dos2unix, which is a conversion utility, and then use it to convert the command.
brew install dos2unix
dos2unix /usr/local/bin/pdf2txt.py
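If you can't install dos2unix, the same conversion can be sketched in a few lines of Python. This is an illustration of what the tool does to the file, not a replacement for it; the function name crlf_to_lf is made up.

```python
from pathlib import Path


def crlf_to_lf(path):
    """Rewrite a file in place, turning Windows CRLF line endings into Unix LF."""
    p = Path(path)
    data = p.read_bytes()                 # bytes, so Python doesn't translate newlines itself
    fixed = data.replace(b"\r\n", b"\n")  # drop the \r that confuses the shebang line
    if fixed != data:
        p.write_bytes(fixed)
    return fixed != data                  # True if anything was changed


# Example, using the same path as the dos2unix command:
# crlf_to_lf("/usr/local/bin/pdf2txt.py")
```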
If the default python on your system isn’t Python 3, you’ll also get a “module not found” error. You can run the following to force the python command on your system to be python3. This prrrobably won’t break anything, but make sure you cut and paste it; don’t try to type it.
sudo chown -R `whoami` /usr/local
ln -sf `which python3` /usr/local/bin/python
PDFQuery: XPath for PDFs in Python
https://github.com/jcushman/pdfquery
Installation
pip install pdfquery
- Search PDFs with XPath in Python, kiiiind of
- Custom selectors: in_bbox(..), overlaps_bbox(..), contains(..)
- Coordinates from LOWER LEFT
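The lower-left origin trips people up, because image tools usually measure from the top left. Here is a minimal sketch of flipping a box and feeding it to PDFQuery's in_bbox selector. The helper names, the page height of 792 points (US Letter), and the example box are assumptions for illustration.

```python
def top_left_box_to_bbox(left, top, width, height, page_height):
    """Convert a box measured from the top-left corner (as image tools report it)
    into PDF-style (x0, y0, x1, y1), measured from the lower-left corner."""
    x0 = left
    x1 = left + width
    y1 = page_height - top            # top edge, counted up from the bottom
    y0 = page_height - top - height   # bottom edge
    return (x0, y0, x1, y1)


def in_bbox_selector(bbox, element="LTTextLineHorizontal"):
    """Build a PDFQuery selector string for everything inside `bbox`."""
    return '%s:in_bbox("%s,%s,%s,%s")' % ((element,) + tuple(bbox))


def text_in_box(pdf_path, bbox):
    """Load the PDF and return the text of the lines inside `bbox`."""
    # Imported here so the helpers above run without pdfquery installed
    import pdfquery
    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()
    return pdf.pq(in_bbox_selector(bbox)).text()


# Example (hypothetical file and box):
# bbox = top_left_box_to_bbox(100, 50, 200, 30, page_height=792)
# print(text_in_box("report.pdf", bbox))
```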
Tesseract: Converts images to text
https://github.com/tesseract-ocr/tesseract
- Free, open-source OCR (optical character recognition) software
- Convert images to text
Installation (OS X)
brew install tesseract
Installation (Windows)
Download the 3.05 version from here. If you want non-English language support, pay attention during installation!
If you want to run it from the command line without typing out the entire path, it needs to be added to the PATH. Or you can cheat and run this command:
cmd /c mklink C:\Windows\System32\tesseract.exe "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
Pytesseract: Tesseract bindings for Python
https://github.com/madmaze/pytesseract
Summary: Allows you to use tesseract from Python
Installation
pip install pytesseract
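A minimal sketch of what using it looks like, assuming Pillow is also installed (pip install pillow) to open the image. The function name ocr_image and the page.png file name are made up; the tesseract_cmd setting is only needed when tesseract isn't on your PATH.

```python
def ocr_image(image_path, lang="eng", tesseract_cmd=None):
    """Run tesseract on one image file and return the recognized text."""
    # Imported here: both packages are third-party installs
    import pytesseract
    from PIL import Image

    if tesseract_cmd:
        # Point pytesseract at the executable when it isn't on your PATH
        # (common on Windows)
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


# Example (hypothetical file name):
# print(ocr_image("page.png"))
```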
Kull: Image/PDF region selection tool
Summary: Visual tool for creating PDFQuery bounding boxes/statements. You can also generate tesseract OCR regions by exporting UZN files
tesseract-uzn: Simplify using UZN files with tesseract
https://github.com/jsoma/tesseract-uzn
Summary: Simplified workflow for zone files with tesseract
- UZN files with tesseract are kind of a pain
- Might be useful if you’re extracting the same regions on many documents
Imagemagick: Edit images from the command line
https://www.imagemagick.org/script/index.php
Summary: Command-line image editing and conversion tool
- Convert between different image formats
- Useful for converting PDFs to PNG for tesseract
Installation (OS X)
OS X: brew install imagemagick and then brew install ghostscript 👻
Windows: Download the installer
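The PDF-to-PNG-to-tesseract route can be sketched from Python by calling both command-line tools. This assumes the classic convert command (newer ImageMagick 7 installs prefer magick), a density of 300 DPI as a guess at a reasonable OCR resolution, and a made-up scan.pdf file name. Multi-page PDFs will come out as several numbered PNGs rather than one.

```python
import subprocess
from pathlib import Path


def pdf_to_png_command(pdf_path, density=300):
    """Build the ImageMagick command that renders a PDF as PNG."""
    out = str(Path(pdf_path).with_suffix(".png"))
    # -density goes before the input so the PDF is rasterized at that DPI
    return ["convert", "-density", str(density), str(pdf_path), out]


def tesseract_command(png_path):
    """Build the tesseract command; it writes <name>.txt next to the image."""
    base = str(Path(png_path).with_suffix(""))
    return ["tesseract", str(png_path), base]


def run(cmd):
    # check=True raises an error if the tool fails instead of continuing silently
    subprocess.run(cmd, check=True)


# Example (hypothetical file name):
# run(pdf_to_png_command("scan.pdf"))
# run(tesseract_command("scan.png"))
```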
Muckrock: Open FOIA requests
Summary: Open FOIA/FOIL requests
Searching for Completed requests is a great way to find “real” documents to work on
DocumentCloud
https://www.documentcloud.org/
Summary: The gold standard for journalists doing things with documents
Cometdocs
Summary: Convert PDFs to Excel/CSV
Not a tool on your computer, but sometimes you’re too lazy to run Tabula and you know Cometdocs will work
Other tools (may cost $$$)
- ABBYY FineReader / OS X Version
- Omnipage
- Might want to give a read-through of You Got the Documents. Now What? by Jonathan Stray
Command-line commands
I moved this! Find all of the command-line commands over here.