
Selecting your tool for processing PDFs

                    Selectable text: Yes    Selectable text: No
Tabular data: Yes   Camelot                 Sigh, read below
Tabular data: No    pdfminer.six            OCRmyPDF or Tika
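The same decision table can be written as a small lookup. This is only a sketch: `choose_tool` is a hypothetical helper name, and the strings it returns simply restate the four cells of the table.

```python
def choose_tool(selectable_text: bool, tabular_data: bool) -> str:
    """Return the table cell for a PDF with the given properties."""
    table = {
        (True, True): "Camelot",
        (True, False): "pdfminer.six",
        (False, True): "Sigh, read below (OCR with tabular data)",
        (False, False): "OCRmyPDF or Tika",
    }
    return table[(selectable_text, tabular_data)]
```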

Camelot

Camelot is great for extracting tabular data from PDFs. It's like Tabula, but a hundred times better.

Requires selectable text
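A minimal sketch of pulling tables out with Camelot's `read_pdf`. The function name `extract_tables`, the default page, and the choice of `flavor` are assumptions for illustration: `lattice` expects ruled lines between cells, while `stream` is meant for tables laid out with whitespace.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return each table Camelot finds as a pandas DataFrame."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the path check above works even if camelot is missing.
    import camelot

    tables = camelot.read_pdf(str(path), pages=pages, flavor=flavor)
    return [table.df for table in tables]
```

Calling `extract_tables("report.pdf", pages="1-3")` on a hypothetical `report.pdf` would give a list of DataFrames, each of which can be saved with `DataFrame.to_csv`.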

pdfminer.six

If you have a PDF that has selectable text, you can use pdfminer.six to extract the text into a Python script. There are about ten thousand other libraries you can use to do this, but I find pdfminer.six to be the easiest to use.

Requires selectable text
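A minimal sketch of the pdfminer.six route, using its high-level `extract_text` function. The wrapper name `pdf_to_text` is hypothetical, and it returns an empty string for a PDF with no text layer rather than raising.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return all selectable text in the PDF as one string."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")

    # Imported here so the path check above works even if pdfminer.six is missing.
    from pdfminer.high_level import extract_text

    # extract_text returns "" (or whitespace) when there is no text layer.
    return extract_text(str(path)).strip()
```

If the result comes back empty, the PDF probably has no selectable text and belongs in the right-hand column of the table at the top.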

Tika

Tika is a dream. You can throw any sort of document at it (PDF, Word, Excel, PowerPoint, HTML, etc.) and it will extract the text for you. If you have Tesseract installed, it will also extract the text from images. It isn't the easiest to install, but it's more than worth it.
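A minimal sketch of calling Tika from Python. It assumes the `tika` package (tika-python), which is one of several ways to reach Tika and, as far as I know, needs Java available so it can start a Tika server behind the scenes. The wrapper name `tika_to_text` is hypothetical.

```python
from pathlib import Path


def tika_to_text(doc_path):
    """Return the text Tika extracts from a PDF, Word file, HTML page, etc."""
    path = Path(doc_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such document: {path}")

    # Imported here so the path check above works even if tika is missing.
    from tika import parser

    parsed = parser.from_file(str(path))
    # "content" can be None when Tika finds no text at all.
    return (parsed.get("content") or "").strip()
```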

OCRmyPDF

OCRmyPDF is great for quickly adding a text layer to your document. Find out more on the OCR tools page.
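A minimal sketch of adding a text layer through OCRmyPDF's Python API. The wrapper name `add_text_layer` and both option choices are assumptions: `deskew=True` straightens crooked scans, and `skip_text=True` leaves pages that already have text alone.

```python
from pathlib import Path


def add_text_layer(src_pdf, dst_pdf):
    """Write a copy of src_pdf to dst_pdf with an OCR text layer added."""
    src = Path(src_pdf)
    if not src.is_file():
        raise FileNotFoundError(f"No such PDF: {src}")

    # Imported here so the path check above works even if ocrmypdf is missing.
    import ocrmypdf

    ocrmypdf.ocr(str(src), str(dst_pdf), deskew=True, skip_text=True)
    return Path(dst_pdf)
```

The output is a normal PDF with selectable text, so it can then be handed to a text extractor.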

OCR with tabular data

If you can't select the text in your PDF, the pages are really just images, and those images need to be converted into text. OCR (optical character recognition) looks at an image and guesses what the text in it says.

It can be incredibly useful if you're making a rough guess about the content of a long document, but I'm really against using it for tabular data. Typically tabular data is a long list of important numbers, where accidentally reading a 7 as a 1 is going to cause a lot of trouble. Since you're probably going to be doing some sort of analysis on the data, you want to make sure you're getting the right numbers!

If you use an OCR tool on a PDF and then feed the result into something like Camelot to create a CSV, you're bound to get errors, but because the process is so easy you probably won't notice them. Entering the data by hand in those situations goes a long way toward confirming what the data really is.
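To make the 7-versus-1 problem concrete, here is a rough sketch of one cheap sanity check. It assumes, hypothetically, that each row of the table ends with a printed total; the function name and the sample numbers are invented for illustration. It only catches misreads that break the arithmetic, so it is no substitute for checking the numbers by hand.

```python
def rows_failing_total_check(rows):
    """Return indexes of rows whose values do not add up to the row's total.

    Each row is a list of integers; the last item is the printed total.
    """
    bad = []
    for index, row in enumerate(rows):
        *values, total = row
        if sum(values) != total:
            bad.append(index)
    return bad


ocr_rows = [
    [120, 45, 165],  # 120 + 45 = 165, consistent
    [310, 71, 387],  # the page said 77, OCR read 71: 310 + 71 = 381, not 387
]
print(rows_failing_total_check(ocr_rows))  # [1]
```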