Selecting your tool for processing PDFs¶
| | Selectable text: yes | Selectable text: no |
|---|---|---|
| Tabular data: yes | Camelot | Sigh, read below |
| Tabular data: no | pdfminer.six or pdfplumber | OCRmyPDF or Tika |
Camelot¶
Camelot is great for extracting tabular data from PDFs. It's like Tabula, but a hundred times better.
Requires selectable text
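A minimal sketch of pulling tables out with Camelot and saving each one as a CSV. The function names, the output naming scheme, and the default `flavor="lattice"` are assumptions for illustration; `lattice` tends to suit tables drawn with ruling lines, while `stream` is the usual choice when columns are separated only by whitespace.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_index, out_dir="."):
    """Build an output name like out/budget-table-2.csv (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}-table-{table_index}.csv"


def export_tables(pdf_path, pages="1", flavor="lattice", out_dir="."):
    """Write every table Camelot finds on the given pages to its own CSV file."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor=flavor)
    written = []
    for i in range(tables.n):
        target = csv_path_for(pdf_path, i + 1, out_dir)
        tables[i].to_csv(str(target))
        # parsing_report includes an accuracy score worth a glance before trusting the CSV
        print(target, tables[i].parsing_report)
        written.append(target)
    return written
```

Calling `export_tables("report.pdf", pages="1-3")` would process the first three pages of a hypothetical `report.pdf`; if nothing comes back, trying `flavor="stream"` is a reasonable next step.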
pdfminer.six¶
If your PDF has selectable text, you can use pdfminer.six to extract that text from within a Python script. There are about ten thousand other libraries that can do this, but I find pdfminer.six the easiest to use.
Requires selectable text
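A short sketch using pdfminer.six's high-level `extract_text` function. The wrapper names `pdf_to_text` and `looks_scanned` are made up for this example. Treating an empty result as "this PDF has no text layer" is a rule of thumb, not a guarantee.

```python
def pdf_to_text(pdf_path, page_numbers=None):
    """Return the text of a PDF; page_numbers is a list of zero-indexed pages, or None for all."""
    from pdfminer.high_level import extract_text  # third-party: pip install pdfminer.six

    return extract_text(pdf_path, page_numbers=page_numbers)


def looks_scanned(text):
    """True when extraction produced nothing but whitespace, which usually means OCR is needed."""
    return not text.strip()
```

For a hypothetical `report.pdf`, `pdf_to_text("report.pdf", page_numbers=[0])` would return only the first page, and `looks_scanned(...)` on the result gives a quick hint about which tool to reach for.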
pdfplumber¶
Another option for selectable text is pdfplumber, which is far more powerful than pdfminer.six (it exposes the position of every character, line, and rectangle on the page) but also more complex.
Requires selectable text
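To give a feel for the extra control, here is a rough sketch that reads each word's position with pdfplumber's `extract_words()` and regroups the words into lines. The helper names and the 3-point tolerance are assumptions chosen for illustration.

```python
def words_to_lines(words, tolerance=3.0):
    """Group word dicts (with 'text', 'x0', 'top' keys) into lines of text, top to bottom."""
    lines = []  # each entry: (top of first word in the line, [words])
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word["top"], [word]))
    return [
        " ".join(w["text"] for w in sorted(group, key=lambda w: w["x0"]))
        for _, group in lines
    ]


def page_lines(pdf_path, page_index=0):
    """Open a PDF with pdfplumber and return one page's text as a list of lines."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return words_to_lines(pdf.pages[page_index].extract_words())
```

Because every word carries coordinates, the same list can be filtered to a region of the page before regrouping, which is handy for skipping headers and footers.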
Tika¶
Tika is a dream. You can throw any sort of document at it (PDF, Word, Excel, PowerPoint, HTML, etc.) and it will extract the text for you. If you have Tesseract installed, it will also extract the text from images. It isn't the easiest to install, but it's more than worth it.
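One way to call Tika from Python is the third-party `tika` package, sketched below. That choice is an assumption, since there are other ways to run Tika; the package needs Java and fetches the Tika server the first time it runs. The helper names are hypothetical.

```python
def content_or_empty(parsed):
    """Tika returns a dict; 'content' can be None when nothing was extracted."""
    return (parsed.get("content") or "").strip()


def extract_with_tika(path):
    """Extract text from any document type Tika understands."""
    from tika import parser  # third-party: pip install tika (requires Java)

    return content_or_empty(parser.from_file(str(path)))
```

The same `extract_with_tika` call should work for a Word file or a spreadsheet as well as a PDF, which is most of Tika's appeal.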
OCRmyPDF¶
OCRmyPDF is great for quickly adding a text layer to your document. Find out more on the OCR tools page.
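A small sketch that drives the `ocrmypdf` command-line tool from Python. The wrapper names and default options are assumptions; `--skip-text` leaves pages that already have text untouched, and `--language` takes a Tesseract language code.

```python
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, language="eng", skip_text=True):
    """Build the argument list for the ocrmypdf command-line tool."""
    cmd = ["ocrmypdf", "--language", language]
    if skip_text:
        cmd.append("--skip-text")  # do not re-OCR pages that already contain text
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


def add_text_layer(input_pdf, output_pdf, **options):
    """Run ocrmypdf and raise CalledProcessError if it fails."""
    subprocess.run(ocrmypdf_command(input_pdf, output_pdf, **options), check=True)
```

After `add_text_layer("scan.pdf", "scan-ocr.pdf")` on a hypothetical scan, the output file should have selectable text that the other tools on this page can read.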
OCR with tabular data¶
If you can't select the text in your PDF, it needs to be converted into text. OCR (optical character recognition) takes images and tries to make guesses about what the text is.
It can be incredibly useful if you're making a rough guess about the content of a long document, but I'm really against using it for tabular data. Typically tabular data is a long list of important numbers, where accidentally reading a 7 as a 1 is going to cause a lot of trouble. Since you're probably going to be doing some sort of analysis on the data, you want to make sure you're getting the right numbers!
If you use an OCR tool on a PDF and then feed the result into something like Camelot to create a CSV, you're bound to get errors, but because the process is so easy you probably won't notice them. Entering the data manually in those situations goes a long way toward confirming what the data actually is.
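If you do end up with OCR'd numbers, a cheap sanity check can at least point out which rows to re-enter by hand. The sketch below assumes, purely for illustration, that each row ends with a printed total of the cells before it; real tables may have no such redundancy, and a passing check does not prove the row is right.

```python
from decimal import Decimal, InvalidOperation


def parse_number(cell):
    """Return a Decimal for a cell like '1,234.50', or None if it is not a clean number."""
    try:
        return Decimal(cell.replace(",", "").strip())
    except InvalidOperation:
        return None


def rows_needing_review(rows):
    """Return indexes of rows whose cells do not add up to the row's last cell.

    Assumes each row is a list of strings and the last cell is a printed total.
    """
    flagged = []
    for index, row in enumerate(rows):
        values = [parse_number(cell) for cell in row]
        if None in values or sum(values[:-1]) != values[-1]:
            flagged.append(index)
    return flagged
```

A row where a 7 was read as a 1 no longer matches its total, and a row where a 0 came through as the letter O fails to parse, so both get flagged for manual review.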