Commands for PDF analysis
The important question is: does your PDF have selectable text? If so, Tabula or pdf2txt.py will work great for you. If not, or if the text is dirty, tesseract is your best bet.
Find all of the tools over here.
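Not sure which camp your PDF is in? One quick check is to try pulling text off the first page and see whether anything comes back. Here’s a minimal sketch using pdfminer.six, the library behind pdf2txt.py; the filename is a placeholder.
from pdfminer.high_level import extract_text

# grab just the first page; if this comes back (mostly) empty,
# the PDF is probably a scan and you'll want tesseract instead
text = extract_text("source.pdf", maxpages=1)
print(repr(text[:500]))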
Turn a PDF of tabular data into a CSV
You should probably use Tabula, not a command-line tool!
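That said, if you’d rather stay in a notebook, the tabula-py wrapper drives the same Tabula engine from Python. Here’s a rough sketch, assuming you have tabula-py (and the Java it needs) installed; the filenames are placeholders.
import tabula

# pull every table from every page of source.pdf into one CSV
tabula.convert_into("source.pdf", "output.csv", output_format="csv", pages="all")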
Extract certain parts of a PDF
You should probably use Kull + PDFQuery, or, if you’re having a hard time, tesseract plus a zone file (see below).
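If you go the PDFQuery route, the workflow is roughly: load the PDF, then pull text out of a bounding box using the coordinates Kull gave you. A minimal sketch, assuming PDFQuery is installed; the filename and coordinates here are made up.
import pdfquery

pdf = pdfquery.PDFQuery("source.pdf")
pdf.load()

# coordinates are (x0, y0, x1, y1) in PDF points, from Kull
text = pdf.pq('LTTextLineHorizontal:in_bbox("100, 400, 500, 450")').text()
print(text)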
Extract all text from a pdf
You’ll use pdf2txt.py for this, which comes with PDFMiner.
Display the text on the command line (print to standard output):
pdf2txt.py source.pdf
You can also save the result to a file:
pdf2txt.py source.pdf -o output.txt
Convert an image to text (OCR)
You should use tesseract for this. Print the contents:
tesseract myfile.png stdout
Save the result to output.txt.txt:
tesseract myfile.png output.txt
Yes, it will really call the file output.txt.txt (tesseract adds .txt to whatever output name you give it), but it’s an easier command to remember.
You might want to remove ligatures if tesseract is turning “fi” into the single ligature character “ﬁ”.
tesseract myfile.png output.txt -c tessedit_char_blacklist=ﬁﬂ
Convert MANY images to text (OCR)
You can scroll down to where I talk about bash scripts below, but Marcel had the great idea of doing it in a notebook: use Python’s glob to find all of the files, then use ! to jump out to the command line and run tesseract on each one.
import glob

pngs = glob.glob('your_folder_with_pngs/*.png')
for png in pngs:
    # [:-4] strips ".png" so tesseract writes file.txt next to each image
    !tesseract {png} {png[:-4]} -c tessedit_char_blacklist=ﬁﬂ
Convert a pdf into a 300dpi PNG in preparation for OCR
If you have an image or a PDF with bad OCR (like we did with the franklin example), you can convert the PDF to a PNG and then use tesseract.
If it’s just one page, or you’re okay with it creating multiple image files:
convert -density 300 file.pdf output.png
Multi-page PDFs on the command line
If it’s a multi-page PDF and you want it to only create one PNG file (which is usually better, but much much much slower when using tesseract), you need to change the command a little:
convert -density 300 franklin.pdf -append output.png
Fair warning, this can create REALLY BIG files and be REALLY SLOW with tesseract.
If tesseract tells you that your file is too large (larger than 32767x32767), you can use ImageMagick to convert your png into a new png that will fit.
convert output.png -resize 32767x32767\> resized.png
This might make your text look bad, though, if it becomes really really tall and skinny.
Extract a few pages of a PDF
There are a few ways to do it, but this one uses Ghostscript (gs comes along with ImageMagick).
gs -dNOPAUSE -dBATCH -dFirstPage=2 -dLastPage=4 -sDEVICE=pdfwrite -sOutputFile=destination.pdf -f source.pdf
This will save pages 2-4 from source.pdf into destination.pdf.
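If you’d rather do the same thing in Python, the pypdf library can copy a range of pages into a new file. This is a sketch, assuming pypdf is installed; note that its page list is zero-indexed.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("source.pdf")
writer = PdfWriter()

# reader.pages[1:4] is pages 2-4 (zero-indexed)
for page in reader.pages[1:4]:
    writer.add_page(page)

with open("destination.pdf", "wb") as f:
    writer.write(f)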
Multi-page PDFs using the Python bindings for tesseract
Making sure all of your installations are in the right place can be a little tough: the newest ImageMagick might not play nicely with Wand and pytesseract, but you can cross your fingers and/or email me about it.
import io

from PIL import Image
import pytesseract
from wand.image import Image as wi

# render the PDF at 300dpi and convert every page to JPEG
pdf = wi(filename="1812490.pdf", resolution=300)
pdf_image = pdf.convert('jpeg')

images = []
for img in pdf_image.sequence:
    page = wi(image=img)
    images.append(page.make_blob('jpeg'))

# OCR each page image and collect the text
recognized_text = []
for image in images:
    im = Image.open(io.BytesIO(image))
    text = pytesseract.image_to_string(im, lang='eng')
    recognized_text.append(text)

print(recognized_text)
Use a zone file to extract text only in certain regions in an image
Use Kull to generate a zone file, which lists the coordinates of the regions you want to extract text from.
If you’re on a Mac you can install and use tesseract-uzn.
tesseract-uzn myuzn.uzn image.png
If you’re using Windows, you need to give your .uzn file the same name as your image and add -psm 4 to your tesseract command.
tesseract image.png stdout -psm 4
Download a list of URLs
You’ll need to install wget if it isn’t on your system. OS X users can use brew install wget.
Let’s say I have a file named urls.txt that looks like this:
https://example.com/001.pdf
https://example.com/301.pdf
https://example.com/041.pdf
https://example.com/AB1.pdf
I can download every file using the following command.
wget -i urls.txt
Just make sure urls.txt has one URL per line.
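If you’d rather do the downloading from a notebook instead of with wget, a loop with requests does the same job. A sketch, assuming requests is installed and urls.txt is the file above.
import requests

# read one URL per line from urls.txt
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    filename = url.split("/")[-1]  # e.g. 001.pdf
    response = requests.get(url)
    with open(filename, "wb") as out:
        out.write(response.content)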
Loop through files and run a command (OS X)
For example, this one converts all of the PDFs in the current directory into text files. You can find and run an example in the /keno/ directory.
#!/bin/bash
FILES=*.pdf
for f in $FILES
do
  echo "Processing $f..."
  pdf2txt.py "$f" -o "$f.txt"
done
It’s a shell script, which you can use to automate the command line. You save it as a .sh file and run it with bash yourscript.sh.
Loop through files and run a command (Windows)
For example, this one converts all of the PDFs in the current directory into text files. I think this works, but we can work on it more during lab.
@echo off
setlocal EnableDelayedExpansion
for %%a in (*.pdf) do (
  pdf2txt.py "%%a" -o "%%a.txt"
)
It’s a batch file, which you can use to automate the command line. You save it as a .bat file and run it with yourscript.bat.