Commands for PDF analysis
The important question is: does your PDF have selectable text? If so, Tabula or pdf2txt.py will work great for you. If not, or if the text is dirty, tesseract is your best bet.
Find all of the tools over here.
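Not sure which camp your PDF is in? One quick check is to try pulling text off the first page and see whether anything comes back. Here’s a minimal sketch using pdfminer.six, the library behind pdf2txt.py; the filename is a placeholder.
from pdfminer.high_level import extract_text

# grab just the first page; if this comes back (mostly) empty,
# the PDF is probably a scan and you'll want tesseract instead
text = extract_text("source.pdf", maxpages=1)
print(repr(text[:500]))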
Turn a PDF of tabular data into a CSV
You should probably use Tabula, not a command-line tool!
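That said, if you’d rather stay in a notebook, the tabula-py wrapper drives the same Tabula engine from Python. Here’s a rough sketch, assuming you have tabula-py (and the Java it needs) installed; the filenames are placeholders.
import tabula

# pull every table from every page of source.pdf into one CSV
tabula.convert_into("source.pdf", "output.csv", output_format="csv", pages="all")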
Extract certain parts of a PDF
You should probably use Kull + PDFQuery, or, if you’re having a hard time, tesseract plus a zone file (see below).
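If you go the PDFQuery route, the workflow is roughly: load the PDF, then pull text out of a bounding box using the coordinates Kull gave you. A minimal sketch, assuming PDFQuery is installed; the filename and coordinates here are made up.
import pdfquery

pdf = pdfquery.PDFQuery("source.pdf")
pdf.load()

# coordinates are (x0, y0, x1, y1) in PDF points, from Kull
text = pdf.pq('LTTextLineHorizontal:in_bbox("100, 400, 500, 450")').text()
print(text)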
Extract all text from a pdf
You’ll use pdf2txt.py for this, which comes with PDFMiner.
Display the text on the command line (print to standard output):
pdf2txt.py source.pdf
You can also save the result to a file:
pdf2txt.py source.pdf -o output.txt
Convert an image to text (OCR)
You should use tesseract for this. Print the contents:
tesseract myfile.png stdout
Save the result to output.txt.txt:
tesseract myfile.png output.txt
Yes, it will really call the file output.txt.txt (tesseract adds .txt to whatever output name you give it), but it’s an easier command to remember.
You might want to remove ligatures if tesseract is turning “fi” into the single ligature character “ﬁ”.
tesseract myfile.png output.txt -c tessedit_char_blacklist=ﬁﬂ
Convert MANY images to text (OCR)
You can scroll down to where I talk about bash scripts below, but Marcel had the great idea of doing it in a notebook: use Python’s glob to find all of the files, then use ! to jump out to the command line and run tesseract on each one.
import glob

pngs = glob.glob('your_folder_with_pngs/*.png')
for png in pngs:
    # [:-4] strips ".png" so tesseract writes file.txt next to each image
    !tesseract {png} {png[:-4]} -c tessedit_char_blacklist=ﬁﬂ
Convert a pdf into a 300dpi PNG in preparation for OCR
If you have an image or a PDF with bad OCR (like we did with the franklin example), you can convert the PDF to a PNG and then use tesseract.
If it’s just one page, or you’re okay with it creating multiple image files:
convert -density 300 file.pdf output.png
Multi-page PDFs on the command line
If it’s a multi-page PDF and you want it to only create one PNG file (which is usually better, but much much much slower when using tesseract), you need to change the command a little:
convert -density 300 franklin.pdf -append output.png
Fair warning, this can create REALLY BIG files and be REALLY SLOW with tesseract.
If tesseract tells you that your file is too large (larger than 32767x32767), you can use ImageMagick to convert your png into a new png that will fit.
convert output.png -resize 32767x32767\> resized.png
This might make your text look bad, though, if it becomes really really tall and skinny.
Extract a few pages of a PDF
There are a few ways to do it, but this one uses Ghostscript (gs comes along with ImageMagick).
gs -dNOPAUSE -dBATCH -dFirstPage=2 -dLastPage=4 -sDEVICE=pdfwrite -sOutputFile=destination.pdf -f source.pdf
This will save pages 2-4 from source.pdf into destination.pdf.
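If you’d rather do the same thing in Python, the pypdf library can copy a range of pages into a new file. This is a sketch, assuming pypdf is installed; note that its page list is zero-indexed.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("source.pdf")
writer = PdfWriter()

# reader.pages[1:4] is pages 2-4 (zero-indexed)
for page in reader.pages[1:4]:
    writer.add_page(page)

with open("destination.pdf", "wb") as f:
    writer.write(f)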
Multi-page PDFs using the Python bindings for tesseract
Making sure all of your installations are in the right place can be a little tough: the newest ImageMagick might not play nicely with Wand and pytesseract, but you can cross your fingers and/or email me about it.
import io

from PIL import Image
import pytesseract
from wand.image import Image as wi

# render the PDF at 300dpi and convert every page to JPEG
pdf = wi(filename="1812490.pdf", resolution=300)
pdf_image = pdf.convert('jpeg')

images = []
for img in pdf_image.sequence:
    page = wi(image=img)
    images.append(page.make_blob('jpeg'))

# OCR each page image and collect the text
recognized_text = []
for image in images:
    im = Image.open(io.BytesIO(image))
    text = pytesseract.image_to_string(im, lang='eng')
    recognized_text.append(text)

print(recognized_text)
Use a zone file to extract text only in certain regions in an image
Use Kull to generate a zone file, which lists the coordinates of the regions you want to extract text from.
If you’re on a Mac you can install and use tesseract-uzn.
tesseract-uzn myuzn.uzn image.png
If you’re using Windows, you need to give your .uzn file the same name as your image and add -psm 4 to your tesseract command.
tesseract image.png stdout -psm 4
Download a list of URLs
You’ll need to install wget if it isn’t on your system. OS X users can use brew install wget.
Let’s say I have a file named urls.txt that looks like this:
https://example.com/001.pdf
https://example.com/301.pdf
https://example.com/041.pdf
https://example.com/AB1.pdf
I can download every file using the following command.
wget -i urls.txt
Just make sure urls.txt has one URL per line.
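If you’d rather do the downloading from a notebook instead of with wget, a loop with requests does the same job. A sketch, assuming requests is installed and urls.txt is the file above.
import requests

# read one URL per line from urls.txt
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    filename = url.split("/")[-1]  # e.g. 001.pdf
    response = requests.get(url)
    with open(filename, "wb") as out:
        out.write(response.content)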
Loop through files and run a command (OS X)
For example, this one converts all of the PDFs in the current directory into text files. You can find and run an example in the /keno/ directory.
#!/bin/bash
FILES=*.pdf
for f in $FILES
do
  echo "Processing $f..."
  pdf2txt.py "$f" -o "$f.txt"
done
It’s a shell script, which you can use to automate the command line. You save it as a .sh file and run it with bash yourscript.sh.
Loop through files and run a command (Windows)
For example, this one converts all of the PDFs in the current directory into text files. I think this works, but we can work on it more during lab.
@echo off
setlocal EnableDelayedExpansion
for %%a in (*.pdf) do (
  pdf2txt.py "%%a" -o "%%a.txt"
)
It’s a batch file, which you can use to automate the command line. You save it as a .bat file and run it with yourscript.bat.