How to OCR PDF files¶
"Optical character recognition" is a fancy way of saying "get the text out of PDFs." With some PDFs you can just cut and paste, but with image-based PDFs you need this extra step to get the text out. It doesn't work perfectly, but it's better than nothing, especially when you have a long long document and you just want to get a rough idea of what's in it.
OCRmyPDF¶
To convert image-based PDFs to text, OCRmyPDF is probably the best compromise between speed, ease of installation, and usability.
It also has the added benefit of being able to create a text layer on the PDF, so that if positioning matters (maybe there's tabular data?) you'll be able to take advantage of other tools.
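To make "positioning matters" a bit more concrete, here is a minimal sketch of reading text along with its coordinates from a PDF that already has a text layer, then grouping it into rows. It assumes pdfminer.six is installed. `group_into_rows`, `positioned_text`, and the 3-point tolerance are made-up names and values for illustration, and real tables will usually need a dedicated tool.

```python
def group_into_rows(items, tolerance=3.0):
    """Group (x, y, text) items into rows of text, top of the page first.

    Items whose y position is within `tolerance` points of the row's first
    item are treated as the same row. PDF y coordinates grow upward, so a
    larger y means higher on the page.
    """
    rows = []
    # Visit items from the top of the page to the bottom
    for x, y, text in sorted(items, key=lambda item: -item[1]):
        if rows and abs(rows[-1]["y"] - y) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order cells left to right and drop the coordinates
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def positioned_text(pdf_path, page_number=0):
    """Yield (x0, y0, text) for each text line pdfminer.six finds on one page."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        x0, y0, _, _ = line.bbox
                        yield x0, y0, line.get_text().strip()
```

With an OCRed file in hand, `group_into_rows(positioned_text("players-scan-ocr.pdf"))` would give a list of rows, each a list of strings ordered left to right.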
Documentation¶
Installation¶
The first install line installs the base OCRmyPDF software. The pip install then allows you to use it from Python. You need both!
macOS
brew install ocrmypdf
pip install ocrmypdf
Windows
scoop install ocrmypdf
pip install ocrmypdf
Usage¶
You actually run this one from the command line!
ocrmypdf players-scan.pdf players-scan-ocr.pdf
This command will output a PDF with a text layer on it.
Alternatively, you can run the same command from a notebook by putting a ! in front of it, or call OCRmyPDF from Python if you also did the pip install ocrmypdf step.
!ocrmypdf --deskew players-scan.pdf players-scan-ocr.pdf
Opened a file Scanning contents: 100%|███████████████████████| 1/1 [00:00<00:00, 219.34page/s] Opened a file 1 Opened a file 1 Opened a file 1 Opened a file OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:04<00:00, 4.89s/page] Postprocessing... Opened a file PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00, 3.65page/s] Opened a file Opened a file Opened a file Recompressing JPEGs: 0image [00:00, ?image/s] Deflating JPEGs: 100%|█████████████████████████| 1/1 [00:00<00:00, 83.92image/s] JBIG2: 0item [00:00, ?item/s] Optimize ratio: 1.21 savings: 17.3% Opened a file Output file is a PDF/A-2B (as expected) Opened a file Opened a file
import ocrmypdf
ocrmypdf.ocr('players-scan.pdf', 'players-scan-ocr.pdf', deskew=True)
Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 15.32page/s] OCR: 100%|██████████| 1.0/1.0 [00:05<00:00, 5.93s/page] PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00, 4.31page/s] Recompressing JPEGs: 0image [00:00, ?image/s] Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 57.10image/s] JBIG2: 0item [00:00, ?item/s]
<ExitCode.ok: 0>
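Once the Python call works for one file, looping over a folder is a small step. This is a minimal sketch, assuming every PDF in the folder is a scan that still needs OCR. `ocr_output_path`, `ocr_folder`, and the `-ocr` naming scheme are my own inventions here (the naming just mirrors the filenames used above), not something OCRmyPDF provides.

```python
from pathlib import Path


def ocr_output_path(input_pdf, suffix="-ocr"):
    """players-scan.pdf -> players-scan-ocr.pdf, in the same folder."""
    input_pdf = Path(input_pdf)
    return input_pdf.with_name(f"{input_pdf.stem}{suffix}{input_pdf.suffix}")


def ocr_folder(folder, suffix="-ocr"):
    """Run OCRmyPDF on every PDF in `folder` that hasn't been processed yet."""
    import ocrmypdf  # needs both installation steps described above

    for input_pdf in sorted(Path(folder).glob("*.pdf")):
        if input_pdf.stem.endswith(suffix):
            continue  # this is one of our own output files
        output_pdf = ocr_output_path(input_pdf, suffix)
        if output_pdf.exists():
            continue  # already done on a previous run
        ocrmypdf.ocr(input_pdf, output_pdf, deskew=True)
```

Skipping files that already have an output makes it safe to re-run `ocr_folder("scans")` after an interruption without redoing finished documents.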
You can then use pdfminer.six to extract the text from the PDF.
# Use pdfminer.six to extract text from players-scan-ocr.pdf
from pdfminer.high_level import extract_text
text = extract_text("players-scan-ocr.pdf")
text[:100]
"Player \nRhett Bomar \nJoe Webb \nChristian Ponder \nAdrian Peterson \nLorenzo Booker \nRyan D'lmper"
Speed¶
On my sample page, OCRmyPDF took around 7 seconds.
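The timings on this page are rough wall-clock numbers. If you want to compare tools on your own documents, something like the sketch below would do. `timed` is a made-up helper, and the `sum` call is only a stand-in so the example runs without any OCR library installed.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; swap in an OCR call such as
# timed(ocrmypdf.ocr, 'players-scan.pdf', 'players-scan-ocr.pdf', deskew=True)
total, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.2f}s")
```

Keep in mind that EasyOCR and PaddleOCR download models on first use, so the first run of either will look slower than it really is.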
PyTesseract¶
Pytesseract only works on images, not PDFs. Don't use it.
EasyOCR¶
Don't use this! It's only here because it shows up in a lot of Google searches. EasyOCR does not support PDF files. It only supports images. As a result it's not good for us!
Documentation¶
Installation¶
pip install easyocr
EasyOCR will also need to download models the first time you use it.
Usage¶
EasyOCR only works with images, not PDFs! While you could convert your PDF into an image, it's honestly too much of a pain to deal with. If you really really want to use EasyOCR, though, you can use the pdf2image library to convert your PDF into a series of images and run OCR on each page.
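For the really-really-want-to case, the PDF-to-images route might look something like this. It's a sketch rather than a tested recipe: it assumes pdf2image (which needs poppler installed), numpy, and EasyOCR are all available, and `ocr_images` and `easyocr_pdf` are made-up helper names.

```python
def ocr_images(images, read_text):
    """Apply read_text to each page image; return one list of strings per page."""
    return [read_text(image) for image in images]


def easyocr_pdf(pdf_path, languages=("en",)):
    """Convert a PDF to page images with pdf2image, then OCR each with EasyOCR."""
    # Third-party imports live here so ocr_images can be reused without them
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path  # requires poppler to be installed

    reader = easyocr.Reader(list(languages))
    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    # EasyOCR accepts numpy arrays, so convert each PIL image first
    return ocr_images(pages, lambda page: reader.readtext(np.array(page), detail=0))
```

`easyocr_pdf("players-scan.pdf")` would return one list of strings per page. Expect it to be slow without a GPU.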
import easyocr
# I'm interested in English only
# Languages are at https://www.jaided.ai/easyocr/ under "Supported Languages"
reader = easyocr.Reader(['en'])
# detail=0 to mean, "only give me the text, not the bounding boxes"
# And yes, we're doing a jpg here instead of the pdf so it can be happy
result = reader.readtext('players-scan.jpg', detail=0)
[2022-12-04 21:48:06,916] [ WARNING] easyocr.py:74 - CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.
Because we used detail=0, the result is a simple list of strings. If you remove detail=0, EasyOCR will provide coordinates for each word.
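Without detail=0, each item should come back as a (bounding box, text, confidence) tuple, which means you can throw away the guesses EasyOCR isn't sure about. A minimal sketch: `confident_text` and the 0.5 cutoff are made up, and the sample detections below are invented to show the shape, not real output.

```python
def confident_text(detections, min_confidence=0.5):
    """Keep the text of detections at or above min_confidence."""
    return [
        text
        for _box, text, confidence in detections
        if confidence >= min_confidence
    ]


# Made-up detections: the coordinates and confidence values are invented
# for illustration, not real EasyOCR output.
detections = [
    ([[10, 10], [80, 10], [80, 30], [10, 30]], "Player", 0.99),
    ([[100, 10], [140, 10], [140, 30], [100, 30]], "Pos", 0.42),
    ([[10, 40], [120, 40], [120, 60], [10, 60]], "Rhett Bomar", 0.87),
]
print(confident_text(detections))  # ['Player', 'Rhett Bomar']
```

With real data you would pass in `reader.readtext('players-scan.jpg')` instead of the invented list.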
result[:10]
['Player', 'Pos', 'Status', 'Ht', 'Wt', 'DOB', 'Rhett Bomar', 'Quarterback', 'Active', "6'2'"]
Speed¶
On my sample page, EasyOCR took 1m25s.
PaddleOCR¶
PaddleOCR is incredibly fancy, but it's pretty hard to install.
Documentation¶
- GitHub
- Quickstart
- Tutorials (maybe this link will work better in the future)
Also, it can apparently read tables! It looks like it can currently only do one at a time, though, so OCRmyPDF + Camelot is probably your best bet.
Installation¶
I had to use PaddleOCR 2.5 because 2.6 was giving errors when working with PDFs (something about pdf.pageCount).
# I had to install swig, maybe you'll need to install other things?
# Windows people: https://pymupdf.readthedocs.io/en/latest/installation.html
brew install swig pymupdf
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
pip install paddleocr
PaddleOCR will also need to download models the first time you use it.
Usage¶
# Newer versions of PyMuPDF renamed pageCount to page_count
# and getPixmap to get_pixmap, but PaddleOCR still tries
# to use the old ones! By running these three lines we
# make sure the old names are still okay to use.
import fitz
fitz.Document.pageCount = fitz.Document.page_count
fitz.Page.getPixmap = fitz.Page.get_pixmap
# Okay now we can actually use PaddleOCR
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='en', page_num=2)
result = ocr.ocr('players-scan.pdf', cls=True)
[2022/12/04 21:34:59] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, image_dir=None, page_num=2, det_algorithm='DB', det_model_dir='/Users/soma/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/Users/soma/.paddleocr/whl/rec/en/en_PP-OCRv3_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='/Users/soma/.pyenv/versions/3.10.3/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paddleocr/ppocr/utils/en_dict.txt', use_space_char=True, vis_font_path='./doc/fonts/simfang.ttf', drop_score=0.5, e2e_algorithm='PGNet', e2e_model_dir=None, e2e_limit_side_len=768, e2e_limit_type='max', e2e_pgnet_score_thresh=0.5, e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_pgnet_valid_set='totaltext', e2e_pgnet_mode='fast', use_angle_cls=True, cls_model_dir='/Users/soma/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_image_shape='3, 48, 192', label_list=['0', '180'], cls_batch_num=6, cls_thresh=0.9, enable_mkldnn=False, cpu_threads=10, use_pdserving=False, warmup=False, sr_model_dir=None, sr_image_shape='3, 32, 128', sr_batch_num=1, draw_img_save_dir='./inference_results', save_crop_res=False, crop_res_save_dir='./output', use_mp=False, total_process_num=1, process_id=0, benchmark=False, save_log_path='./log_output/', show_log=True, use_onnx=False, 
output='./output', table_max_len=488, table_algorithm='TableAttn', table_model_dir=None, merge_no_span_structure=True, table_char_dict_path=None, layout_model_dir=None, layout_dict_path=None, layout_score_threshold=0.5, layout_nms_threshold=0.5, kie_algorithm='LayoutXLM', ser_model_dir=None, re_model_dir=None, use_visual_backbone=True, ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ocr_order_method=None, mode='structure', image_orientation=False, layout=True, table=True, ocr=True, recovery=False, use_pdf2docx_api=False, lang='en', det=True, rec=True, type='ocr', ocr_version='PP-OCRv3', structure_version='PP-StructureV2') [2022/12/04 21:35:00] ppocr DEBUG: dt_boxes num : 311, elapse : 0.4760923385620117 [2022/12/04 21:35:02] ppocr DEBUG: cls num : 311, elapse : 1.4697859287261963 [2022/12/04 21:35:17] ppocr DEBUG: rec_res num : 311, elapse : 15.190897941589355
result looks pretty complicated at first glance because... well, it is! It's a list of lists of lists: in addition to the word itself, each entry includes bounding coordinates and a confidence score. Below I've turned the words into a single list, but if you're interested in more complex transformations check the documentation.
# result[0] is the first page; each entry is [bounding box, (text, confidence)]
words = [line[1][0] for line in result[0]]
words[:10]
['Player', 'Pos', 'Status', 'Ht', 'Wt DOB', 'Rhett Bomar', 'Quarterback', 'Active', "6'2'", '215']
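If you want to hang on to the confidence scores too, the same structure can be unpacked a bit more explicitly. This sketch assumes each entry on a page has the [bounding box, (text, confidence)] shape used above. `words_with_confidence` is a made-up helper, and the sample page is invented for illustration rather than copied from real PaddleOCR output.

```python
def words_with_confidence(page_result, min_confidence=0.0):
    """Return (text, confidence) pairs from one page of PaddleOCR-style output."""
    pairs = []
    for _box, (text, confidence) in page_result:
        if confidence >= min_confidence:
            pairs.append((text, confidence))
    return pairs


# Made-up page: the coordinates and scores are invented for illustration.
page = [
    [[[10.0, 10.0], [80.0, 10.0], [80.0, 30.0], [10.0, 30.0]], ("Player", 0.998)],
    [[[100.0, 10.0], [140.0, 10.0], [140.0, 30.0], [100.0, 30.0]], ("Wt DOB", 0.71)],
]
print(words_with_confidence(page, min_confidence=0.9))  # [('Player', 0.998)]
```

With the real output you would call `words_with_confidence(result[0])` for the first page.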
Speed¶
On my sample page, PaddleOCR took about 20s.