How to OCR PDF files¶

"Optical character recognition" (OCR) is a fancy way of saying "get the text out of PDFs." With some PDFs you can just copy and paste, but with image-based PDFs you need this extra step to get the text out. It doesn't work perfectly, but it's better than nothing, especially when you have a long, long document and you just want a rough idea of what's in it.

OCRmyPDF¶

To convert image-based PDFs to text, OCRmyPDF is probably the best compromise between speed, ease of installation, and usability.

It also has the added benefit of being able to create a text layer on the PDF, so that if positioning matters (maybe there's tabular data?) you'll be able to take advantage of other tools.

Documentation¶

Installation¶

The first install line installs the base OCRmyPDF software. The pip install then allows you to use it from Python. You need both!

OS X

brew install ocrmypdf
pip install ocrmypdf

Windows

scoop install ocrmypdf
pip install ocrmypdf
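
Since you need both pieces, it can help to check what's actually visible on your machine before you start. This is a minimal sketch using only the standard library. `check_ocrmypdf_install` is a made-up helper name (not part of OCRmyPDF), and checking for `tesseract` assumes the system install put the Tesseract engine on your PATH.

```python
import importlib.util
import shutil


def check_ocrmypdf_install():
    """Report which pieces of an OCRmyPDF setup are visible from this Python."""
    return {
        # True if an `ocrmypdf` executable is on the PATH
        "ocrmypdf_command": shutil.which("ocrmypdf") is not None,
        # True if the Tesseract OCR engine (used by OCRmyPDF) is on the PATH
        "tesseract_command": shutil.which("tesseract") is not None,
        # True if `import ocrmypdf` would find the package (the pip install)
        "python_package": importlib.util.find_spec("ocrmypdf") is not None,
    }


print(check_ocrmypdf_install())
```

If any of the values come back `False`, that's probably the install step to revisit.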

Usage¶

You actually run this one from the command line!

ocrmypdf players-scan.pdf players-scan-ocr.pdf

This command will output a PDF with a text layer on it.
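
If you have a whole folder of scans, you probably don't want to type that command over and over. Here's a rough sketch that builds the same command for each PDF and runs it with `subprocess`. The helper names (`build_ocr_command`, `ocr_folder`) and the `-ocr.pdf` naming scheme are my own assumptions, not anything OCRmyPDF requires.

```python
import subprocess
from pathlib import Path


def build_ocr_command(source, deskew=False):
    """Return (command, output_path) for OCRing one PDF with the ocrmypdf CLI."""
    source = Path(source)
    # players-scan.pdf -> players-scan-ocr.pdf, in the same folder
    output = source.with_name(f"{source.stem}-ocr.pdf")
    command = ["ocrmypdf"]
    if deskew:
        command.append("--deskew")
    command += [str(source), str(output)]
    return command, output


def ocr_folder(folder):
    """OCR every PDF in a folder, skipping files that look like earlier output."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("-ocr"):
            continue
        command, output = build_ocr_command(pdf)
        # check=False so one bad PDF does not stop the whole batch
        result = subprocess.run(command, check=False)
        print(pdf.name, "->", output.name, "exit code", result.returncode)


# Show the command that would be run for a single file
command, output = build_ocr_command("players-scan.pdf")
print(" ".join(command))
```

Calling `ocr_folder("scans")` would then work through every PDF in a (hypothetical) `scans` folder.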

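Alternatively, you can run the same command from a notebook by putting a ! in front of it, or call it directly from Python if you also did the pip install ocrmypdf step. The --deskew option straightens out crooked scans before doing the OCR.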

In [13]:

            
!ocrmypdf --deskew players-scan.pdf players-scan-ocr.pdf

Opened a file
Scanning contents: 100%|███████████████████████| 1/1 [00:00<00:00, 219.34page/s]
Opened a file
    1 Opened a file                                                             
    1 Opened a file                                                             
    1 Opened a file                                                             
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:04<00:00,  4.89s/page]
Postprocessing...
Opened a file
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  3.65page/s]
Opened a file
Opened a file
Opened a file
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|█████████████████████████| 1/1 [00:00<00:00, 83.92image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.21 savings: 17.3%
Opened a file
Output file is a PDF/A-2B (as expected)
Opened a file
Opened a file

In [15]:

            
import ocrmypdf

ocrmypdf.ocr('players-scan.pdf', 'players-scan-ocr.pdf', deskew=True)

Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 15.32page/s]
OCR: 100%|██████████| 1.0/1.0 [00:05<00:00,  5.93s/page]
PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00,  4.31page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 57.10image/s]
JBIG2: 0item [00:00, ?item/s]

Out[15]:

<ExitCode.ok: 0>

You can then use pdfminer.six to extract the text from the PDF.

In [9]:

            
# Use pdfminer.six to extract text from players-scan-ocr.pdf
from pdfminer.high_level import extract_text

text = extract_text("players-scan-ocr.pdf")
text[:100]

Out[9]:

"Player \nRhett  Bomar \nJoe  Webb \nChristian  Ponder \nAdrian  Peterson \nLorenzo  Booker \nRyan  D'lmper"

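The extracted text comes back with trailing spaces and doubled-up spaces between words, so you'll probably want to tidy it before doing anything else. A minimal sketch, assuming one item per line is what you're after (`clean_lines` is just a name I picked):

```python
def clean_lines(text):
    """Split extracted text into lines, collapsing repeated spaces and dropping blanks."""
    lines = []
    for raw in text.splitlines():
        # "Rhett  Bomar " -> "Rhett Bomar"
        line = " ".join(raw.split())
        if line:
            lines.append(line)
    return lines


# The start of the output shown above
sample = "Player \nRhett  Bomar \nJoe  Webb \nChristian  Ponder \n"
print(clean_lines(sample))
```

With the real thing you'd call `clean_lines(text)` on the full string from `extract_text`.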
Speed¶

OCRmyPDF took around 7 seconds.
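
If you'd like to time things on your own files, here's a rough sketch of a stopwatch helper. `timed` is a made-up name, and the `sum` call is only a stand-in workload so the example runs without any OCR libraries installed.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs anywhere; swap in a real OCR call, e.g.
# timed(ocrmypdf.ocr, "players-scan.pdf", "players-scan-ocr.pdf", deskew=True)
total, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.2f}s")
```

Your numbers will depend on your machine and your scans, so treat any timing as a ballpark.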

PyTesseract¶

Pytesseract only works on images, not PDFs. Don't use it.

Tika¶

I love love love Tika, it can read any document ever in all of history. I did a large writeup and install walkthrough here, and you can see how to make it work with other languages here.

EasyOCR¶

Don't use this! It's only here because it shows up in a lot of Google searches. EasyOCR does not support PDF files. It only supports images. As a result it's not good for us!

Documentation¶

Installation¶

pip install easyocr

EasyOCR will also need to download models the first time you use it.

Usage¶

EasyOCR only works with images, not PDFs! While you could convert your PDF into an image, it's honestly too much of a pain to deal with. If you really really want to use EasyOCR, though, you can use the pdf2image library to convert your PDF into a series of images and run OCR on each page.
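
Here's roughly what that pdf2image route could look like. It's a sketch, not something I've tuned: the function names and the `dpi=300` setting are my own choices, pdf2image needs poppler installed to work, and the imports live inside the function so you can define it before installing everything.

```python
def ocr_pdf_with_easyocr(pdf_path, languages=("en",), dpi=300):
    """OCR each page of a PDF with EasyOCR; returns one list of strings per page."""
    # Imported here so the function can be defined without the libraries installed
    import easyocr
    import numpy as np
    from pdf2image import convert_from_path

    reader = easyocr.Reader(list(languages))
    # convert_from_path returns one PIL image per page
    pages = convert_from_path(pdf_path, dpi=dpi)
    # readtext accepts a numpy array, so convert each PIL image first
    return [reader.readtext(np.array(page), detail=0) for page in pages]


def pages_to_text(pages):
    """Join per-page lists of strings into one string, with a blank line between pages."""
    return "\n\n".join("\n".join(page) for page in pages)


# Small made-up input, just to show the shape pages_to_text expects
print(pages_to_text([["Player", "Pos"], ["Rhett Bomar"]]))
```

You would then call something like `pages_to_text(ocr_pdf_with_easyocr('players-scan.pdf'))`, and expect it to be slow on long documents.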

In [11]:

            
import easyocr

# I'm interested in english only
# Languages are at https://www.jaided.ai/easyocr/ under "Supported Languages"
reader = easyocr.Reader(['en'])

# detail=0 to mean, "only give me the text, not the bounding boxes"
# And yes, we're doing a jpg here instead of the pdf so it can be happy
result = reader.readtext('players-scan.jpg', detail=0)

[2022-12-04 21:48:06,916] [ WARNING] easyocr.py:74 - CUDA not available - defaulting to CPU. Note: This module is much faster with a GPU.

Because we used detail=0, the result is a simple list of strings. If you remove detail=0, EasyOCR will also provide the bounding box coordinates and a confidence score for each piece of text it finds.
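
Without detail=0, each item in the result is a (box, text, confidence) triple, which lets you throw away the junk. A minimal sketch, assuming a cutoff of 0.5 (an arbitrary number, not a recommendation). The sample detections are made up for illustration and are not real output from my scan.

```python
def confident_text(detections, min_confidence=0.5):
    """Keep the text of detections whose confidence is at least min_confidence."""
    return [text for _box, text, confidence in detections if confidence >= min_confidence]


# Made-up detections in EasyOCR's (box, text, confidence) shape, for illustration only
sample = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "Player", 0.98),
    ([[100, 10], [140, 10], [140, 30], [100, 30]], "Pos", 0.91),
    ([[150, 10], [160, 10], [160, 30], [150, 30]], "|", 0.12),
]
print(confident_text(sample))
```

On real data you'd pass in `reader.readtext('players-scan.jpg')` instead of `sample`.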

In [19]:

            
result[:10]

Out[19]:

['Player',
 'Pos',
 'Status',
 'Ht',
 'Wt',
 'DOB',
 'Rhett Bomar',
 'Quarterback',
 'Active',
 "6'2'"]

Speed¶

On my sample page, EasyOCR took 1m25s.

PaddleOCR¶

PaddleOCR is incredibly fancy, but it's pretty hard to install.

Documentation¶

GitHub
Quickstart
Tutorials (maybe this link will work better in the future)

Also, it can apparently read tables! It looks like it can currently only do one at a time, though, so OCRmyPDF + Camelot is probably your best bet.
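
For the OCRmyPDF + Camelot route, a rough sketch might look like the one below. It assumes you've installed Camelot (pip install camelot-py) and that the PDF already has a text layer from OCRmyPDF. `extract_tables` is a name I made up, and flavor="stream" is a guess that suits tables without ruled lines, so you may need to experiment.

```python
def extract_tables(pdf_path, pages="1"):
    """Read tables from a PDF that already has a text layer; returns pandas DataFrames."""
    # Imported here so the sketch can be defined without Camelot installed
    import camelot

    # flavor="stream" guesses columns from whitespace instead of ruled lines
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")
    return [table.df for table in tables]
```

Calling `extract_tables('players-scan-ocr.pdf')` should then give you one DataFrame per table Camelot finds on the first page.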

Installation¶

More details here

I had to use PaddleOCR 2.5 because 2.6 was giving errors when working with PDFs (something about pdf.pageCount).

# I had to install swig, maybe you'll need to install other things?
# Windows people: https://pymupdf.readthedocs.io/en/latest/installation.html
brew install swig pymupdf
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
pip install paddleocr

PaddleOCR will also need to download models the first time you use it.

Usage¶

In [2]:

            
# Newer versions of PyMuPDF renamed pageCount to page_count
# and getPixmap to get_pixmap, but PaddleOCR still tries
# to use the old ones! By running these three lines we
# make sure the old names are still okay to use.

import fitz
fitz.Document.pageCount = fitz.Document.page_count
fitz.Page.getPixmap = fitz.Page.get_pixmap

# Okay now we can actually use PaddleOCR
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en', page_num=2)
result = ocr.ocr('players-scan.pdf', cls=True)

[2022/12/04 21:34:59] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, image_dir=None, page_num=2, det_algorithm='DB', det_model_dir='/Users/soma/.paddleocr/whl/det/en/en_PP-OCRv3_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/Users/soma/.paddleocr/whl/rec/en/en_PP-OCRv3_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='/Users/soma/.pyenv/versions/3.10.3/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paddleocr/ppocr/utils/en_dict.txt', use_space_char=True, vis_font_path='./doc/fonts/simfang.ttf', drop_score=0.5, e2e_algorithm='PGNet', e2e_model_dir=None, e2e_limit_side_len=768, e2e_limit_type='max', e2e_pgnet_score_thresh=0.5, e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_pgnet_valid_set='totaltext', e2e_pgnet_mode='fast', use_angle_cls=True, cls_model_dir='/Users/soma/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_image_shape='3, 48, 192', label_list=['0', '180'], cls_batch_num=6, cls_thresh=0.9, enable_mkldnn=False, cpu_threads=10, use_pdserving=False, warmup=False, sr_model_dir=None, sr_image_shape='3, 32, 128', sr_batch_num=1, draw_img_save_dir='./inference_results', save_crop_res=False, crop_res_save_dir='./output', use_mp=False, total_process_num=1, process_id=0, benchmark=False, save_log_path='./log_output/', show_log=True, use_onnx=False, 
output='./output', table_max_len=488, table_algorithm='TableAttn', table_model_dir=None, merge_no_span_structure=True, table_char_dict_path=None, layout_model_dir=None, layout_dict_path=None, layout_score_threshold=0.5, layout_nms_threshold=0.5, kie_algorithm='LayoutXLM', ser_model_dir=None, re_model_dir=None, use_visual_backbone=True, ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ocr_order_method=None, mode='structure', image_orientation=False, layout=True, table=True, ocr=True, recovery=False, use_pdf2docx_api=False, lang='en', det=True, rec=True, type='ocr', ocr_version='PP-OCRv3', structure_version='PP-StructureV2')
[2022/12/04 21:35:00] ppocr DEBUG: dt_boxes num : 311, elapse : 0.4760923385620117
[2022/12/04 21:35:02] ppocr DEBUG: cls num  : 311, elapse : 1.4697859287261963
[2022/12/04 21:35:17] ppocr DEBUG: rec_res num  : 311, elapse : 15.190897941589355
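
The renaming trick at the top of that cell overwrites the old names no matter what. If you'd rather only add them when they're missing, here's a minimal sketch of a more cautious version. `alias_if_missing` is a made-up helper, and `FakeDocument` is a stand-in class so the example runs without PyMuPDF installed.

```python
def alias_if_missing(cls, old_name, new_name):
    """Expose cls.new_name under old_name, unless old_name already exists."""
    if not hasattr(cls, old_name):
        setattr(cls, old_name, getattr(cls, new_name))


# Stand-in class so the sketch runs without PyMuPDF; with the real library you
# would call alias_if_missing(fitz.Document, "pageCount", "page_count") instead
class FakeDocument:
    page_count = 3


alias_if_missing(FakeDocument, "pageCount", "page_count")
print(FakeDocument.pageCount)
```

The same call with `fitz.Page`, "getPixmap" and "get_pixmap" would cover the second rename.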

result looks pretty complicated at first glance because... well, it is! It's a list of lists of lists: along with the word itself, each entry includes bounding coordinates and a confidence score. Below I've turned the words into a single list, but if you're interested in more complex transformations check the documentation.

In [4]:

            
# result[0] is the first page; each entry looks like [box, (text, confidence)]
words = [line[1][0] for line in result[0]]
words[:10]

Out[4]:

['Player',
 'Pos',
 'Status',
 'Ht',
 'Wt DOB',
 'Rhett Bomar',
 'Quarterback',
 'Active',
 "6'2'",
 '215']
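
Since each entry also carries its bounding box, you can use the coordinates to stitch words back into rows, which is handy for table-like pages. This is a rough sketch, assuming each box is four [x, y] corners starting at the top-left and that words on the same row sit within about 10 pixels of each other vertically. The sample boxes are invented for illustration, not taken from my scan.

```python
def group_into_rows(detections, tolerance=10):
    """Group detections into rows using the top-left corner of each bounding box."""
    rows = []
    # Sort top to bottom so rows are built in reading order
    for box, (text, _confidence) in sorted(detections, key=lambda d: d[0][0][1]):
        x, y = box[0]
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order cells left to right and keep only the text
    return [[text for _x, text in sorted(row["cells"])] for row in rows]


# Invented detections in the [box, (text, confidence)] shape, for illustration only
sample = [
    ([[200, 12], [240, 12], [240, 30], [200, 30]], ("Pos", 0.99)),
    ([[20, 10], [90, 10], [90, 30], [20, 30]], ("Player", 0.99)),
    ([[20, 50], [120, 50], [120, 70], [20, 70]], ("Rhett Bomar", 0.97)),
    ([[200, 51], [300, 51], [300, 71], [200, 71]], ("Quarterback", 0.98)),
]
print(group_into_rows(sample))
```

On the real output you'd call `group_into_rows(result[0])` and fiddle with `tolerance` until the rows look right.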

Speed¶

On my sample page, PaddleOCR took about 20s.