Breaking CAPTCHAs with Selenium or Playwright
This page is about breaking simple, home-grown CAPTCHAs. I also have a writeup of solving reCAPTCHA and hCaptcha with NopeCHA and Playwright.
Visiting the page
I made a page that generates CAPTCHAs, in case you want some to play around with.
You probably won't be doing this with BeautifulSoup, since CAPTCHAs are usually JavaScript-based and need a real browser to render.
Saving the CAPTCHA image
Both Selenium and Playwright will grab an element off the page and save a screenshot of it for you. In theory you can also stream it as a bytes object instead of saving it, but pytesseract doesn't seem to like that without a lot of fiddling around.
One thing to note is that in the next step we remove 2 pixels from each edge of the image (top, left, bottom, and right). This is because there's a thin border that bleeds through into the screenshot and makes the CAPTCHA harder to read.
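Here's a rough sketch of the screenshot step for both libraries. The helper names, the URL, and the `img.captcha` selector are all made up for illustration, so swap in whatever your target page actually uses.

```python
def save_with_playwright(page, selector, path="captcha.png"):
    """Screenshot only the element matching `selector` and write it to `path`."""
    page.locator(selector).screenshot(path=path)
    return path


def save_with_selenium(driver, selector, path="captcha.png"):
    """Same idea for Selenium: find the element, then screenshot just that element."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    element = driver.find_element("css selector", selector)
    element.screenshot(path)
    return path


def run_playwright_demo():
    """Not called automatically; needs Playwright and a browser installed."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com/captcha")  # hypothetical URL
        save_with_playwright(page, "img.captcha")  # hypothetical selector
        browser.close()
```

Calling `run_playwright_demo()` should leave a `captcha.png` in the working directory, assuming the URL and selector point at a real CAPTCHA image.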
Your downloaded image might look like this:
Deskew and clean
We'll use the ImageMagick library Wand along with the deskew library to turn the image into something a little easier to run text recognition on.
I like to use Wand because it's... well, it's a pain, but it's a little nicer than the other libraries that the deskew documentation has examples for. If you can't get it to work on your machine, though, you can fall back on the deskew documentation.
from deskew import determine_skew
from wand.image import Image
import numpy as np

with Image(filename='captcha.png') as image:
    with image.clone() as cleaned:
        # Pull a couple pixels off the edge to remove border noise
        cleaned.crop(2, 2, image.width - 2, image.height - 2)
        # Remove anything that isn't the text
        cleaned.trim()
        # Remove rotation (determine_skew returns None if it can't find an angle)
        angle = determine_skew(np.array(cleaned))
        if angle is not None:
            print("Rotating", angle, "degrees")
            cleaned.rotate(-angle, 'white', True)
        # Save
        cleaned.save(filename='captcha-cleaned.png')
Before cleaning
After cleaning
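The 2-pixel crop in the cleaning step is easy to get wrong, because the last two arguments are the right and bottom coordinates, not a width and height. Here's a tiny sketch of that arithmetic. The `inset_box` helper and the 200x60 image size are my own placeholders, not part of Wand.

```python
def inset_box(width, height, margin=2):
    """Return (left, top, right, bottom) for a box shrunk by `margin` on every side."""
    if margin < 0:
        raise ValueError("margin must not be negative")
    if 2 * margin >= min(width, height):
        raise ValueError("margin is too large for this image")
    # right and bottom are coordinates, not a width and a height
    return (margin, margin, width - margin, height - margin)


left, top, right, bottom = inset_box(200, 60)
print(left, top, right, bottom)  # 2 2 198 58
print("cropped size:", right - left, "x", bottom - top)  # cropped size: 196 x 56
```

With Wand you could then write `cleaned.crop(*inset_box(image.width, image.height))`, which passes the same four values as the cleaning code above.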
Breaking the CAPTCHA
Pytesseract
Now we'll use pytesseract to break the CAPTCHA. It's the best balance of accuracy and ease of use that I've found.
First, we'll install tesseract and pytesseract.
Then we'll use them.
import pytesseract
guess = pytesseract.image_to_string('captcha-cleaned.png').strip()
print("The guess is", guess)
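Calling `.strip()` only handles whitespace at the ends, and OCR output sometimes has stray spaces or punctuation in the middle too. Here's a small cleanup sketch, assuming your CAPTCHA only ever contains letters and digits (that's my assumption, so adjust `allowed` if yours is different). The `normalize_guess` name is made up for this example.

```python
import re


def normalize_guess(raw, allowed="A-Za-z0-9"):
    """Drop every character outside `allowed` (the body of a regex character class)."""
    return re.sub(f"[^{allowed}]", "", raw)


print(normalize_guess(" xK4 p9\n"))  # xK4p9
```

You'd run the raw pytesseract result through this before typing it into the form.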
Keras OCR
Keras OCR is a lot fancier, and it works a lot better for edge cases.
First, we'll do the installation.
# macOS (or at least M1 macs)
pip install tensorflow-macos keras-ocr
# Windows
pip install tensorflow keras-ocr
Now we'll use them.