Solving CAPTCHAs in Playwright with NopeCHA
The internet is full of robots trying to do bad things, and as a result we're constantly having to prove that we're human by completing CAPTCHAs.
But sometimes we want to be those robots doing bad things (but only in the spirit of journalism). Let's learn how to break CAPTCHAs when scraping so we can accomplish good and true and beautiful things.
Download the automation version of the NopeCHA CAPTCHA solver
We're going to be using NopeCHA for this. NopeCHA has two versions of the browser extension: the graphical version and the automation version. We're going to use the automation version because Playwright is a browser automation tool. I think the graphical one is just installed in your "normal" browser.
The extension will be extracted into the nopecha folder.
import json
import zipfile
import requests
# https://developers.nopecha.com/guides/extension/#loading-the-nopecha-extension-in-a-browser
url = 'https://github.com/NopeCHALLC/nopecha-extension/releases/latest/download/chromium_automation.zip'
# Download the zipped extension, stopping early if the request failed
response = requests.get(url)
response.raise_for_status()
with open('chromium_automation.zip', 'wb') as f:
    f.write(response.content)
# Unzip it into the nopecha folder
with zipfile.ZipFile('chromium_automation.zip', 'r') as zip_ref:
    zip_ref.extractall("nopecha")
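If the download goes wrong you might end up unzipping an error page instead of an extension. Here's a minimal sketch of a more defensive unpacking step. The function name `extract_extension` is made up for this example; it only checks that the file is a real zip and that a manifest.json comes out the other side.

```python
import zipfile
from pathlib import Path


def extract_extension(zip_path: str, dest: str = "nopecha") -> Path:
    """Unpack the extension zip and return the path to its manifest.json.

    Hypothetical helper: not part of NopeCHA or Playwright.
    """
    # A failed download is often an HTML page, which is not a valid zip
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip file")
    with zipfile.ZipFile(zip_path) as zip_ref:
        zip_ref.extractall(dest)
    # Every Chrome extension keeps a manifest.json at its top level
    manifest = Path(dest) / "manifest.json"
    if not manifest.is_file():
        raise FileNotFoundError(f"no manifest.json found in {dest}")
    return manifest
```

Calling `extract_extension('chromium_automation.zip')` would stand in for the plain `extractall` step.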
Update the API key
Services often require you to use an API key to talk back and forth, but this setup is a little different – we need to install the API key inside of the extension.
Inside the extension is a manifest.json file, which needs to hold the NopeCHA API key. We're going to open it up, add in our API key, and save it again.
You can get a NopeCHA API key on their website. It seems a little sketchy but it worked fine for me – twenty minutes after the credit card payment I had an API key.
# Open existing manifest
with open("nopecha/manifest.json") as fp:
    data = json.load(fp)
# Update with API key
data['nopecha']['key'] = "YOUR API KEY"
# Save to manifest
with open("nopecha/manifest.json", "w") as fp:
    json.dump(data, fp)
You only need to do this once! Once you have the downloaded extension folder with an API key inside, you're free to re-use it for every project.
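If you'd rather not paste the key directly into your code, the same manifest update can be wrapped in a small function. This is just a sketch: `set_api_key` and the `NOPECHA_API_KEY` environment variable are names invented for this example, and the only thing taken from the extension itself is the `['nopecha']['key']` location used above.

```python
import json


def set_api_key(manifest_path: str, api_key: str) -> None:
    """Write api_key into the manifest, leaving every other setting alone.

    Hypothetical helper: not part of NopeCHA.
    """
    with open(manifest_path) as fp:
        data = json.load(fp)
    # setdefault avoids a KeyError if the "nopecha" section is missing
    data.setdefault("nopecha", {})["key"] = api_key
    with open(manifest_path, "w") as fp:
        json.dump(data, fp, indent=2)


# Example usage, assuming you exported NOPECHA_API_KEY in your shell:
#   import os
#   set_api_key("nopecha/manifest.json", os.environ["NOPECHA_API_KEY"])
```

Keeping the key in an environment variable means the notebook can be shared without the key going along for the ride.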
Launch Playwright
According to the Playwright documentation, you need to launch the browser a specific way in order to use Chrome extensions. It looks awful but hey, it works! We also need to provide a place to save the data from the browser. In this case we're making a new directory called user-data.
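import asyncio
from playwright.async_api import async_playwright
# Top-level await works in a notebook; in a plain script, put this inside an async function and run it with asyncio.run()
user_data_dir = "./user-data"
playwright = await async_playwright().start()
# launch_persistent_context returns a browser context, which is what lets the extension load
browser = await playwright.chromium.launch_persistent_context(
    user_data_dir,
    headless=False,
    ignore_default_args=['--enable-automation'],
    args=[
        "--disable-extensions-except=./nopecha",
        "--load-extension=./nopecha",
    ],
)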
When the browser launches it might have some error-looking messages up top. You can ignore them.
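The two flags both point at the same extension folder, so if you plan on re-using the folder across projects it may be handy to build them in one place. A rough sketch, where `extension_launch_args` is a made-up helper name and resolving to an absolute path is my own choice rather than something the launch requires:

```python
from pathlib import Path


def extension_launch_args(extension_dir: str = "./nopecha") -> list[str]:
    """Build the two Chromium flags that load one unpacked extension.

    Hypothetical helper: not part of Playwright.
    """
    # An absolute path keeps working if the notebook changes directories
    path = Path(extension_dir).resolve()
    if not (path / "manifest.json").is_file():
        raise FileNotFoundError(f"no manifest.json in {path}")
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]
```

The returned list could then be passed as `args=extension_launch_args()` when launching the persistent context.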
Now we take the page it opened and head on over to the NopeCHA demo page.
# The persistent context already opens with a page, so reuse it
# instead of creating one with browser.new_page()
page = browser.pages[0]
await page.goto("https://nopecha.com/demo")
Then we click the Easy reCAPTCHA link, and it loads the page. The NopeCHA extension immediately gets to work, but note that the CAPTCHA won't be solved immediately, and maybe not even on the first try.
After you wait a bit you'll hopefully see it pass the test!
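If you want your code to wait along with you, one option is to keep checking the page until a solved token shows up. Below is a minimal polling sketch. `wait_for_token` is a name invented for this example, and the usage in the comments assumes the page stores the solved reCAPTCHA token in a field named `g-recaptcha-response`, which I haven't verified against the demo page.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_token(
    get_token: Callable[[], Awaitable[str]],
    timeout: float = 120.0,
    interval: float = 2.0,
) -> str:
    """Call get_token repeatedly until it returns a non-empty string.

    Hypothetical helper: raises TimeoutError if nothing arrives in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        token = await get_token()
        if token:
            return token
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no CAPTCHA token after {timeout} seconds")
        # Give the extension a moment before checking again
        await asyncio.sleep(interval)


# Example usage in the notebook (the selector is an assumption):
#   js = "() => document.querySelector('[name=g-recaptcha-response]')?.value ?? ''"
#   token = await wait_for_token(lambda: page.evaluate(js))
```

Since the solver may need more than one attempt, a generous timeout is probably the safer default.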