Scraping Twitter with headless Playwright (anti-bot detection)¶
Scraping Twitter with Playwright requires you to circumvent their anti-bot detection. Working with "normal" (headed) Playwright is fine. It's only when you go headless, with no visible browser window, that Twitter tries to block you.
Basic setup¶
After you install Playwright with `pip install playwright`, you'll need to run `playwright install` to install a browser (or two or three or four!).
Opening the browser¶
You can pick the browser you want to use. Each one has its own ups and downs.
Chromium¶
The solution below allows you to use the Chromium browser (open source Chrome) to browse Twitter. This will look most similar to what you see if you're a Chrome user.
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
device = playwright.devices["Desktop Chrome"]
browser = await playwright.chromium.launch()
context = await browser.new_context(**device)
page = await context.new_page()
await page.goto("https://twitter.com/dangerscarf/status/1638163206349651975")
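One likely reason the device descriptor matters: headless Chromium normally announces itself with a `HeadlessChrome` token in its user agent, while `playwright.devices["Desktop Chrome"]` supplies a regular desktop user agent. Whether that is the exact signal Twitter checks is an assumption, not something confirmed here. A minimal sketch for comparing the two, where `looks_headless` and `current_user_agent` are hypothetical helper names:

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the user agent string advertises a headless browser."""
    return "headless" in user_agent.lower()


async def current_user_agent(use_device: bool) -> str:
    """Launch headless Chromium and report the user agent the page sees."""
    # Imported here so the helper above works without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        # With use_device=False the context keeps the browser's default user agent
        options = playwright.devices["Desktop Chrome"] if use_device else {}
        context = await browser.new_context(**options)
        page = await context.new_page()
        user_agent = await page.evaluate("navigator.userAgent")
        await browser.close()
        return user_agent
```

In a notebook, `looks_headless(await current_user_agent(use_device=False))` would be expected to come back `True` on Chromium builds that use the `HeadlessChrome` token, and `False` once the device descriptor is applied.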
Firefox¶
If you want to see videos, you'll need to use Firefox instead. I think there's an issue with the codec when using Chromium.
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
device = playwright.devices["Desktop Firefox"]
browser = await playwright.firefox.launch()
context = await browser.new_context(**device)
page = await context.new_page()
await page.goto("https://twitter.com/dangerscarf/status/1638163206349651975")
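The two snippets differ only in the browser type and the device descriptor, and both rely on top-level `await`, which works in a notebook but not in a plain script. A minimal sketch of one parameterized helper follows. `DEVICE_FOR_BROWSER`, `device_name_for`, and `open_tweet` are hypothetical names, not part of Playwright.

```python
# Browser type -> device descriptor, matching the two examples above
DEVICE_FOR_BROWSER = {
    "chromium": "Desktop Chrome",
    "firefox": "Desktop Firefox",
}


def device_name_for(browser_name: str) -> str:
    """Look up the device descriptor name for a supported browser."""
    try:
        return DEVICE_FOR_BROWSER[browser_name]
    except KeyError:
        raise ValueError(f"unsupported browser: {browser_name!r}") from None


async def open_tweet(url: str, browser_name: str = "chromium") -> str:
    """Open a tweet headlessly and return the page title."""
    # Imported lazily so the lookup helper works without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        device = playwright.devices[device_name_for(browser_name)]
        browser_type = getattr(playwright, browser_name)  # .chromium or .firefox
        browser = await browser_type.launch()
        context = await browser.new_context(**device)
        page = await context.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title
```

In a notebook you can `await open_tweet(url, "firefox")` directly. From a script, after `import asyncio`, the equivalent would be `asyncio.run(open_tweet(url, "firefox"))`.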
Getting tweet content¶
Getting the tweet content is easy: you visit the page, wait for the "Reply" button to show up, then pull the HTML from the element that has `data-testid='tweet'`.
await page.goto("https://twitter.com/dangerscarf/status/1638163206349651975")
await page.wait_for_selector("[aria-label=\"Reply\"]")
tweet = page.locator('[data-testid="tweet"]')
html = await tweet.inner_html()
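Once you have the HTML, you'll often want plain text out of it. Here is a minimal sketch using only the standard library, assuming you simply want every text node inside the tweet element. `TextCollector` and `html_to_text` are hypothetical names, and the sample fragment is made up for illustration rather than real Twitter markup.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the fragment's text, one text node per line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up fragment standing in for the `html` variable from the snippet above
sample = '<div><span>Some user</span><div lang="en">Hello <a href="#">world</a></div></div>'
print(html_to_text(sample))
```

If you don't need the markup at all, calling `await tweet.inner_text()` on the same locator is another option.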
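If you want videos, embeds, etc. to show up, you might want to pause for a few seconds after you visit the page to be sure everything loads. In async code, `await page.wait_for_timeout(4000)` does this without blocking the event loop the way `time.sleep(4)` would.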
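A fixed pause is a guess: too short and media is missing, too long and you wait for nothing. One hedged alternative is to poll for a condition with a timeout. In the sketch below `poll_until` is a hypothetical helper, and what counts as "loaded" is left to the check you pass in.

```python
import asyncio
import time


async def poll_until(check, timeout: float = 4.0, interval: float = 0.25) -> bool:
    """Await `check()` repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if time ran out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await check():
            return True
        if time.monotonic() >= deadline:
            return False
        # asyncio.sleep yields to the event loop instead of blocking it
        await asyncio.sleep(interval)
```

With a page open, you could define `async def has_video(): return await page.locator("video").count() > 0` and call `await poll_until(has_video)`. That stops waiting as soon as a video element appears and gives up after four seconds otherwise. Treating a `video` element as the signal is an assumption about the tweet you are loading.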
Screenshotting tweets¶
There are a few issues with screenshots right out of the gate:
- There are SO MANY BANNERS about signing up for Twitter
- Sometimes the height of the tweet means you can't see who tweeted it.
The solution below solves both of those: it removes the banners and uses a very tall viewport so the whole tweet fits. It also uses Firefox to make sure you get screenshots from videos. If you don't have Firefox installed with Playwright, just run `playwright install firefox` from the command line.
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
device = playwright.devices["Desktop Firefox"]
browser = await playwright.firefox.launch()
# Use a very tall viewport so the entire tweet fits on screen
device['viewport'] = {
    'width': 1280,
    'height': 3000
}
context = await browser.new_context(**device)
page = await context.new_page()
# Visit the page
await page.goto("https://twitter.com/dangerscarf/status/1638163206349651975")
await page.wait_for_selector("[aria-label=\"Reply\"]")
# Hope everything loads (this pauses without blocking the event loop, unlike time.sleep)
await page.wait_for_timeout(4000)
# Clean up the page, remove banners
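await page.evaluate("""
() => {
    // Remove the sign-up bar pinned to the bottom of the page, if it exists
    document.querySelector('[data-testid="BottomBar"]')?.remove()
    // Remove the container of the sign-up dialog, if it exists
    document.querySelector('[aria-label="sheetDialog"]')?.parentNode.remove()
}
""")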
# Take the screenshot
tweet = page.locator('[data-testid="tweet"]')
await tweet.screenshot(path='screenshot.png')
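To screenshot several tweets you need a distinct output path for each one. Here is a minimal sketch that derives a filename from the tweet URL, assuming the `https://twitter.com/<user>/status/<id>` shape used throughout this page. `screenshot_name` is a hypothetical helper.

```python
from urllib.parse import urlparse


def screenshot_name(tweet_url: str) -> str:
    """Build '<user>_<id>.png' from a URL shaped like /<user>/status/<id>."""
    parts = [p for p in urlparse(tweet_url).path.split("/") if p]
    if len(parts) < 3 or parts[1] != "status":
        raise ValueError(f"not a tweet URL: {tweet_url!r}")
    user, tweet_id = parts[0], parts[2]
    return f"{user}_{tweet_id}.png"


print(screenshot_name("https://twitter.com/dangerscarf/status/1638163206349651975"))
# prints: dangerscarf_1638163206349651975.png
```

You could then pass `path=screenshot_name(url)` to `tweet.screenshot(...)`. When you are done, `await browser.close()` followed by `await playwright.stop()` shuts down the browser that was started with `async_playwright().start()`.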