Selenium-Playwright scraping command conversion reference
I've been using Selenium for automated scraping of interactive websites for hundreds if not thousands of years, but Playwright seems pretty good. Let's build a quick reference to compare the two.
Is Playwright better than Selenium?
Playwright is newer than Selenium, and oftentimes has better documentation. On the other hand, it's built for JavaScript and its Python usage is a little awkward compared to "normal" Python code.
Installation
Installation is slightly more difficult for Selenium, in that you need to install Selenium, a browser, and a webdriver, which is what talks to the browser. Playwright doesn't need a separate webdriver, but it does need a browser.
Basic imports
Selenium has approximately ten million imports. You don't necessarily need all of them, but if you're using dropdowns and waiting for the page to load and blah blah all sorts of particular things, they add up quickly.
Playwright has maybe two imports. The Selenium side looks more like this:

```python
from selenium import webdriver                                    # the browser driver itself
from selenium.webdriver.common.by import By                       # CSS/XPath/link-text lookups
from selenium.webdriver.support.ui import Select, WebDriverWait   # dropdowns and waiting
from selenium.webdriver.support import expected_conditions as EC  # conditions to wait for
from webdriver_manager.chrome import ChromeDriverManager          # downloads ChromeDriver
```
Basic setup and usage
In the Selenium example, we're going to use Webdriver Manager to automatically download the latest version of ChromeDriver for us. This is a great way to avoid having to manually download and install the driver.
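Here's a rough sketch of what the two setups can look like side by side. The function names `html_with_selenium` and `html_with_playwright` are made up for this example, and the Selenium half assumes Selenium 4 with the `webdriver-manager` package installed.

```python
def html_with_selenium(url: str) -> str:
    """Open Chrome with Selenium, load `url`, and return the page HTML."""
    # Imported inside the function so this file still loads when only
    # one of the two libraries is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # Webdriver Manager downloads ChromeDriver and returns the path to it.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if the page failed to load.
        driver.quit()


async def html_with_playwright(url: str) -> str:
    """Open Chromium with Playwright, load `url`, and return the page HTML."""
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()
```

The Selenium version is a normal function call, `html_with_selenium("https://example.com")`. The Playwright version has to be run through asyncio, for example `asyncio.run(html_with_playwright("https://example.com"))`.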
Showing the browser window
Sometimes you want to see the browser while you're scraping. It's useful for debugging, and it's also useful for seeing if you're getting blocked by a CAPTCHA or something. It also feels pretty cool to watch the browser do its thing.
Hiding the browser window (headless)
Running "headless" (hiding the browser window) is a good way to make your scraping faster and more efficient.
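The two libraries have opposite defaults: Selenium shows the window unless you ask it not to, while Playwright hides the window unless you pass `headless=False`. A minimal sketch of both, where the helper names `selenium_chrome` and `playwright_chromium` are invented for illustration and the `--headless=new` flag assumes a reasonably recent Chrome:

```python
def selenium_chrome(headless: bool = False):
    """Start Chrome through Selenium, optionally without a visible window."""
    # Imported here so the file loads even if Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        # Assumption: a recent Chrome that understands the "new" headless mode.
        options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


async def playwright_chromium(playwright, headless: bool = True):
    """Start Chromium through Playwright; pass headless=False to watch it work."""
    # `playwright` is the object you get from `async with async_playwright()`.
    return await playwright.chromium.launch(headless=headless)
```

While debugging, `selenium_chrome()` and `await playwright_chromium(playwright, headless=False)` both give you a window to watch. Flip the flags once the scraper works.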
Why does Playwright have `await` all over the place, while Selenium doesn't?

The Playwright code here uses its asynchronous API, which means that it can have multiple things in progress at once. This is great for scraping, because it means that you can do things like click a button and wait for the page to load at the same time. However, it also means that you have to use `await` to tell the program to wait for the asynchronous function to finish before moving on to the next line of code. This is why you see `await` everywhere in the Playwright code.

Basically: put `await` in front of anything that talks to the browser and it will probably work. The main exception is `page.locator(...)`, which only describes what to look for and doesn't need one.
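If `await` is new to you, here is a tiny demonstration that uses only Python's built-in `asyncio`, with no browser involved. The function `fake_page_load` is a stand-in invented for this example, and `asyncio.sleep` plays the part of slow browser work.

```python
import asyncio


async def fake_page_load(name: str, seconds: float) -> str:
    # asyncio.sleep stands in for slow browser work such as loading a page.
    await asyncio.sleep(seconds)
    return f"{name} loaded"


async def main() -> list[str]:
    # One at a time: await pauses main() until this call has finished.
    first = await fake_page_load("page one", 0.02)
    # Two at once: gather() runs both coroutines concurrently and
    # returns their results in the order they were passed in.
    rest = await asyncio.gather(
        fake_page_load("page two", 0.02),
        fake_page_load("page three", 0.01),
    )
    return [first, *rest]


results = asyncio.run(main())
print(results)  # ['page one loaded', 'page two loaded', 'page three loaded']
```

Calling `fake_page_load(...)` without `await` would hand back a coroutine object instead of the string, which is the usual symptom of a forgotten `await`.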
Visiting pages
Description | Selenium | Playwright |
---|---|---|
Open a browser | `driver = webdriver.Chrome()` | `browser = await playwright.chromium.launch()` |
Open a headless browser | `driver = webdriver.Chrome(options=options)` | `browser = await playwright.chromium.launch(headless=True)` |
Visit a URL | `driver.get('https://www.washingtonpost.com')` | `await page.goto('https://www.washingtonpost.com')` |
Wait for page to fully load | n/a | `await page.goto("https://www.washingtonpost.com", wait_until="networkidle")` |
Wait for element to show up on page | `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table.results')))` | `await page.locator('table.results').wait_for()` |
Wait for a page to load (with timeout) | `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table.results')))` | `await page.locator('table.results').wait_for(timeout=10000)` |
Give the page HTML to BeautifulSoup | `doc = BeautifulSoup(driver.page_source, 'html.parser')` | `doc = BeautifulSoup(await page.content(), 'html.parser')` |
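Strung together, the Playwright column might look something like the sketch below. The function name `load_results_html` and its default arguments are assumptions for this example. It takes an already-open Playwright `page`, so it doesn't import anything itself.

```python
async def load_results_html(page, url, selector="table.results", timeout_ms=10_000):
    """Visit `url`, wait for `selector` to show up, and return the page HTML."""
    # wait_until="networkidle" holds off until network activity has settled.
    await page.goto(url, wait_until="networkidle")
    # Playwright timeouts are in milliseconds; Selenium's WebDriverWait uses seconds.
    await page.locator(selector).wait_for(timeout=timeout_ms)
    return await page.content()
```

The returned string can go straight into `BeautifulSoup(html, 'html.parser')`, just like `driver.page_source` does on the Selenium side.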
Selecting elements on the page
Description | Selenium | Playwright |
---|---|---|
Find element by CSS selector | `driver.find_element(By.CSS_SELECTOR, 'button.submit')` | `page.locator('button.submit')` |
Selecting multiple elements by CSS selector | `driver.find_elements(By.CSS_SELECTOR, '.row')` | `await page.locator('.row').all()` |
Find element by XPath | `driver.find_element(By.XPATH, '//button')` | `page.locator('//button')` |
Find element by complete text | `driver.find_element(By.LINK_TEXT, 'Click me')` | `page.locator('text="Click me"')` |
Find element by partial text | `driver.find_element(By.PARTIAL_LINK_TEXT, 'Click me')` | `page.locator('a:has-text("Click me")')` |
Find element by partial text in href attribute | `driver.find_element(By.CSS_SELECTOR, 'a[href*="url-to-somewhere"]')` | `page.locator('a[href*="url-to-somewhere"]')` |
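One thing that trips people up coming from Selenium: `page.locator(...)` doesn't fetch anything by itself, so it isn't awaited. The `await` goes on whatever you do with the locator afterwards. A small sketch, with the function name `collect_links` made up for the example:

```python
async def collect_links(page, selector="a"):
    """Return (text, href) pairs for every element matching `selector`."""
    links = []
    # locator() is not awaited; all() is, and gives one locator per match.
    for link in await page.locator(selector).all():
        text = await link.inner_text()
        href = await link.get_attribute("href")  # None when there is no href
        links.append((text.strip(), href))
    return links
```

With Selenium you would loop over `driver.find_elements(...)` and read `.text` and `.get_attribute('href')` from each element instead.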
Interacting with the page
Description | Selenium | Playwright |
---|---|---|
Click a button | `driver.find_element(By.CSS_SELECTOR, 'button').click()` | `await page.click('button')` or `await button.click()` |
Fill a form | `driver.find_element(By.CSS_SELECTOR, 'input.name').send_keys('My name')` | `await page.fill('input.name', 'My name')` |
Select an option | `Select(driver.find_element(By.CSS_SELECTOR, 'select#company')).select_by_value('Pigeons LLC')` | `await page.select_option('select#company', 'Pigeons LLC')` |
Switching to a newly opened tab | `driver.switch_to.window(driver.window_handles[-1])` | it depends |

The Playwright docs have a great page on "multi-page scenarios," including handling new pages and popups.
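For the common "click a link, get a new tab" case, one pattern is to start listening for the new page before you click. This is only a sketch: the function name `open_in_new_tab` is invented here, and `context` is assumed to be the browser context that `page` belongs to (for example, `page.context`).

```python
async def open_in_new_tab(context, page, link_selector):
    """Click a link that opens a new tab and return the new page."""
    # Start listening before the click so the new page is not missed.
    async with context.expect_page() as new_page_info:
        await page.click(link_selector)
    new_page = await new_page_info.value
    # Give the new tab a chance to finish loading before reading from it.
    await new_page.wait_for_load_state()
    return new_page
```

Unlike Selenium, nothing gets "switched." The original `page` and the returned `new_page` are separate objects, and you can keep using both.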
Closing the browser
Description | Selenium | Playwright |
---|---|---|
Close the browser | `driver.quit()` | `await browser.close()` |