Avoiding bans
Are you a person who has been banned from AZLyrics for scraping too much?!?! CONGRATULATIONS!!! Your IP can’t access AZLyrics for a while!
Is the ban for 12 hours? 24 hours? 2 days? I have no idea.
The ban is based on your IP address, which some of you might share, which means one of you might cause other people to be banned! So exciting!
Step 1: Change your IP
Some of you changed your IP by scraping again at home, or at a coffee shop, or using a VPN or proxy. With that new IP address, AZLyrics didn't know who you were and everything worked fine! …for a little bit.
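If you want to try the proxy route in code, requests can send your traffic through a proxy server. This is only a sketch - the proxy address is a made-up placeholder, and you'd need an actual proxy service to fill it in:

import requests

# A made-up placeholder address - swap in a real proxy service
proxies = {
    'http': 'http://some-proxy-server:8080',
    'https': 'http://some-proxy-server:8080',
}

# requests routes the request through the proxy instead of your own IP
response = requests.get('https://www.azlyrics.com/', proxies=proxies)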
Then YOU GOT BANNED AGAIN!!! Amazing!!!
And when you got banned again, all of that extra scraping went out the window! Because you started over from the beginning each time, maybe you made it halfway through twice before getting banned. But both times you only scraped the first half!
Right now all of the code you’re writing assumes you 100% finish your scraping when you start, which is not always realistic.
Step 2: Protect yourself from errors
Normally, we just do something like this to scrape:
scraped_df = df.apply(scrape_page, axis=1)
scraped_df
and then merge like this:
new_df = df.join(scraped_df, rsuffix='_scraped')
new_df
We need to make a little change! Let's make sure our scraping function will NOT freak out if requests runs into an error. We can do this by wrapping everything in try/except (any other try/except blocks inside will still work, don't worry).
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_page(row):
    try:
        url = "blah blah blah"
        response = requests.get(url)
        doc = BeautifulSoup(response.text)

        data = {}
        # ..... code code code .....
        # ..... code code code .....
        # ..... code code code .....

        # Everything worked: send back the scraped data
        return pd.Series(data)
    except:
        # Something went wrong: send back nothing, which becomes NaN
        return pd.Series({})
When we run this, it returns info for everything that it successfully scrapes and NaN for every row it can't scrape.

It's nice to get a dataframe back instead of an error, but every song we try to scrape after we're banned will have NaN lyrics. How do we tell pandas to scrape those again?
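Before we fix anything, it can help to see how big the problem is. This one-liner just counts the rows that still need scraping (using the same lyrics column as the code below):

# Count how many rows ended up with NaN lyrics after the first pass
new_df.lyrics.isna().sum()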
Step 3: Set things up to resume
We just need to fill in the missing data, right? We can make a new dataframe of only missing data like this:
missing_df = new_df[new_df.lyrics.isna()]
missing_df.head()
and then use .apply to run the scraper only for those rows, making a new scraped_df that's only the missing stuff.
scraped_df = missing_df.apply(scrape_page, axis=1)
scraped_df
Now we just need to move those new lyrics into new_df (which is our original dataframe + the original scraping). But new_df already has some lyrics in it - how do we have pandas fill in the missing parts?

Don't try to merge them! That might lose all of the lyrics we already have! Instead, we use .fillna with scraped_df to fill in everywhere that has NaN data:
new_df.fillna(scraped_df, inplace=True)
new_df
So after you merge your first scraping attempt and your original dataframe, just repeat those few lines again and again until you're done: find the missing rows, scrape the missing rows, new_df.fillna with the scraped data. Eventually you'll have it all filled in!
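If you'd rather not babysit the notebook, here's a minimal sketch of that repeat-until-done loop. The max_attempts cap is my own addition so a permanent ban can't loop forever, and it assumes the same df, scrape_page, and lyrics column from above:

# First pass: scrape everything and merge with the original dataframe
scraped_df = df.apply(scrape_page, axis=1)
new_df = df.join(scraped_df, rsuffix='_scraped')

# Keep re-scraping just the missing rows until nothing is missing
# (or until we hit our arbitrary cap and give up)
max_attempts = 5
for attempt in range(max_attempts):
    missing_df = new_df[new_df.lyrics.isna()]
    if len(missing_df) == 0:
        break
    print(len(missing_df), "rows still missing")
    # Scrape only the rows that are still missing...
    scraped_df = missing_df.apply(scrape_page, axis=1)
    # ...and fill the NaN spots with whatever we got this time
    new_df.fillna(scraped_df, inplace=True)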