Avoiding bans
Are you a person who has been banned from AZLyrics for scraping too much?!?! CONGRATULATIONS!!! Your IP can’t access AZLyrics for a while!
Is the ban for 12 hours? 24 hours? 2 days? I have no idea.
The ban is based on your IP address, which some of you might share, which means one of you might cause other people to be banned! So exciting!
Step 1: Change your IP
Some of you changed your IP by scraping again at home, or at a coffee shop, or using a VPN or proxy. With that new IP address, AZLyrics didn't know who you were and everything worked fine! …for a little bit.
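If you want to try the proxy route in code, requests can send your traffic through a proxy server. This is only a sketch - the proxy address is a made-up placeholder, and you'd need an actual proxy service to fill it in:

import requests

# A made-up placeholder address - swap in a real proxy service
proxies = {
    'http': 'http://some-proxy-server:8080',
    'https': 'http://some-proxy-server:8080',
}

# requests routes the request through the proxy instead of your own IP
response = requests.get('https://www.azlyrics.com/', proxies=proxies)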
Then YOU GOT BANNED AGAIN!!! Amazing!!!
And when you got banned again, all of that extra scraping went out the window! Because you started over from the beginning each time, maybe you made it halfway through twice before getting banned. But both times you only scraped the first half!
Right now all of the code you’re writing assumes you 100% finish your scraping when you start, which is not always realistic.
Step 2: Protect yourself from errors
Normally, we just do something like this to scrape:
scraped_df = df.apply(scrape_page, axis=1)
scraped_df
and then merge like this:
new_df = df.join(scraped_df, rsuffix='_scraped')
new_df
We need to make a little change! Let's make sure our scraping function will NOT freak out if requests runs into an error. We can do this by wrapping everything in try/except (any other try/except blocks inside will still work, don't worry).
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_page(row):
    try:
        url = "blah blah blah"
        response = requests.get(url)
        doc = BeautifulSoup(response.text)

        data = {}
        # ..... code code code .....
        # ..... code code code .....
        # ..... code code code .....

        # Everything worked: send back the scraped data
        return pd.Series(data)
    except:
        # Something went wrong: send back nothing, which becomes NaN
        return pd.Series({})
When we run this, it returns info for everything that it successfully scrapes and NaN for every row it can't scrape.

It's nice to get a dataframe back instead of an error, but every song we try to scrape after we're banned will have NaN lyrics. How do we tell pandas to scrape those again?
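Before we fix anything, it can help to see how big the problem is. This one-liner just counts the rows that still need scraping (using the same lyrics column as the code below):

# Count how many rows ended up with NaN lyrics after the first pass
new_df.lyrics.isna().sum()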
Step 3: Set things up to resume
We just need to fill in the missing data, right? We can make a new dataframe of only missing data like this:
missing_df = new_df[new_df.lyrics.isna()]
missing_df.head()
and then use .apply to run the scraper only for those rows, making a new scraped_df that's only the missing stuff.
scraped_df = missing_df.apply(scrape_page, axis=1)
scraped_df
Now we just need to move those new lyrics into new_df (which is our original dataframe + the original scraping). But new_df already has some lyrics in it - how do we have pandas fill in the missing parts?

Don't try to merge them! That might lose all of the lyrics we already have! Instead, we use .fillna with scraped_df to fill in everywhere that has NaN data:
new_df.fillna(scraped_df, inplace=True)
new_df
So after you merge your first scraping attempt and your original dataframe, just repeat those few lines again and again until you're done: find the missing rows, scrape the missing rows, new_df.fillna with the scraped data. Eventually you'll have it all filled in!
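If you'd rather not babysit the notebook, here's a minimal sketch of that repeat-until-done loop. The max_attempts cap is my own addition so a permanent ban can't loop forever, and it assumes the same df, scrape_page, and lyrics column from above:

# First pass: scrape everything and merge with the original dataframe
scraped_df = df.apply(scrape_page, axis=1)
new_df = df.join(scraped_df, rsuffix='_scraped')

# Keep re-scraping just the missing rows until nothing is missing
# (or until we hit our arbitrary cap and give up)
max_attempts = 5
for attempt in range(max_attempts):
    missing_df = new_df[new_df.lyrics.isna()]
    if len(missing_df) == 0:
        break
    print(len(missing_df), "rows still missing")
    # Scrape only the rows that are still missing...
    scraped_df = missing_df.apply(scrape_page, axis=1)
    # ...and fill the NaN spots with whatever we got this time
    new_df.fillna(scraped_df, inplace=True)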