# Scraping: http://www.bbc.co.uk/news

Let's try to scrape the frontpage of BBC News. We're looking for

* Headlines
* Summaries
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [5]:
import requests
from bs4 import BeautifulSoup

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [7]:
response = requests.get('http://www.bbc.co.uk/news')
doc = BeautifulSoup(response.text, 'html.parser')

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.
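
One thing the cell above takes on faith is that the download actually worked. Here is a minimal sketch of a slightly more defensive version. The helper name `fetch_doc` and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_doc(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # timeout stops the request from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # quietly handing an error page to BeautifulSoup
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


# Usage (needs network access):
# doc = fetch_doc('http://www.bbc.co.uk/news')
```

If the request fails, you find out right away instead of wondering later why `find_all` returns nothing.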

## ATTEMPT ONE: Grabbing the tags directly

When we look at the page and try to use the little arrow-selecty-thing to pick out a headline, **disaster strikes**. We can't touch it! Apparently it selects the ENTIRE BLOCK or something crazy like that?

But luckily we understand HTML, so we can click around in the Elements panel on the right. We navigate to the `h3` tag, which we know is the headline based on the tag name and the content.

Hm, what if we just grab all of the `h3` tags?

In [9]:
headlines = doc.find_all('h3')

for headline in headlines:
    print(headline.text)

US 'will not repeat' GCHQ wiretap claims
George Osborne to edit London newspaper
'Huge advance' in fighting biggest killer
Sturgeon up for referendum date talks
No punishment for man who raped girl, 12
Dunham hits back after weight criticism
US warns N Korea of military option
US warns N Korea of military option
Guinness for royals on St Patrick's Day
Mother jailed for hiding baby's death
Man dies charging iPhone in bath
Hungary puts migrants in containers
Love Actually cast reunites for Comic Relief
Sizing John wins Cheltenham Gold Cup
Miliband's joke steals Osborne's limelight
BBC News Channel
BBC Radio 4 - PM
Jam? Meet the Michael Jackson traffic cop
Bake Off line-up: A winning recipe?
Weekly quiz: In which soap did Ed Sheeran appear?
What's the etiquette on vaping?
Tall driver told to 'grow up' by judge
The mystery of the murder in the Lucky Holiday Hotel
Britain’s bungled effort to clean up its first big oil spill
Sweden's Got Talent: The pop star with sleep paralysis
George Osbor

SO EASY, right? Kind of? Mostly it worked? ...except it doesn't have the link, nor does it have the summary.

Okay, so we could also get all of the `a` tags, but there are probably a lot of garbage `a` tags - footer content and stuff. Maybe the article `a` tags have a special class? If we take a look, we see `class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold"`. This isn't just one class, it's **many classes**.

* `gs-c-promo-heading`
* `nw-o-link-split__anchor`
* `gs-o-faux-block-link__overlay-link`
* `gel-pica-bold`

This is where guesswork comes in. I think `gs-c-promo-heading` seems reasonable!
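
As a quick aside, here is why picking just one class out of that long list can work. This is a small sketch on a made-up snippet (not the real BBC markup) showing that searching for a single class matches any tag that has it among its classes. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two links share one class, a third has neither
html = """
<a class="gs-c-promo-heading gel-pica-bold" href="/news/one">One</a>
<a class="gs-c-promo-heading nw-o-link-split__anchor" href="/news/two">Two</a>
<a class="footer-link" href="/about">About</a>
"""
snippet = BeautifulSoup(html, 'html.parser')

# Searching for ONE class matches any tag that has it among its classes
by_dict = snippet.find_all('a', {'class': 'gs-c-promo-heading'})
# class_ is BeautifulSoup's keyword-argument spelling of the same search
by_keyword = snippet.find_all('a', class_='gs-c-promo-heading')
# select() takes a CSS selector, where a dot means "has this class"
by_css = snippet.select('a.gs-c-promo-heading')

print([a.text for a in by_dict])
```

All three searches find the same two links, so which spelling you use is mostly a matter of taste.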

In [10]:
links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)

US 'will not repeat' GCHQ wiretap claims
George Osborne to edit London newspaper
'Huge advance' in fighting biggest killer
Sturgeon up for referendum date talks
No punishment for man who raped girl, 12
Dunham hits back after weight criticism
US warns N Korea of military option
US warns N Korea of military option
VideoGuinness for royals on St Patrick's Day
Mother jailed for hiding baby's death
Man dies charging iPhone in bath
VideoHungary puts migrants in containers
VideoLove Actually cast reunites for Comic Relief
Sizing John wins Cheltenham Gold Cup
Miliband's joke steals Osborne's limelight
VideoBBC News Channel
AudioBBC Radio 4 - PM
VideoJam? Meet the Michael Jackson traffic cop
Bake Off line-up: A winning recipe?
Weekly quiz: In which soap did Ed Sheeran appear?
What's the etiquette on vaping?
Tall driver told to 'grow up' by judge
The mystery of the murder in the Lucky Holiday Hotel
Britain’s bungled effort to clean up its first big oil spill
Sweden's Got Talent: The pop star wit

That looks pretty good, too! It's getting the `h3` text because the `h3` is inside of the `a` tag, but it doesn't have the *actual link*, the URL. If we look at the `a` tag...

    <a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">
   
...the URL is hiding in the `href` attribute. Once we have the link, getting an attribute is actually easy: you just use `['href']`.
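
Before running that on the real page, here is a small sketch of attribute access on simplified stand-in tags. It also shows two things the notebook does not do itself, so treat them as optional extras: `.get()` for attributes that might be missing, and `urljoin` from the standard library for turning a relative `href` into a full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.bbc.co.uk/news'
# Simplified stand-ins for the real tags; the second one has no href
snippet = BeautifulSoup(
    '<a href="/news/world-middle-east-39302560">Story</a><a>No link</a>',
    'html.parser',
)
with_href, without_href = snippet.find_all('a')

# Square brackets read an attribute, but raise KeyError if it is missing
relative = with_href['href']
# .get() returns None instead of raising when the attribute is missing
missing = without_href.get('href')
# The hrefs on the page are relative, so join them to the page URL
absolute = urljoin(base_url, relative)

print(relative)
print(missing)
print(absolute)
```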

In [11]:
links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)
    print(link['href'])

US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
George Osborne to edit London newspaper
/news/uk-39304944
'Huge advance' in fighting biggest killer
/news/health-39305640
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
US warns N Korea of military option
/news/world-asia-39297031
US warns N Korea of military option
/news/world-asia-39297031
VideoGuinness for royals on St Patrick's Day
/news/uk-39308969
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
Man dies charging iPhone in bath
/news/uk-39307418
VideoHungary puts migrants in containers
/news/world-europe-39301003
VideoLove Actually cast reunites for Comic Relief
/news/entertainment-arts-39301010
Sizing John wins Cheltenham Gold Cup
/sport/horse-racing/39307278
Miliband's joke steals Osborne's limelight
/ne

Cool, eh? But now we have one final problem: **we don't have the summaries**. Well, we can just use the Inspector to pick one out...

    <p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">Trade and Nato are high on the agenda as the much-anticipated Washington talks begin.</p>

Once again, we have a selection of options. `gs-c-promo-summary` seems promising.

In [13]:
summaries = doc.find_all('p', { 'class': 'gs-c-promo-summary' })

for summary in summaries:
    print(summary.text)

Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
Actress Lena Dunham, star and creator of Girls, says her recent weight loss "isn't a triumph".
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
The Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.
Victoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.
Camps for asylum seekers are being built out of shipping containers, despit

Great, but **now we're stuck:** we don't have a way of combining the headlines and links to the summaries, and even if we did (cough`zip`cough), we couldn't be sure that they'd match up.
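
To make the matching-up worry concrete, here is a sketch on a made-up page where one story has no summary. The markup is invented for the example; only the class names are borrowed from the real page.

```python
from bs4 import BeautifulSoup

# Made-up page: the second story has no summary
html = """
<div class="gs-c-promo"><h3>First</h3><p class="gs-c-promo-summary">About first</p></div>
<div class="gs-c-promo"><h3>Second</h3></div>
<div class="gs-c-promo"><h3>Third</h3><p class="gs-c-promo-summary">About third</p></div>
"""
snippet = BeautifulSoup(html, 'html.parser')

headlines = [h.text for h in snippet.find_all('h3')]
summaries = [p.text for p in snippet.find_all('p', {'class': 'gs-c-promo-summary'})]

# zip pairs items by position and stops at the shorter list,
# so "Second" gets paired with the summary that belongs to "Third"
pairs = list(zip(headlines, summaries))
for headline, summary in pairs:
    print(headline, '->', summary)
```

Nothing crashes and nothing warns you. The data is just quietly wrong, which is the worst kind of wrong.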

What the heck do we do now?

## ATTEMPT TWO: Parent elements

When you're just grabbing one element - a link and the text inside, or a list of headlines - you are only interested in the element you're looking at. Sometimes, though, **you need to scrape multiple elements at the same time.** When this happens, you need to look at what they all have in common.

If we look at a summary, a link and a title, we might find something like the following. **It's a trainwreck, but it's what we want.**

	<div class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex" data-entityid="container-top-stories#3">
		<div class="gs-c-promo-image gs-u-display-none gs-u-display-inline-block@xs gel-1/2@xs gel-1/1@m">
			<div class="gs-o-media-island">
				<div class="gs-o-responsive-image gs-o-responsive-image--16by9"></div>
			</div>
		</div>
		<div class="gs-c-promo-body gel-1/2@xs gel-1/1@m gs-u-mt@m">
			<div>
				<a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">
				<h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Attack on Yemen migrant boat kills 42</h3></a>
				<p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">It is unclear who was behind a helicopter attack which killed 42 refugees and injured 80.</p>
			</div>
			<ul class="gs-o-list-inline gs-o-list-inline--divided gel-brevier gs-u-mt-">
				<li><span class="gs-c-timestamp gs-o-bullet gs-o-bullet- nw-c-timestamp"><span class="gs-o-bullet__icon gel-icon"><svg viewbox="0 0 32 32">
				<polygon points="17,15.4 17,6 15,6 15,16.6 23.8,21.7 24.8,19.9"></polygon>
				<path d="M16,4c6.6,0,12,5.4,12,12c0,6.6-5.4,12-12,12S4,22.6,4,16C4,9.4,9.4,4,16,4 M16,0C7.2,0,0,7.2,0,16c0,8.8,7.2,16,16,16 s16-7.2,16-16C32,7.2,24.8,0,16,0L16,0z"></path></svg></span><time class="gs-o-bullet__text date qa-status-date relative-time" data-datetime="1h" data-seconds="1489768430" data-timestamp-inserted="true" datetime="2017-03-17T16:33:50.000Z">48 minutes ago</time></span></li>
				<li>
					<a aria-label="From Middle East" class="gs-c-section-link gs-c-section-link--truncate nw-c-section-link nw-o-link nw-o-link--no-visited-state" href="/news/world/middle_east"><span aria-hidden="true">Middle East</span></a>
				</li>
			</ul>
		</div>
	</div>

The very top part is the **parent element**; all of the other elements are inside of it. In order to scrape them all together, we need to grab each parent (each *story*) and then grab the parts inside of it (the headline, link, image, etc.).

The parent's class is `class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex"`, which would be terrifying except that we've hit on a theme and suspect `gs-c-promo` might be what we're looking for.

In [14]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print(story.text)

US 'will not repeat' GCHQ wiretap claimsAllegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.18m19 minutes agoUKRelated contentVideoSpicer cites Fox News claimsVideoWiretap saga in two minutesDid Obama wiretap Trump Tower?
George Osborne to edit London newspaperFormer chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.13m13 minutes agoUK Politics Comments
'Huge advance' in fighting biggest killerThe innovative new injection cuts cholesterol to lowest levels ever seen in medicine.4h4 hours agoHealth
Sturgeon up for referendum date talksNicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.3h3 hours agoScotland politics Comments
No punishment for man who raped girl, 12A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.4h4 hours agoEdinburgh, Fife & East Scotland
Dunham hits back after weight criticismActress Lena Dunham, star 

So... kind of?

We apparently can't use `.text` on the whole story because it takes *all* of the text inside: the headline *and* the summary *and* everything else, run together. What we need to do instead is:

* STEP ONE: Use the doc to get the story
* STEP TWO: Use the story to get the headline
* STEP THREE: Use the story to get the link
* STEP FOUR: Use the story to get the summary
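
Here is the `.text` problem in miniature, on a cut-down and made-up version of one story block. It is only a sketch: `get_text()` with a separator makes the output easier to read, but it still gives one string, so searching inside the story is what actually separates the fields.

```python
from bs4 import BeautifulSoup

# Cut-down, made-up version of one story block
html = (
    '<div class="gs-c-promo">'
    '<h3>Attack on Yemen migrant boat kills 42</h3>'
    '<p>It is unclear who was behind the attack.</p>'
    '</div>'
)
story = BeautifulSoup(html, 'html.parser').find('div', {'class': 'gs-c-promo'})

# .text glues every piece of text inside the tag together
print(story.text)
# get_text() can put a separator between the pieces, which reads
# better but is still one string, not separate fields
print(story.get_text(' | ', strip=True))
# Searching INSIDE the story keeps the pieces apart
print(story.find('h3').text)
print(story.find('p').text)
```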

### STEP ONE: Use the doc to get the story

In [15]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("This is a story")

This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story


### STEPS TWO AND THREE: Use the story to get the headline and the link

First we search inside each story for the `h3` to get the headline. Then we can do the same thing to find the link, and use `['href']` to grab the link URL.

In [18]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])

THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
THIS IS A STORY
Guinness for royals on St Patrick's Day
/news/uk-39308969
THIS IS A STORY
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
THIS IS A STORY
Man dies charging iPhone in bath
/news/uk-39307418
THIS IS A STORY
Hungary puts migrants in containers
/news/world-europe-39301003
THIS IS A STOR

### STEP FOUR: Use the story to get the summary

Same thing again! This time we're looking for a `p`.

In [20]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    print(summary.text)

THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
Actress 

AttributeError: 'NoneType' object has no attribute 'text'

### Missing elements

Oh god, an error! If you weren't paying attention, the error is

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-20-e55795264040> in <module>()
          7     print(link['href'])
          8     summary = story.find('p')
    ----> 9     print(summary.text)

    AttributeError: 'NoneType' object has no attribute 'text'

Since it showed up after we added in the `summary` part, I'm going to assume this is an issue because **not every story has a summary**. How do we get around it!!!

Well, just *ask if it has a summary*. If it does, you can use it. If it doesn't, ignore it. **It's just a simple `if` statement**.

In [22]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    if summary:
        print(summary.text)

THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
Actress 
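
The reason the `if` works is that `find` returns `None` when nothing matches. If you end up doing this check a lot, one option is to wrap it in a small function. The sketch below uses a hypothetical helper called `text_or_none` (my name, not something from BeautifulSoup) and a made-up story.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag_name):
    """Return the text of the first matching tag, or None (hypothetical helper)."""
    found = parent.find(tag_name)
    # find() returns None when nothing matches, so check before using .text
    if found is None:
        return None
    return found.text


# Made-up story with a headline but no summary
story = BeautifulSoup('<div><h3>Headline only</h3></div>', 'html.parser').find('div')

print(text_or_none(story, 'h3'))
print(text_or_none(story, 'p'))
```

Writing `is None` is a little more explicit than a bare `if summary:`, but both do the job here.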

## Turning it into a CSV

Now that we have all of our elements, we can turn them into a CSV. There are four steps to building the CSV:

1. **Start with an empty list:** each story we find gets added to the list
2. **Build a dictionary** for each story
3. **Convert the list to a DataFrame**, and then
4. **Export the DataFrame to a CSV**

The dictionary-building part can be complicated, so let's look at **two different ways of doing it**.

### Method One: All at once

For this method, we'll make our `story_dict` all at once, then add it to the `stories_list`.

In [43]:
# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    headline = story.find('h3')
    link = story.find('a')
    summary = story.find('p')
    # Does our story have a summary?
    if summary:
        # Build a dict that HAS a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
            'summary': summary.text
        }
    else:
        # Build a dict that does NOT have a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
        }    
    # Add the dict to our list
    stories_list.append(story_dict)

print(stories_list)

# Now that we're done, convert to a DataFrame and save it as a CSV.
# If you don't use index=False, the CSV gets an ugly extra column of row numbers!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("bbc.csv", index=False)
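
If writing the dictionary out twice bothers you, a more compact variant of Method One is possible with a conditional expression. This is a sketch on a made-up page, not a replacement for the cell above. Note one difference: here every dictionary has a `summary` key, set to `None` when there is no summary. The list is called `rows` so it doesn't clobber `stories_list`.

```python
from bs4 import BeautifulSoup

# Made-up page: the second story has no summary
html = """
<div class="gs-c-promo"><a href="/news/one"><h3>One</h3></a><p>About one</p></div>
<div class="gs-c-promo"><a href="/news/two"><h3>Two</h3></a></div>
"""
snippet = BeautifulSoup(html, 'html.parser')

rows = []
for story in snippet.find_all('div', {'class': 'gs-c-promo'}):
    summary = story.find('p')
    rows.append({
        'headline': story.find('h3').text,
        'url': story.find('a')['href'],
        # Use the summary text if there is one, otherwise store None
        'summary': summary.text if summary else None,
    })

print(rows)
```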



### Method Two: Filling in the blanks

For this method, we'll make our `story_dict` in the beginning, then fill in any pieces that exist.

In [42]:
# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    # Create a dictionary without anything in it
    story_dict = {}
    headline = story.find('h3')
    if headline:
        story_dict['headline'] = headline.text
    link = story.find('a')
    if link:
        story_dict['url'] = link['href']
    summary = story.find('p')
    if summary:
        story_dict['summary'] = summary.text
    # Add the dict to our list
    stories_list.append(story_dict)
    
# Now that we're done, convert to a DataFrame and save it as a CSV.
# If you don't use index=False, the CSV gets an ugly extra column of row numbers!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("bbc.csv", index=False)
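
Two things happen behind the scenes in that last step, and both can be checked on made-up data. The sketch below assumes rows shaped like the ones in `stories_list` and writes to in-memory buffers instead of `bbc.csv` so the result can be looked at directly.

```python
import io
import pandas as pd

# Made-up rows shaped like stories_list; the second has no summary
rows = [
    {'headline': 'One', 'url': '/news/one', 'summary': 'About one'},
    {'headline': 'Two', 'url': '/news/two'},
]
df = pd.DataFrame(rows)

# A key missing from a dict becomes a missing value (NaN) in that row
print(df['summary'].isna().tolist())

# Write to in-memory buffers instead of files so the text can be inspected
without_index = io.StringIO()
df.to_csv(without_index, index=False)
with_index = io.StringIO()
df.to_csv(with_index)

# First line of each CSV: the second has an extra, unnamed index column
print(without_index.getvalue().splitlines()[0])
print(with_index.getvalue().splitlines()[0])
```

So a story without a summary just ends up with an empty cell in the CSV, and `index=False` is what keeps the row numbers out of the file.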