Scraping: http://www.bbc.co.uk/news

Let’s try to scrape the frontpage of BBC News. We’re looking for

  • Headlines
  • Summary
  • Article link

Getting started

We’ll start by importing the necessary libraries.

import requests
from bs4 import BeautifulSoup

Then we download the page and parse it with BeautifulSoup.

response = requests.get('http://www.bbc.co.uk/news')
doc = BeautifulSoup(response.text, 'html.parser')

A lot of people call the parsed page variable soup, but for once in my life I actually go against the popular thing: I like to call it doc, since it helps me remember that it’s the entire document.
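If you want the download step to fail loudly instead of quietly parsing an error page, something like the following might help. This is only a sketch: the helper names make_doc and fetch_doc and the 10-second timeout are made up for illustration, not part of the original walkthrough.

```python
import requests
from bs4 import BeautifulSoup


def make_doc(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, 'html.parser')


def fetch_doc(url, timeout=10):
    """Download a page and return it as a parsed document.

    The timeout value is an arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return make_doc(response.text)


# Usage (needs a network connection):
# doc = fetch_doc('http://www.bbc.co.uk/news')
```

Everything after this point works the same whether doc comes from a helper like this or from the two lines above.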

ATTEMPT ONE: Grabbing the tags directly

When we look at the page and try to use the little arrow-selecty-thing to pick up the headlines, disaster strikes. We can’t touch the headline! Apparently it selects the ENTIRE BLOCK or something crazy like that?

But luckily we understand HTML, so we can click around in the right-hand Elements panel. We navigate to the h3 tag, which we know is the headline based on the tag name and the content.

Hm, what if we just grab all of the h3 tags?

headlines = doc.find_all('h3')

for headline in headlines:
    print(headline.text)
US 'will not repeat' GCHQ wiretap claims
George Osborne to edit London newspaper
'Huge advance' in fighting biggest killer
Sturgeon up for referendum date talks
No punishment for man who raped girl, 12
Dunham hits back after weight criticism
US warns N Korea of military option
US warns N Korea of military option
Guinness for royals on St Patrick's Day
Mother jailed for hiding baby's death
Man dies charging iPhone in bath
Hungary puts migrants in containers
Love Actually cast reunites for Comic Relief
Sizing John wins Cheltenham Gold Cup
Miliband's joke steals Osborne's limelight
BBC News Channel
BBC Radio 4 - PM
Jam? Meet the Michael Jackson traffic cop
Bake Off line-up: A winning recipe?
Weekly quiz: In which soap did Ed Sheeran appear?
What's the etiquette on vaping?
Tall driver told to 'grow up' by judge
The mystery of the murder in the Lucky Holiday Hotel
Britain’s bungled effort to clean up its first big oil spill
Sweden's Got Talent: The pop star with sleep paralysis
George Osborne: From history buff to austerity editor
Reality Check: Is education spending at a record level?
When will a second referendum take place?
Defining moment for US and Germany

Live
Sportsday - Sizing John wins the Cheltenham Gold Cup


Sport
Leicester face Atletico in last eight


Sport
Jeremy Guscott's Six Nations hot steppers


Sport
Man Utd drawn against Anderlecht


Classic hits

SO EASY, right? Kind of? Mostly it worked? …except it doesn’t have the link, nor does it have the summary.

Okay, so we could also get all of the a tags, but there are probably a lot of garbage a tags: footer content and stuff. Maybe the article a tags have a special class? If we take a look, we see class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold". This isn’t just one class, it’s many classes.

  • gs-c-promo-heading
  • nw-o-link-split__anchor
  • gs-o-faux-block-link__overlay-link
  • gel-pica-bold
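BeautifulSoup treats class as a multi-valued attribute, so searching for any one of these names should match a tag that carries several. Here is a minimal sketch on a hand-written snippet (the HTML is made up for illustration, loosely modelled on the BBC markup):

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration, not the real BBC page
html = '''
<a class="gs-c-promo-heading gel-pica-bold" href="/news/one">Story one</a>
<a class="footer-link" href="/about">About</a>
'''
snippet = BeautifulSoup(html, 'html.parser')

# Searching for one class matches tags that carry it among several
matches = snippet.find_all('a', {'class': 'gs-c-promo-heading'})
print([a.text for a in matches])

# The class attribute comes back as a list of separate class names
print(matches[0]['class'])
```

The footer link is skipped because it has none of the class names we asked for.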

This is where guesswork comes in. I think gs-c-promo-heading seems reasonable!

links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)
US 'will not repeat' GCHQ wiretap claims
George Osborne to edit London newspaper
'Huge advance' in fighting biggest killer
Sturgeon up for referendum date talks
No punishment for man who raped girl, 12
Dunham hits back after weight criticism
US warns N Korea of military option
US warns N Korea of military option
VideoGuinness for royals on St Patrick's Day
Mother jailed for hiding baby's death
Man dies charging iPhone in bath
VideoHungary puts migrants in containers
VideoLove Actually cast reunites for Comic Relief
Sizing John wins Cheltenham Gold Cup
Miliband's joke steals Osborne's limelight
VideoBBC News Channel
AudioBBC Radio 4 - PM
VideoJam? Meet the Michael Jackson traffic cop
Bake Off line-up: A winning recipe?
Weekly quiz: In which soap did Ed Sheeran appear?
What's the etiquette on vaping?
Tall driver told to 'grow up' by judge
The mystery of the murder in the Lucky Holiday Hotel
Britain’s bungled effort to clean up its first big oil spill
Sweden's Got Talent: The pop star with sleep paralysis
George Osborne: From history buff to austerity editor
Reality Check: Is education spending at a record level?
When will a second referendum take place?
Defining moment for US and Germany

That looks pretty good, too! It’s getting the h3 text because the h3 is inside of the a tag, but it doesn’t have the actual link, the URL. If we look at the a tag…

<a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">

…the URL is hiding in the href attribute. Once we have the link, getting an attribute is actually easy: you just use ['href'].

links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)
    print(link['href'])
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
George Osborne to edit London newspaper
/news/uk-39304944
'Huge advance' in fighting biggest killer
/news/health-39305640
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
US warns N Korea of military option
/news/world-asia-39297031
US warns N Korea of military option
/news/world-asia-39297031
VideoGuinness for royals on St Patrick's Day
/news/uk-39308969
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
Man dies charging iPhone in bath
/news/uk-39307418
VideoHungary puts migrants in containers
/news/world-europe-39301003
VideoLove Actually cast reunites for Comic Relief
/news/entertainment-arts-39301010
Sizing John wins Cheltenham Gold Cup
/sport/horse-racing/39307278
Miliband's joke steals Osborne's limelight
/news/blogs-trending-39302846
VideoBBC News Channel
/news/10318089
AudioBBC Radio 4 - PM
http://www.bbc.co.uk/iplayer/console/bbc_radio_four
VideoJam? Meet the Michael Jackson traffic cop
/news/world-africa-39290410
Bake Off line-up: A winning recipe?
/news/entertainment-arts-39301921
Weekly quiz: In which soap did Ed Sheeran appear?
/news/magazine-39295274
What's the etiquette on vaping?
/news/uk-39301430
Tall driver told to 'grow up' by judge
/news/uk-england-tyne-39303893
The mystery of the murder in the Lucky Holiday Hotel
/news/magazine-39297987
Britain’s bungled effort to clean up its first big oil spill
/news/uk-england-39223308
Sweden's Got Talent: The pop star with sleep paralysis
/news/entertainment-arts-39293139
George Osborne: From history buff to austerity editor
/news/entertainment-arts-39304904
Reality Check: Is education spending at a record level?
/news/education-39302746
When will a second referendum take place?
/news/uk-scotland-scotland-politics-39306159
Defining moment for US and Germany
/news/world-europe-39254553

Cool, eh? But now we have one final problem: we don’t have the summaries. Well, we can just use the Inspector to pick one out…

<p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">Trade and Nato are high on the agenda as the much-anticipated Washington talks begin.</p>

Once again, we have a selection of options. gs-c-promo-summary seems promising.

summaries = doc.find_all('p', { 'class': 'gs-c-promo-summary' })

for summary in summaries:
    print(summary.text)
Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
Actress Lena Dunham, star and creator of Girls, says her recent weight loss "isn't a triumph".
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
The Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.
Victoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.
A coroner and safety experts have since issued warnings about using electrical appliances in the bathroom.
Camps for asylum seekers are being built out of shipping containers, despite international criticism.
Film-maker Richard Curtis has brought the original stars of Love Actually back together in a special sequel for Comic Relief.
Irish challenger Sizing John wins the Cheltenham Gold Cup for jockey Robbie Power and trainer Jessica Harrington.
The latest breaking and developing stories 
An evening look at the day's events

Great, but now we’re stuck: we don’t have a way of combining the headlines and links with the summaries, and even if we did (cough zip cough), we couldn’t be sure that they’d match up.
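To see why zip can't be trusted here, remember that we got more headlines than summaries. A tiny sketch with made-up data, where the middle story has no summary on the page:

```python
# Made-up data: the second story has no summary on the page
headlines = ['Story A', 'Story B', 'Story C']
summaries = ['Summary of A', 'Summary of C']

# zip pairs items by position and stops at the shorter list
pairs = list(zip(headlines, summaries))
for headline, summary in pairs:
    print(headline, '->', summary)
```

Story B ends up paired with the summary for Story C, and Story C is dropped entirely, with no error to warn us.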

What the heck do we do now?

ATTEMPT TWO: Parent elements

When you’re just grabbing one element - a link and the text inside, or a list of headlines - you are only interested in the element you’re looking at. Sometimes, though, you need to scrape multiple elements at the same time. When this happens, you need to look at what they all have in common.

If we look at a summary, a link and a title, we might find something like the following. It’s a trainwreck, but it’s what we want.

    <div class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex" data-entityid="container-top-stories#3">
            <div class="gs-c-promo-image gs-u-display-none gs-u-display-inline-block@xs gel-1/2@xs gel-1/1@m">
                    <div class="gs-o-media-island">
                            <div class="gs-o-responsive-image gs-o-responsive-image--16by9"><img alt="Yemeni police gather round bodies of Somali refugees (17/03/17)" class="qa-lazyload-image lazyautosizes lazyloaded" data-sizes="auto" data-srcset="https://ichef.bbci.co.uk/live-experience/cps/240/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 240w, https://ichef.bbci.co.uk/live-experience/cps/320/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 320w, https://ichef.bbci.co.uk/live-experience/cps/480/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 480w, https://ichef.bbci.co.uk/live-experience/cps/624/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 624w, https://ichef.bbci.co.uk/live-experience/cps/800/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 800w" data-widths="[240,320,480,624,800]" sizes="177px" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7" srcset="https://ichef.bbci.co.uk/live-experience/cps/240/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 240w, https://ichef.bbci.co.uk/live-experience/cps/320/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 320w, https://ichef.bbci.co.uk/live-experience/cps/480/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 480w, https://ichef.bbci.co.uk/live-experience/cps/624/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 624w, https://ichef.bbci.co.uk/live-experience/cps/800/cpsprodpb/14F5A/production/_95205858_hi038526789.jpg 800w"></div>
                    </div>
            </div>
            <div class="gs-c-promo-body gel-1/2@xs gel-1/1@m gs-u-mt@m">
                    <div>
                            <a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">
                            <h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Attack on Yemen migrant boat kills 42</h3></a>
                            <p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">It is unclear who was behind a helicopter attack which killed 42 refugees and injured 80.</p>
                    </div>
                    <ul class="gs-o-list-inline gs-o-list-inline--divided gel-brevier gs-u-mt-">
                            <li><span class="gs-c-timestamp gs-o-bullet gs-o-bullet- nw-c-timestamp"><span class="gs-o-bullet__icon gel-icon"><svg viewbox="0 0 32 32">
                            <polygon points="17,15.4 17,6 15,6 15,16.6 23.8,21.7 24.8,19.9"></polygon>
                            <path d="M16,4c6.6,0,12,5.4,12,12c0,6.6-5.4,12-12,12S4,22.6,4,16C4,9.4,9.4,4,16,4 M16,0C7.2,0,0,7.2,0,16c0,8.8,7.2,16,16,16 s16-7.2,16-16C32,7.2,24.8,0,16,0L16,0z"></path></svg></span><time class="gs-o-bullet__text date qa-status-date relative-time" data-datetime="1h" data-seconds="1489768430" data-timestamp-inserted="true" datetime="2017-03-17T16:33:50.000Z">48 minutes ago</time></span></li>
                            <li>
                                    <a aria-label="From Middle East" class="gs-c-section-link gs-c-section-link--truncate nw-c-section-link nw-o-link nw-o-link--no-visited-state" href="/news/world/middle_east"><span aria-hidden="true">Middle East</span></a>
                            </li>
                    </ul>
            </div>
    </div>

The very top part is the parent element; all of the other elements are inside of it. In order to scrape them all together, we need to grab each parent (each story) and then grab the parts inside of it (the headline, links, image, etc).

The parent’s class is class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex", which would be terrifying except that we’ve struck onto a theme and suspect gs-c-promo might be what we’re looking for.

stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print(story.text)
US 'will not repeat' GCHQ wiretap claimsAllegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.18m19 minutes agoUKRelated contentVideoSpicer cites Fox News claimsVideoWiretap saga in two minutesDid Obama wiretap Trump Tower?
George Osborne to edit London newspaperFormer chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.13m13 minutes agoUK Politics Comments
'Huge advance' in fighting biggest killerThe innovative new injection cuts cholesterol to lowest levels ever seen in medicine.4h4 hours agoHealth
Sturgeon up for referendum date talksNicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.3h3 hours agoScotland politics Comments
No punishment for man who raped girl, 12A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.4h4 hours agoEdinburgh, Fife & East Scotland
Dunham hits back after weight criticismActress Lena Dunham, star and creator of Girls, says her recent weight loss "isn't a triumph".4h4 hours agoEntertainment & Arts
US warns N Korea of military optionAmerica's top diplomat says the US could strike if Pyongyang's weapons threat rises.1m2 minutes agoAsia Comments
US warns N Korea of military optionAmerica's top diplomat says the US could strike if Pyongyang's weapons threat rises.1m2 minutes agoAsia Comments
VideoVideoGuinness for royals on St Patrick's DayThe Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.2h2 hours agoUK
Mother jailed for hiding baby's deathVictoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.1han hour agoLondon
Man dies charging iPhone in bathA coroner and safety experts have since issued warnings about using electrical appliances in the bathroom.1h43 minutes agoUK
VideoVideoHungary puts migrants in containersCamps for asylum seekers are being built out of shipping containers, despite international criticism.10h10 hours agoEurope
VideoVideoLove Actually cast reunites for Comic ReliefFilm-maker Richard Curtis has brought the original stars of Love Actually back together in a special sequel for Comic Relief.6h6 hours agoEntertainment & Arts
Sizing John wins Cheltenham Gold CupIrish challenger Sizing John wins the Cheltenham Gold Cup for jockey Robbie Power and trainer Jessica Harrington.1ma minute agoBBC Sport
Miliband's joke steals Osborne's limelight
VideoWatch LiveVideoBBC News ChannelThe latest breaking and developing stories 
AudioListen LiveAudioBBC Radio 4 - PMAn evening look at the day's events
VideoVideoJam? Meet the Michael Jackson traffic cop
Bake Off line-up: A winning recipe?
Weekly quiz: In which soap did Ed Sheeran appear?
What's the etiquette on vaping?
Tall driver told to 'grow up' by judge
The mystery of the murder in the Lucky Holiday Hotel
Britain’s bungled effort to clean up its first big oil spill
Sweden's Got Talent: The pop star with sleep paralysis
George Osborne: From history buff to austerity editor
Reality Check: Is education spending at a record level?
When will a second referendum take place?
Defining moment for US and Germany

So… kind of?

We apparently can’t use .text on the whole story because it takes all of the text inside: the headline, the summary and everything else, mashed together. What we need to do instead is

  • STEP ONE: Use the doc to get the story
  • STEP TWO: Use the story to get the headline
  • STEP THREE: Use the story to get the link
  • STEP FOUR: Use the story to get the summary
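Before trying this on the real page, here is how the four steps might look on a tiny hand-written document. The HTML and the class name story are made up for illustration; the real page uses different markup.

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration; the class name 'story' is made up
html = '''
<div class="story"><a href="/news/a"><h3>Story A</h3></a><p>Summary of A</p></div>
<div class="story"><a href="/news/b"><h3>Story B</h3></a><p>Summary of B</p></div>
'''
snippet = BeautifulSoup(html, 'html.parser')

results = []
# STEP ONE: use the document to get each story
for story in snippet.find_all('div', {'class': 'story'}):
    # STEPS TWO to FOUR: search inside this one story only
    headline = story.find('h3').text
    url = story.find('a')['href']
    summary = story.find('p').text
    results.append((headline, url, summary))

print(results)
```

Because every find call is made on story rather than on the whole document, each headline stays with its own link and summary.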

STEP ONE: Use the doc to get the story

stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("This is a story")
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story

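STEP TWO: Use the story to get the headline

Instead of searching the whole doc, we search inside each story. story.find('h3') gives us the headline for just that story.

STEP THREE: Use the story to get the link

Now we can do the same thing to find the link, and then use ['href'] to grab the link URL.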

stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
THIS IS A STORY
Guinness for royals on St Patrick's Day
/news/uk-39308969
THIS IS A STORY
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
THIS IS A STORY
Man dies charging iPhone in bath
/news/uk-39307418
THIS IS A STORY
Hungary puts migrants in containers
/news/world-europe-39301003
THIS IS A STORY
Love Actually cast reunites for Comic Relief
/news/entertainment-arts-39301010
THIS IS A STORY
Sizing John wins Cheltenham Gold Cup
/sport/horse-racing/39307278
THIS IS A STORY
Miliband's joke steals Osborne's limelight
/news/blogs-trending-39302846
THIS IS A STORY
BBC News Channel
/news/10318089
THIS IS A STORY
BBC Radio 4 - PM
http://www.bbc.co.uk/iplayer/console/bbc_radio_four
THIS IS A STORY
Jam? Meet the Michael Jackson traffic cop
/news/world-africa-39290410
THIS IS A STORY
Bake Off line-up: A winning recipe?
/news/entertainment-arts-39301921
THIS IS A STORY
Weekly quiz: In which soap did Ed Sheeran appear?
/news/magazine-39295274
THIS IS A STORY
What's the etiquette on vaping?
/news/uk-39301430
THIS IS A STORY
Tall driver told to 'grow up' by judge
/news/uk-england-tyne-39303893
THIS IS A STORY
The mystery of the murder in the Lucky Holiday Hotel
/news/magazine-39297987
THIS IS A STORY
Britain’s bungled effort to clean up its first big oil spill
/news/uk-england-39223308
THIS IS A STORY
Sweden's Got Talent: The pop star with sleep paralysis
/news/entertainment-arts-39293139
THIS IS A STORY
George Osborne: From history buff to austerity editor
/news/entertainment-arts-39304904
THIS IS A STORY
Reality Check: Is education spending at a record level?
/news/education-39302746
THIS IS A STORY
When will a second referendum take place?
/news/uk-scotland-scotland-politics-39306159
THIS IS A STORY
Defining moment for US and Germany
/news/world-europe-39254553

STEP FOUR: Use the story to get the summary

Same thing again! This time we’re looking for a p.

stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    print(summary.text)
THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
Actress Lena Dunham, star and creator of Girls, says her recent weight loss "isn't a triumph".
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
THIS IS A STORY
Guinness for royals on St Patrick's Day
/news/uk-39308969
The Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.
THIS IS A STORY
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
Victoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.
THIS IS A STORY
Man dies charging iPhone in bath
/news/uk-39307418
A coroner and safety experts have since issued warnings about using electrical appliances in the bathroom.
THIS IS A STORY
Hungary puts migrants in containers
/news/world-europe-39301003
Camps for asylum seekers are being built out of shipping containers, despite international criticism.
THIS IS A STORY
Love Actually cast reunites for Comic Relief
/news/entertainment-arts-39301010
Film-maker Richard Curtis has brought the original stars of Love Actually back together in a special sequel for Comic Relief.
THIS IS A STORY
Sizing John wins Cheltenham Gold Cup
/sport/horse-racing/39307278
Irish challenger Sizing John wins the Cheltenham Gold Cup for jockey Robbie Power and trainer Jessica Harrington.
THIS IS A STORY
Miliband's joke steals Osborne's limelight
/news/blogs-trending-39302846



---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-20-e55795264040> in <module>()
      7     print(link['href'])
      8     summary = story.find('p')
----> 9     print(summary.text)


AttributeError: 'NoneType' object has no attribute 'text'

Missing elements

Oh god, an error! If you weren’t paying attention, the error is

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-e55795264040> in <module>()
      7     print(link['href'])
      8     summary = story.find('p')
----> 9     print(summary.text)

AttributeError: 'NoneType' object has no attribute 'text'

Since it showed up after we added in the summary part, I’m going to assume this is an issue because not every story has a summary. How do we get around it?

Well, just ask if it has a summary. If it does, you can use it. If it doesn’t, ignore it. It’s just a simple if statement.
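The check works because find returns None when nothing matches, and None counts as False in an if. A small sketch on a hand-written snippet (the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

# Two made-up stories: only the first one has a p tag
html = ('<div><h3>Has a summary</h3><p>Here it is</p></div>'
        '<div><h3>No summary</h3></div>')
snippet = BeautifulSoup(html, 'html.parser')

found = []
for story in snippet.find_all('div'):
    summary = story.find('p')
    # find returns None when nothing matches, and None counts as False
    if summary:
        found.append(summary.text)
    else:
        found.append(None)

print(found)
```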

stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    if summary:
        print(summary.text)
THIS IS A STORY
US 'will not repeat' GCHQ wiretap claims
/news/uk-39300191
Allegations that Donald Trump was wiretapped are "nonsense", the UK's intelligence agency says.
THIS IS A STORY
George Osborne to edit London newspaper
/news/uk-39304944
Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.
THIS IS A STORY
'Huge advance' in fighting biggest killer
/news/health-39305640
The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.
THIS IS A STORY
Sturgeon up for referendum date talks
/news/uk-scotland-scotland-politics-39299305
Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.
THIS IS A STORY
No punishment for man who raped girl, 12
/news/uk-scotland-edinburgh-east-fife-39305042
A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.
THIS IS A STORY
Dunham hits back after weight criticism
/news/entertainment-arts-39303458
Actress Lena Dunham, star and creator of Girls, says her recent weight loss "isn't a triumph".
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
THIS IS A STORY
US warns N Korea of military option
/news/world-asia-39297031
America's top diplomat says the US could strike if Pyongyang's weapons threat rises.
THIS IS A STORY
Guinness for royals on St Patrick's Day
/news/uk-39308969
The Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.
THIS IS A STORY
Mother jailed for hiding baby's death
/news/uk-england-london-39305951
Victoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.
THIS IS A STORY
Man dies charging iPhone in bath
/news/uk-39307418
A coroner and safety experts have since issued warnings about using electrical appliances in the bathroom.
THIS IS A STORY
Hungary puts migrants in containers
/news/world-europe-39301003
Camps for asylum seekers are being built out of shipping containers, despite international criticism.
THIS IS A STORY
Love Actually cast reunites for Comic Relief
/news/entertainment-arts-39301010
Film-maker Richard Curtis has brought the original stars of Love Actually back together in a special sequel for Comic Relief.
THIS IS A STORY
Sizing John wins Cheltenham Gold Cup
/sport/horse-racing/39307278
Irish challenger Sizing John wins the Cheltenham Gold Cup for jockey Robbie Power and trainer Jessica Harrington.
THIS IS A STORY
Miliband's joke steals Osborne's limelight
/news/blogs-trending-39302846
THIS IS A STORY
BBC News Channel
/news/10318089
The latest breaking and developing stories 
THIS IS A STORY
BBC Radio 4 - PM
http://www.bbc.co.uk/iplayer/console/bbc_radio_four
An evening look at the day's events
THIS IS A STORY
Jam? Meet the Michael Jackson traffic cop
/news/world-africa-39290410
THIS IS A STORY
Bake Off line-up: A winning recipe?
/news/entertainment-arts-39301921
THIS IS A STORY
Weekly quiz: In which soap did Ed Sheeran appear?
/news/magazine-39295274
THIS IS A STORY
What's the etiquette on vaping?
/news/uk-39301430
THIS IS A STORY
Tall driver told to 'grow up' by judge
/news/uk-england-tyne-39303893
THIS IS A STORY
The mystery of the murder in the Lucky Holiday Hotel
/news/magazine-39297987
THIS IS A STORY
Britain’s bungled effort to clean up its first big oil spill
/news/uk-england-39223308
THIS IS A STORY
Sweden's Got Talent: The pop star with sleep paralysis
/news/entertainment-arts-39293139
THIS IS A STORY
George Osborne: From history buff to austerity editor
/news/entertainment-arts-39304904
THIS IS A STORY
Reality Check: Is education spending at a record level?
/news/education-39302746
THIS IS A STORY
When will a second referendum take place?
/news/uk-scotland-scotland-politics-39306159
THIS IS A STORY
Defining moment for US and Germany
/news/world-europe-39254553
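Most of the URLs printed above are relative (they start with /news/ or /sport/), while the Radio 4 one is already complete. If you need full addresses, urljoin from Python's standard library can resolve both kinds against the page address. A small sketch using two of the hrefs from the output:

```python
from urllib.parse import urljoin

# The page we scraped, used as the base for relative links
base = 'http://www.bbc.co.uk/news'

hrefs = [
    '/news/uk-39300191',                                    # relative
    'http://www.bbc.co.uk/iplayer/console/bbc_radio_four',  # already absolute
]

# urljoin fills in the scheme and host for relative links
# and leaves absolute URLs untouched
full_urls = [urljoin(base, href) for href in hrefs]
for url in full_urls:
    print(url)
```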

Turning it into a CSV

Now that we have all of our elements, we can turn them into a CSV. There are four steps to building the CSV:

  1. Start with an empty list: each story we find gets added to the list
  2. Build a dictionary for each story
  3. Convert the list to a DataFrame
  4. Export the DataFrame to a CSV

The dictionary-building part can be complicated, so let’s look at two different ways of doing it.

Method One: All at once

For this method, we’ll make our story_dict all at once, then add it to the stories_list.

# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    headline = story.find('h3')
    link = story.find('a')
    summary = story.find('p')
    # Does our story have a summary?
    if summary:
        # Build a dict that HAS a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
            'summary': summary.text
        }
    else:
        # Build a dict that does NOT have a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
        }    
    # Add the dict to our list
    stories_list.append(story_dict)

print(stories_list)

# Now that we're done, convert to a DataFrame and save as a CSV.
# If you don't use index=False, the CSV gets an extra column of row numbers!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("../bbc.csv", index=False)
[{'summary': 'Allegations that Donald Trump was wiretapped are "nonsense", the UK\'s intelligence agency says.', 'headline': "US 'will not repeat' GCHQ wiretap claims", 'url': '/news/uk-39300191'},
 {'summary': 'Former chancellor faces calls to stand down as MP after being named as editor of London Evening Standard.', 'headline': 'George Osborne to edit London newspaper', 'url': '/news/uk-39304944'},
 {'summary': 'The innovative new injection cuts cholesterol to lowest levels ever seen in medicine.', 'headline': "'Huge advance' in fighting biggest killer", 'url': '/news/health-39305640'},
 {'summary': 'Nicola Sturgeon wants talks with Theresa May about a referendum date that suits both sides.', 'headline': 'Sturgeon up for referendum date talks', 'url': '/news/uk-scotland-scotland-politics-39299305'},
 {'summary': 'A man who admitted raping a 12-year-old girl in Edinburgh is given an absolute discharge.', 'headline': 'No punishment for man who raped girl, 12', 'url': '/news/uk-scotland-edinburgh-east-fife-39305042'},
 {'summary': 'Actress Lena Dunham, star and creator of Girls, says her recent weight loss "isn\'t a triumph".', 'headline': 'Dunham hits back after weight criticism', 'url': '/news/entertainment-arts-39303458'},
 {'summary': "America's top diplomat says the US could strike if Pyongyang's weapons threat rises.", 'headline': 'US warns N Korea of military option', 'url': '/news/world-asia-39297031'},
 {'summary': "America's top diplomat says the US could strike if Pyongyang's weapons threat rises.", 'headline': 'US warns N Korea of military option', 'url': '/news/world-asia-39297031'},
 {'summary': "The Duke and Duchess of Cambridge drank Guinness at a St Patrick's Day lunch with the Irish Guards.", 'headline': "Guinness for royals on St Patrick's Day", 'url': '/news/uk-39308969'},
 {'summary': 'Victoria Gayle pleaded guilty to preventing the lawful and decent burial of her baby son.', 'headline': "Mother jailed for hiding baby's death", 'url': '/news/uk-england-london-39305951'},
 {'summary': 'A coroner and safety experts have since issued warnings about using electrical appliances in the bathroom.', 'headline': 'Man dies charging iPhone in bath', 'url': '/news/uk-39307418'},
 {'summary': 'Camps for asylum seekers are being built out of shipping containers, despite international criticism.', 'headline': 'Hungary puts migrants in containers', 'url': '/news/world-europe-39301003'},
 {'summary': 'Film-maker Richard Curtis has brought the original stars of Love Actually back together in a special sequel for Comic Relief.', 'headline': 'Love Actually cast reunites for Comic Relief', 'url': '/news/entertainment-arts-39301010'},
 {'summary': 'Irish challenger Sizing John wins the Cheltenham Gold Cup for jockey Robbie Power and trainer Jessica Harrington.', 'headline': 'Sizing John wins Cheltenham Gold Cup', 'url': '/sport/horse-racing/39307278'},
 {'headline': "Miliband's joke steals Osborne's limelight", 'url': '/news/blogs-trending-39302846'},
 {'summary': 'The latest breaking and developing stories ', 'headline': 'BBC News Channel', 'url': '/news/10318089'},
 {'summary': "An evening look at the day's events", 'headline': 'BBC Radio 4 - PM', 'url': 'http://www.bbc.co.uk/iplayer/console/bbc_radio_four'},
 {'headline': 'Jam? Meet the Michael Jackson traffic cop', 'url': '/news/world-africa-39290410'},
 {'headline': 'Bake Off line-up: A winning recipe?', 'url': '/news/entertainment-arts-39301921'},
 {'headline': 'Weekly quiz: In which soap did Ed Sheeran appear?', 'url': '/news/magazine-39295274'},
 {'headline': "What's the etiquette on vaping?", 'url': '/news/uk-39301430'},
 {'headline': "Tall driver told to 'grow up' by judge", 'url': '/news/uk-england-tyne-39303893'},
 {'headline': 'The mystery of the murder in the Lucky Holiday Hotel', 'url': '/news/magazine-39297987'},
 {'headline': 'Britain’s bungled effort to clean up its first big oil spill', 'url': '/news/uk-england-39223308'},
 {'headline': "Sweden's Got Talent: The pop star with sleep paralysis", 'url': '/news/entertainment-arts-39293139'},
 {'headline': 'George Osborne: From history buff to austerity editor', 'url': '/news/entertainment-arts-39304904'},
 {'headline': 'Reality Check: Is education spending at a record level?', 'url': '/news/education-39302746'},
 {'headline': 'When will a second referendum take place?', 'url': '/news/uk-scotland-scotland-politics-39306159'},
 {'headline': 'Defining moment for US and Germany', 'url': '/news/world-europe-39254553'}]

Method Two: Filling in the blanks

For this method, we’ll create an empty story_dict at the start of each loop iteration, then fill in whichever pieces actually exist for that story.

# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    # Create a dictionary without anything in it
    story_dict = {}
    headline = story.find('h3')
    if headline:
        story_dict['headline'] = headline.text
    link = story.find('a')
    # Check for the href too, so an <a> tag without one doesn't raise a KeyError
    if link and link.get('href'):
        story_dict['url'] = link['href']
    summary = story.find('p')
    if summary:
        story_dict['summary'] = summary.text
    # Add the dict to our list
    stories_list.append(story_dict)
    
# Now that we're done, convert to a CSV and save
# If you don't use index=False, you'll get an ugly dataframe!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("../bbc.csv", index=False)
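
It's worth seeing what happens to the blanks we didn't fill in. Here's a minimal sketch using two hand-copied stories from the output above, one with a summary and one without. The names sample_stories and sample_df are made up for illustration and aren't part of the scraper.

```python
import pandas as pd

# Two hand-copied stories: the second has no 'summary' key at all
sample_stories = [
    {'headline': "US 'will not repeat' GCHQ wiretap claims",
     'url': '/news/uk-39300191',
     'summary': 'Allegations that Donald Trump was wiretapped are "nonsense", the UK\'s intelligence agency says.'},
    {'headline': "What's the etiquette on vaping?",
     'url': '/news/uk-39301430'},
]

# pandas builds one column per key it sees and fills any gaps with NaN
sample_df = pd.DataFrame(sample_stories)

# True marks the rows where the summary is missing
print(sample_df['summary'].isna())

# Leaving out the filename makes to_csv return the CSV text instead of saving it
print(sample_df.to_csv(index=False))
```

So a dictionary that's missing a key doesn't break anything: that row just gets an empty cell in the CSV.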