Three ways to scrape websites with ChatGPT, BeautifulSoup and custom LangChain tools

An introduction to LangChain tools and several approaches to parsing information scraped with BeautifulSoup
Author

Jonathan Soma

Published

April 8, 2023

Hi, I’m Soma! You can find me on email at jonathan.soma@gmail.com, on Twitter at @dangerscarf, or maybe even on this newsletter I’ve never sent.

Introducing our old friend BeautifulSoup to our new best pal ChatGPT

This tutorial is two-in-one: how to build custom LangChain tools powered by large language models, along with how to combine a tiny bit of Python scraping with GPT-4’s processing power!

Using everyone’s favorite library LangChain and the classic Python scraping library BeautifulSoup, we’ll look at three use cases:

  1. Extracting one single part of a page to feed ChatGPT information
  2. Converting a section (or sections) of a page into GPT-parseable data without doing much prep work
  3. Saving effort and money by pre-processing pages we’re sending to the LLM for analysis

Along the way you’ll learn how to build custom LangChain tools, including writing proper descriptions for them and returning the “right” kind of data when they’re done doing their work! By the end we’ll have a fully-functioning scraper that can answer natural-language questions about songs featured in TV shows.

Setup

We’ll start by using python-dotenv to set up our API keys to access ChatGPT, along with a handful of LangChain- and scraping-related imports.

%load_ext dotenv
%dotenv
from langchain.agents import initialize_agent, Tool
from langchain.tools import BaseTool
from langchain.chat_models import ChatOpenAI
from langchain.agents import tool
import requests
from bs4 import BeautifulSoup
import json

Now we’ll create our connection to ChatGPT – specifically, GPT-4.

Closer to the end of the tutorial we do some Python/HTML cross-pollination that results in some strange looking content, and I’ve found GPT-4 is a lot better at understanding this HTML than the much-cheaper GPT-3.5-turbo. Even though I hate spending the money, we need that extra performance!

llm = ChatOpenAI(model='gpt-4', temperature=0)

Now let’s start scraping!

Method one: Single-element extraction

Sometimes when you’re scraping it isn’t too hard: the URL is simple, and you’re just trying to grab one thing off of the page. Our first situation is like that: we’ll start on the Tunefind search results page.

The URL is a simple fill-in-the-blanks with https://www.tunefind.com/search/site?q=SHOW_NAME.

Search results page for Grey’s Anatomy

The results page provides a list of shows. We pull out the first match – where class is tf-promo-tile – and use that element to extract the URL to the show as well as the show name.

If we did this manually, it might look like the code below.

# Build the URL
query = "grey's anatomy"
url = f"https://www.tunefind.com/search/site?q={query}"

# Make the request
headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
response = requests.get(url, headers=headers)

# Extract the link
soup = BeautifulSoup(response.text, "html.parser")
link = soup.select_one(".tf-promo-tile")

# Save the URL and name
url = f"https://www.tunefind.com{link['href']}"
name = link['title']
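One caveat worth flagging: a query like grey’s anatomy contains a space and an apostrophe. The raw f-string above generally works, but URL-encoding the query with the standard library is the safer habit:

```python
from urllib.parse import quote_plus

# Build a search URL with the query safely percent-encoded
query = "grey's anatomy"
url = f"https://www.tunefind.com/search/site?q={quote_plus(query)}"

print(url)
# https://www.tunefind.com/search/site?q=grey%27s+anatomy
```

quote_plus turns spaces into + and the apostrophe into %27, which keeps unusual show names from breaking the request.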

That’s nice and fine, but we want to turn this into a LangChain tool that can be used to interact with the outside world. This requires three changes:

  • We change this code into a function
  • We write a description so the LLM can understand how it works
  • We add a return statement to send back the final information

A simple version might look something like the code below. You provide a query to the function and get back a sentence with the name and URL.

@tool
def tunefind_search(query: str) -> str:
    """Searches Tunefind for a given TV show. Required to find the base URL for
    a given show so you can find its seasons, episodes or songs.
    The input to this tool should be the show we're searching for."""

    url = f"https://www.tunefind.com/search/site?q={query}"
    headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")

    link = soup.select_one(".tf-promo-tile")
    url = f"https://www.tunefind.com{link['href']}"
    name = link['title']

    return f"{name} can be found at {url}"

The function gets an @tool added right before it. This decorator is a shortcut that lets the function be used with LangChain! Otherwise you have to jump through a lot of hoops to create a tool.

The description is the text right below the function name, also known as a docstring. It describes what the tool can do (find a TV show) and what it needs to send it to get good results (a TV show name). LangChain sends this description to ChatGPT and says “hey, if you ever need me to run this code to get some data for you, just let me know.”
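There’s no magic in how the description reaches the model: the @tool decorator reads the function’s docstring, which plain Python already stores on the function object. A quick illustration, no LangChain required:

```python
def tunefind_search(query: str) -> str:
    """Searches Tunefind for a given TV show. Required to find the base URL for
    a given show so you can find its seasons, episodes or songs."""
    ...

# Python keeps the docstring on the function object; this is the
# text that gets forwarded to the LLM as the tool's description
print(tunefind_search.__doc__)
```

That’s why a clear, specific docstring matters so much: it’s the only thing ChatGPT sees when deciding whether (and how) to use your tool.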

At the end of the function, we return a sentence describing what was found on the search results page, including both the show name and the URL. In a “normal” Python situation you might return a dictionary with name and url keys, but this is a little different. LangChain tools need to provide something that makes sense to ChatGPT, and writing a sentence is a perfectly valid approach (although we’ll cover dictionaries and JSON later).

To allow LangChain – and us! – to use this new search tool, we’ll create an agent that has access to it. When there’s a question that might be answered by our new tool (based on the tool description), the agent will run off and try to use it.

# Create a tool-using agent
tools = [tunefind_search]
agent = initialize_agent(tools, llm, agent="chat-zero-shot-react-description", verbose=True)

Using this setup is as simple as agent.run with our question. Between the knowledge that GPT already has and its ability to use the tool, it will try to answer our question!

# Get the results
result = agent.run("What's the Tunefind URL for Grey's Anatomy?")

print(result)


> Entering new AgentExecutor chain...
Thought: I need to find the base Tunefind URL for Grey's Anatomy.
Action:
```
{
  "action": "tunefind_search",
  "action_input": "Grey's Anatomy"
}
```
Observation: Grey's Anatomy can be found at https://www.tunefind.com/show/greys-anatomy
Thought:I now know the final answer
Final Answer: The Tunefind URL for Grey's Anatomy is https://www.tunefind.com/show/greys-anatomy

> Finished chain.
The Tunefind URL for Grey's Anatomy is https://www.tunefind.com/show/greys-anatomy

Perfect! When given access to the tool GPT now sees the sentence "Grey's Anatomy can be found at https://www.tunefind.com/show/greys-anatomy" which allows it to determine the show’s URL.

Method two: Searching parts of the page

Sometimes the page you want to scrape is a little more complicated. You don’t just want a tiny piece of content of the page, but rather a specific portion of the page or several separate elements.

Now that we know how to find the Grey’s Anatomy page, let’s scrape all of the episodes from Grey’s Anatomy Season 1. If we visit the Season 1 page we’re presented with a ton of links, one for each episode:

Season 1 page for Grey’s Anatomy

If we want GPT to have access to a list of the episodes and their URLs, we have a few options. Let’s work through them one by one!

Write a full BeautifulSoup scraper, return a list of dictionaries

We could absolutely use traditional scraping to extract the information on the page. We’d search for the results, loop through them one by one, and build a list of dictionaries with all of the necessary data.

Unfortunately, we’re too lazy for that! We want GPT to do the work for us. We aren’t doing this.
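For the curious, that traditional scraper might look something like the sketch below: a hypothetical parse_episodes function that uses the same EpisodeListItem_title class we lean on later, shown here running on a one-episode sample of Tunefind-shaped HTML.

```python
from bs4 import BeautifulSoup

def parse_episodes(html):
    """Parse episode names and links out of a Tunefind season page."""
    soup = BeautifulSoup(html, "html.parser")
    episodes = []
    for heading in soup.select("[class*='EpisodeListItem_title']"):
        link = heading.select_one("a")
        episodes.append({'name': link.text, 'url': link['href']})
    return episodes

# A tiny sample in the same shape as Tunefind's markup
sample = """<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/238">S1 · E1 · A Hard Day's Night</a></h3>"""
print(parse_episodes(sample))
```

It isn’t much code, but multiply it by every page shape you want to scrape and the laziness starts to look appealing.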

Send the whole page to GPT and let it figure things out

One common approach when scraping with ChatGPT is sending all of the content on the page to ChatGPT, and asking it to find the information you’re interested in. Sending the whole page to GPT along with the question of “find the links” has two downsides:

First, the page might be too long for GPT to handle. When you send data to GPT along with a question, it can only handle so much text! Web pages often have too much content for GPT to be able to process them all at once.

Second, all of that unnecessary information drives up our OpenAI bill! We’re being charged on how much text we send and receive, so if we can pare things down we can save a good amount of money.

We aren’t doing this, either.

Near the end of my documenting undocumented APIs writeup I do a cost comparison of two approaches. Doing a little extra work can save you a ton of money when working with GPT!
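To make the cost argument concrete, a rough rule of thumb is that one token is about four characters of English text, so you can ballpark a request before sending it. The sketch below uses that heuristic with an illustrative price (the estimate_cost helper and the $0.03-per-1K-tokens figure are assumptions for the example, not OpenAI’s actual pricing):

```python
def estimate_cost(text, price_per_1k_tokens=0.03):
    """Very rough cost estimate: ~4 characters per token for English text."""
    est_tokens = len(text) / 4
    return est_tokens / 1000 * price_per_1k_tokens

whole_page = "x" * 40_000   # a big page: roughly 10,000 tokens
snippet = "x" * 2_000       # a carved-out portion: roughly 500 tokens

print(f"whole page: ${estimate_cost(whole_page):.2f}")
print(f"snippet:    ${estimate_cost(snippet):.2f}")
```

Twenty times less text means (roughly) twenty times less money, every single time the tool runs.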

Carve out the portions of the page we’re interested in

My preferred approach is a third option that combines the best of the first two: we use BeautifulSoup to grab only the sections of the page that we’re interested in, but we don’t process them. We just turn them into strings of HTML and send them on over to GPT along with our question.

By taking this approach, we save time and avoid the more confusing parts of scraping: no loops, nothing nested, no try/except blocks. Just “hey, take this data!” Before it goes into a LangChain tool, the code might look like the code below.

# Prepare the page to be used with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find all of the elements that contain 'EpisodeListItem_title' in their class
elements = soup.select("[class*='EpisodeListItem_title']")

# Convert them into strings
str(elements)
'[<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/238">S1 · E1 · A Hard Day\'s Night</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/239">S1 · E2 · The First Cut is the Deepest</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/240">S1 · E3 · Winning a Battle, Losing the War</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/255">S1 · E4 · No Man\'s Land</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/256">S1 · E5 · Shake Your Groove Thing</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/280">S1 · E6 · If Tomorrow Never Comes</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/283">S1 · E7 · The Self-Destruct Button</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/378">S1 · E8 · Save Me</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/381">S1 · E9 · Who\'s Zoomin\' Who?</a></h3>]'

Check out the last line: when you use str(...) with a BeautifulSoup object it returns the HTML representation of the objects. But in this case we didn’t have one object, we had a list of them. The result is a horrifying mishmash of HTML code for the page elements – <h3 class="..."><a href="..."> – and Python code for the list – [A, B, C].

It doesn’t matter how we feel about it, though: ChatGPT needs a string, and it’s able to understand this!

Remember how I said we’re using GPT-4 because GPT-3.5-turbo can’t always understand what we’ve scraped? This is where it happens! If we spent a little more time cleaning up the data and formatting it nicely, maybe we could convert it into something like below:

* Season 1, Episode 1: "A Hard Day's Night" - /show/greys-anatomy/season-1/238
* Season 1, Episode 2: "The First Cut is the Deepest" - /show/greys-anatomy/season-1/239
* Season 1, Episode 3: "Winning a Battle, Losing the War" - /show/greys-anatomy/season-1/240
...

I’m sure this would enable us to use the less-expensive GPT-3.5-turbo, but we’re trying to be lazy. str(list_of_things) is about as little pre-processing as you can get, and we’re sticking with it.
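That said, if you ever want the tidier format, it only takes a few lines. Here’s a hedged sketch that reshapes the stringified mishmash into those bullet lines with a regex, assuming the headings keep Tunefind’s S1 · E1 · Title pattern:

```python
import re

# A one-item version of the stringified mishmash from above
mishmash = '''[<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/238">S1 · E1 · A Hard Day's Night</a></h3>]'''

# Pull the href and the "SN · EN · Title" pieces out of each link
pattern = r'href="([^"]+)">S(\d+) · E(\d+) · ([^<]+)<'
lines = [
    f'* Season {s}, Episode {e}: "{title}" - {href}'
    for href, s, e, title in re.findall(pattern, mishmash)
]
print("\n".join(lines))
```

A handful of extra lines buys you a much cheaper model. Worth it for a big project; overkill for today.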

Now let’s turn this code into a LangChain tool. It’s the same process as last time, adding @tool, making it a function, adding a description. Here’s how we’ll describe it:

Queries Tunefind for the episodes from a season, given the URL for that season.

The input to this tool should be a URL to that season.

Season URLs are formed by taking the show’s Tunefind URL and adding /season-NUM after it.

For example, if the show A Million Little Things is at https://www.tunefind.com/show/a-million-little-things/

Season 3 of A Million Little Things could be found at https://www.tunefind.com/show/a-million-little-things/season-3

An important thing to note is that unlike last time where we just wanted the show name, this time we’re asking to be provided the complete URL to the season page. Our last tool found the show at https://www.tunefind.com/show/greys-anatomy, and we’re hoping ChatGPT is smart enough to add /season-1 if it knows we’re looking for season 1 (don’t worry, it is!).
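For reference, the URL arithmetic we’re trusting ChatGPT to do is simple enough to write down ourselves. A tiny sketch (this season_url helper is just an illustration, not one of our tools):

```python
def season_url(show_url, season):
    """Build a Tunefind season URL from a show URL."""
    # rstrip handles show URLs that come with a trailing slash
    return f"{show_url.rstrip('/')}/season-{season}"

print(season_url("https://www.tunefind.com/show/greys-anatomy", 1))
# https://www.tunefind.com/show/greys-anatomy/season-1
```

It’s a nice sanity check: if GPT ever botches the URL, you know exactly what it was supposed to produce.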

@tool
def get_shows_from_season(url: str) -> str:
    """Queries Tunefind for the episode list from a season, given the URL for that season.
    The input to this tool should be a URL to that season.

    Season URLs are formed by taking the show's Tunefind URL and adding /season-NUM after it.
    For example, if a show's URL is https://www.tunefind.com/show/a-million-little-things/
    you can find episode links for season 3 at https://www.tunefind.com/show/a-million-little-things/season-3
    """
    
    headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")
    elements = soup.select("[class*='EpisodeListItem_title']")

    return str(elements)

Now that we’ve converted it to a tool, let’s use it.

It’s the same process as before, but this time our agent has access to two tools:

  • the tool that can find the show page, and
  • the tool that can pull the episode list from a season page.
# Create a tool-using agent
tools = [tunefind_search, get_shows_from_season]
agent = initialize_agent(tools, llm, agent="chat-zero-shot-react-description", verbose=True)
result = agent.run("What is the name and URL for episode 8 season 3 for Grey's Anatomy?")

print(result)


> Entering new AgentExecutor chain...
Thought: First, I need to find the base URL for Grey's Anatomy on Tunefind.
Action:
```
{
  "action": "tunefind_search",
  "action_input": "Grey's Anatomy"
}
```
Observation: Grey's Anatomy can be found at https://www.tunefind.com/show/greys-anatomy
Thought:Now that I have the base URL for Grey's Anatomy, I can find the episodes for season 3.
Action:
```
{
  "action": "get_shows_from_season",
  "action_input": "https://www.tunefind.com/show/greys-anatomy/season-3"
}
```
Observation: [<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2019">S3 · E1 · Time Has Come Today</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2046">S3 · E2 · I Am a Tree</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2047">S3 · E3 · Sometimes a Fantasy</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2048">S3 · E4 · What I Am</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2049">S3 · E5 · Oh, The Guilt</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2050">S3 · E6 · Let The Angels Commit</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2051">S3 · E7 · Where the Boys Are</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2119">S3 · E8 · Staring at the Sun</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2120">S3 · E9 · From a Whisper to a Scream</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2190">S3 · E10 · Don't Stand So Close to Me</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2191">S3 · E11 · Six Days (Part 1)</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2252">S3 · E12 · Six Days (Part 2)</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2216">S3 · E13 · Great Expectations</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2306">S3 · E14 · Wishin' and Hopin'</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2253">S3 · E15 · Walk on Water</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2307">S3 · E16 · Drowning on Dry Land</a></h3>, 
<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2369">S3 · E17 · Some Kind of Miracle</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2370">S3 · E18 · Scars and Souvenirs</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2371">S3 · E19 · My Favorite Mistake</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2476">S3 · E20 · Time After Time</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2477">S3 · E21 · Desire</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2554">S3 · E22 · The Other Side of This Life</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2555">S3 · E23 · Testing 1-2-3</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-3/2556">S3 · E24 · Didn't We Almost Have It All</a></h3>]
Thought:I now know the name and URL for episode 8 of season 3 for Grey's Anatomy.
Final Answer: The name of episode 8 for season 3 of Grey's Anatomy is "Staring at the Sun" and the URL is https://www.tunefind.com/show/greys-anatomy/season-3/2119

> Finished chain.
The name of episode 8 for season 3 of Grey's Anatomy is "Staring at the Sun" and the URL is https://www.tunefind.com/show/greys-anatomy/season-3/2119

Success again! Note the separate steps in the Observation/Thought/Action chain:

  1. First LangChain searches for the show using the tunefind_search tool, acquiring the show URL.
  2. Then it adds /season-3 to the show URL so it can use our new get_shows_from_season tool. This gives it all of the (ugly) information about the episodes.
  3. This information is parsed by ChatGPT along with our question to get the answer!

Note that along with understanding that giant mishmash of HTML, GPT also turned /show/greys-anatomy/season-3/2119 into https://www.tunefind.com/show/greys-anatomy/season-3/2119. So polite!

Method three: Convert your HTML

Let’s say things are even more complicated. Here’s a look at the page of songs featured in Season 1, Episode 3 of Grey’s Anatomy:

Songs from season 1 episode 3

There’s just so much there! Long class names, a million tags, just a ton of unnecessary information. Unlike last time, we’re going to take a few more steps to clean it up.

Here’s the thing: I don’t want to send all that to ChatGPT. While it’s easy to be lazy and have ChatGPT do your data cleaning, relying on ChatGPT to parse unnecessary data incurs direct financial consequences. If we can reduce the amount of text we send to GPT we get charged less, and that is sure to make our wallet very very happy!

In this case we’re going to write a whole scraper for the page! When given a URL, we’ll give back a list of nicely-formatted artists and song titles.

# Get the page
url = "https://www.tunefind.com/show/greys-anatomy/season-1/240"
headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
response = requests.get(url, headers=headers)

# Scrape the content
soup = BeautifulSoup(response.text, "html.parser")
titles = soup.select("[class^='SongTitle']")
artists = soup.select("[class^='ArtistSub']")
results = [{'artist': a.text, 'title': t.text} for t, a in zip(titles, artists)]

# Turn into a JSON string
json.dumps(results)
'[{"artist": "The Ditty Bops", "title": "There\'s a Girl"}, {"artist": "Tegan and Sara", "title": "There\'s a Girl"}, {"artist": "The Ditty Bops", "title": "I Won\'t Be Left"}, {"artist": "stuart reid", "title": "I Won\'t Be Left"}, {"artist": "Reindeer Section", "title": "Wishful Thinking"}, {"artist": "Lisa Loeb", "title": "Wishful Thinking"}, {"artist": "Psapp", "title": "Hear You Breathing"}, {"artist": "Rilo Kiley", "title": "Hear You Breathing"}, {"artist": "Keane", "title": "You Are My Joy"}, {"artist": "Interpol", "title": "You Are My Joy"}, {"artist": "Psapp", "title": "Fools Like Me"}, {"artist": "Tegan and Sara", "title": "Fools Like Me"}]'

A lot nicer than what we saw in the last section, right? While both work fine with GPT-4, this one took a little more effort but costs noticeably less.

One important thing to note is that we’re using json.dumps, which converts the Python object – a list of dictionaries that include artists and titles – into the string representation. We need to do this because every LangChain tool must return a string: that’s what language models understand, so that’s what we send. It looks exactly the same, but without the json.dumps it just won’t work.

Why json.dumps(results) instead of just str(results)? The more complicated your data gets, the more likely json.dumps will be a better choice as compared to str. They do basically the same thing, but relying on the official JSON format is just a liiiiittle bit fancier. Feel free to try both!
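To see the difference concretely: str gives you Python’s repr with single quotes, which isn’t valid JSON, while json.dumps gives double-quoted output that any JSON parser (or careful language model) can read back:

```python
import json

results = [{'artist': 'Lisa Loeb', 'title': 'Wishful Thinking'}]

# str() produces Python repr with single quotes - not valid JSON
print(str(results))
# json.dumps() produces double-quoted, standards-compliant JSON
print(json.dumps(results))

# json.dumps round-trips cleanly through a real JSON parser
assert json.loads(json.dumps(results)) == results
```

The moment your data contains quotes, apostrophes, or nested structures, that “liiiiittle bit fancier” format starts earning its keep.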

Now let’s convert our code into a LangChain tool.

@tool
def get_songs_from_episode(url: str) -> str:
    """Queries Tunefind for the songs for the specific episode of a show.
    The input to this tool should be the URL to an episode.
    The URL will look like https://www.tunefind.com/show/greys-anatomy/season-6/4120
    You must visit the season page to obtain an episode URL
    """

    headers = { 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0' }
    response = requests.get(url, headers=headers)

    soup = BeautifulSoup(response.text, "html.parser")

    titles = soup.select("[class^='SongTitle']")
    artists = soup.select("[class^='ArtistSub']")
    results = [{'artist': a.text, 'title': t.text} for t, a in zip(titles, artists)]

    return json.dumps(results)

Notice the grumpy demand in our tool description: You must visit the season page to obtain an episode URL. Just like ChatGPT likes to hallucinate journalism stories and academic papers, it loves to think it knows an episode URL just by guessing a number after the season. That sentence is enough to keep it in line.

Finally, we’ll string all three tools together. The show search, the episode lister, and the song lister.

# Build the agent using all three tools
tools = [tunefind_search, get_shows_from_season, get_songs_from_episode]
agent = initialize_agent(tools, llm, agent="chat-zero-shot-react-description", verbose=True)

And now, the moment of truth!

result = agent.run("What was the song by Lisa Loeb on Grey's Anatomy season 1 episode 3?")
print(result)


> Entering new AgentExecutor chain...
Thought: I need to find the base URL for Grey's Anatomy on Tunefind.
Action:
```
{
  "action": "tunefind_search",
  "action_input": "Grey's Anatomy"
}
```
Observation: Grey's Anatomy can be found at https://www.tunefind.com/show/greys-anatomy
Thought:Now I need to find the episodes for season 1.
Action:
```
{
  "action": "get_shows_from_season",
  "action_input": "https://www.tunefind.com/show/greys-anatomy/season-1"
}
```
Observation: [<h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/238">S1 · E1 · A Hard Day's Night</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/239">S1 · E2 · The First Cut is the Deepest</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/240">S1 · E3 · Winning a Battle, Losing the War</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/255">S1 · E4 · No Man's Land</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/256">S1 · E5 · Shake Your Groove Thing</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/280">S1 · E6 · If Tomorrow Never Comes</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/283">S1 · E7 · The Self-Destruct Button</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/378">S1 · E8 · Save Me</a></h3>, <h3 class="EpisodeListItem_title__PkSzj"><a href="/show/greys-anatomy/season-1/381">S1 · E9 · Who's Zoomin' Who?</a></h3>]
Thought:I found the episode URL for season 1 episode 3, now I need to get the songs from that episode.
Action:
```
{
  "action": "get_songs_from_episode",
  "action_input": "https://www.tunefind.com/show/greys-anatomy/season-1/240"
}
```

Observation: [{"artist": "The Ditty Bops", "title": "There's a Girl"}, {"artist": "Tegan and Sara", "title": "There's a Girl"}, {"artist": "The Ditty Bops", "title": "I Won't Be Left"}, {"artist": "stuart reid", "title": "I Won't Be Left"}, {"artist": "Reindeer Section", "title": "Wishful Thinking"}, {"artist": "Lisa Loeb", "title": "Wishful Thinking"}, {"artist": "Psapp", "title": "Hear You Breathing"}, {"artist": "Rilo Kiley", "title": "Hear You Breathing"}, {"artist": "Keane", "title": "You Are My Joy"}, {"artist": "Interpol", "title": "You Are My Joy"}, {"artist": "Psapp", "title": "Fools Like Me"}, {"artist": "Tegan and Sara", "title": "Fools Like Me"}]
Thought:I found the song by Lisa Loeb in Grey's Anatomy season 1 episode 3.
Final Answer: The song by Lisa Loeb in Grey's Anatomy season 1 episode 3 is "Wishful Thinking".

> Finished chain.
The song by Lisa Loeb in Grey's Anatomy season 1 episode 3 is "Wishful Thinking".

Perfect!

Final thoughts

LangChain tools are great! While there are certainly other approaches to what we did above – external requests plugins, for example – combining a small amount of manual scraping with GPT’s ability to handle the details is a good balance of convenient and cost-effective. Instead of sending all of the HTML (too big, too expensive) or just sending the text (too unpredictable, also potentially too large), carving out the bits you’re actually interested in can do a lot for a tiny project.

If you’re interested in learning more about comparing costs between different approaches to LangChain and ChatGPT, check out the final section of my documenting undocumented APIs writeup.

If you have data with a little less structure than HTML, you might think about combining this with something like kor or guardrails. They both enable you to build schemas for the returned data, so instead of getting a response in whatever format ChatGPT feels like providing at that particular moment, you’ll be able to rely on a consistent data format.