Getting started
First, you need to install the right tools.
pip install beautifulsoup4
pip install requests
pip install lxml
These are the libraries we will use for scraping. Create a new Python file and import them at the top of the file.
from bs4 import BeautifulSoup
import requests
Fetch with Requests
The Requests library will be used to fetch the pages. To make a GET request, you simply use the get method.
result = requests.get("https://www.google.com")
You can get a lot of information from the response object.
# Get the status code
result.status_code
# Get the headers
result.headers
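As a minimal sketch, you could check the status code before moving on (the Content-Type header is just one example of what the headers may contain):
# Only continue if the request succeeded (status code 200)
if result.status_code == 200:
    # Headers behave like a dictionary; Content-Type is one common entry
    print("Content type:", result.headers.get("Content-Type"))
else:
    print("Request failed with status", result.status_code)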
To be able to scrape the page, you need to use the Beautiful Soup library. Save the response content and turn it into a soup object.
# Save the content
content = result.content
# Create soup
soup = BeautifulSoup(content, features="lxml")
You can see the HTML in a readable format with the prettify method.
print(soup.prettify())
Scrape with Beautiful Soup
Now to the actual scraping: getting the data out of the HTML code.
Using CSS Selector
The easiest way is probably to use a CSS selector, which can be copied from Chrome's developer tools.
Here, I selected the first Google result, inspected the HTML, right-clicked the element, selected Copy, and chose the Copy selector alternative.
samples = soup.select("div > div:nth-child(4) > div:nth-child(4)")
The select method will, however, return a list. If you only want a single element, you can use the select_one method instead.
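A minimal sketch, reusing the selector from above (select_one returns None if nothing matches):
# select_one returns the first matching element instead of a list
sample = soup.select_one("div > div:nth-child(4) > div:nth-child(4)")
if sample is not None:
    print(sample.get_text())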
Using Tags
You can also scrape by tags (a, h1, p, div) with the following syntax.
# All a elements
samples = soup.select("a")
# Match a elements nested anywhere inside head inside html (descendant selector)
samples = soup.select("html head a")
# Match only a elements that are direct children of head (child selector)
samples = soup.select("html > head > a")
It is also possible to use the id or class attribute to scrape the HTML.
sample_id = soup.select("#id")
sample_class = soup.select(".class")
Using find_all
Another method you can use is find_all. It returns all elements that match.
# Return all elements with an a tag
samples = soup.find_all("a")
# Return all elements with a specific ID
samples = soup.find_all(id="specific_id")
# Return all a elements with a specific CSS class
samples = soup.find_all("a", "specific_css_class")
# Same as above, more specific
samples = soup.find_all("a", class_="specific_css_class")
# Search for any attribute within an a tag
samples = soup.find_all("a", attrs={"class": "specific_css_class"})
You can also use the find method, which will return a single element instead of a list.
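A minimal sketch (the id value is just the hypothetical placeholder used above):
# find returns the first match, or None if nothing matches
first_link = soup.find("a")
specific_element = soup.find(id="specific_id")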
Get the values
The most important part of scraping is getting the actual values (or text) from the elements. Take this element as an example.
<h3 class="LC20lb DKV0Md" href="https://someurl.com/">Beautiful Soup</h3>
Get the inner text (the actual text printed on the page) with this method.
sample = element.get_text()
If you want to get a specific attribute of an element, like the href, use this syntax:
sample = element.get("href")
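Putting it all together, here is a minimal end-to-end sketch (the URL is a placeholder, and it assumes the page contains ordinary links):
import requests
from bs4 import BeautifulSoup

# Fetch the page and build the soup (placeholder URL)
result = requests.get("https://example.com/")
soup = BeautifulSoup(result.content, features="lxml")

# Print the text and href of every link on the page
for link in soup.select("a"):
    print(link.get_text(), link.get("href"))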