Web scraping in Python

Web scraping is the process of automatically extracting data from the internet. The gathered data can be used for various purposes, such as analysis, research, machine learning projects, or populating databases.

Before scraping a website, you have to understand its structure so that you know where the relevant information is located.

In this course, we will learn about the website structure required to extract essential information. We will also learn about the Python requests module and the BeautifulSoup library, and see how to use them together to extract relevant information from a website and store it in a text file.

The website we will be scraping is a movie site.

Inspecting the site.

A lot of information is contained in the URL of the website. The URL we will scrape is https://www.timeout.com/film/best-movies-of-all-time. Its domain is timeout.com, and the path gives us a good idea of what the page contains: we navigate into the film section, then into the best-movies-of-all-time page.
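To make the parts of the URL concrete, here is a minimal sketch that splits it with Python's standard urllib.parse module. The variable names (parts, segments) are only illustrative and are not part of the scraper itself.

```python
from urllib.parse import urlparse

url = "https://www.timeout.com/film/best-movies-of-all-time"
parts = urlparse(url)

# Split the path into its non-empty segments.
segments = [segment for segment in parts.path.split("/") if segment]

print(parts.scheme)   # https
print(parts.netloc)   # www.timeout.com
print(segments)       # ['film', 'best-movies-of-all-time']
```

Each path segment corresponds to one level of the site's navigation, which is why reading the URL is a useful first step before opening the developer tools.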

Now, let's inspect the website using the developer tools built directly into the Google Chrome browser. The developer tools let us view the HTML of the page.

In the Chrome browser, you can navigate to the dev tools as follows:

More tools > Developer tools > Elements.

The Elements panel is where you will see the HTML that makes up the page.

Now, let's use the requests module and BeautifulSoup library to retrieve the titles and years of the best movies of all time.

Requests Module.

The requests library is used to make an HTTP request to a specific URL and return the response. It makes sending HTTP requests extremely easy. The requests module supports HTTP verbs such as GET, POST, PUT, and PATCH.
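As a rough illustration of how the verbs differ, the sketch below builds a GET and a POST request with requests but does not send either of them, so it runs without touching the network (it assumes requests is already installed). The POST payload is a made-up example; nothing here implies that the site accepts POST requests.

```python
import requests

url = "https://www.timeout.com/film/best-movies-of-all-time"

# Build the requests without sending them, to see what each verb would transmit.
get_request = requests.Request("GET", url).prepare()
# Hypothetical form data, used only to show that POST carries a body.
post_request = requests.Request("POST", url, data={"example": "value"}).prepare()

print(get_request.method, get_request.url)
print(post_request.method, post_request.body)
```

For scraping, GET is the verb that matters: it asks the server for a page without changing anything on it.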

To install the requests module, run the following command in the terminal.

pip install requests

BeautifulSoup Library:

BeautifulSoup is a Python library used for extracting information from HTML and XML files. To install it, run the following command on the command line.

pip install beautifulsoup4
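After installing, it can be handy to confirm that both packages are visible to the Python interpreter you are using. The sketch below is one way to do that with the standard importlib.metadata module (Python 3.8 or later); the helper name installed_version is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Use the names given to pip, not the import names (bs4 is installed as beautifulsoup4).
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name))
```

If either line prints None, the package was probably installed into a different Python environment than the one running the script.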

The URL of the site we want to scrape is https://www.timeout.com/film/best-movies-of-all-time. We will scrape the title and year of the best 100 all-time movies. 

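from bs4 import BeautifulSoup
import requests

# Download the page and keep its HTML as a string.
response = requests.get("https://www.timeout.com/film/best-movies-of-all-time")
web_page = response.text

# Parse the HTML and collect every h3 heading that carries the title class.
soup = BeautifulSoup(web_page, "html.parser")
all_100_movies = soup.find_all("h3", class_="_h3_cuogz_1")
# Remove non-breaking spaces (\xa0) from the text of each heading.
movies_title = [movie.getText().replace('\xa0', '') for movie in all_100_movies]
# Remove the last matching heading from the list.
del movies_title[-1]

# Write one title per line; UTF-8 keeps accented characters intact.
with open("movies_title.txt", mode='w', encoding="utf-8") as file:
    for i in movies_title:
        file.write(f"{i}\n")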

The first two lines of code import BeautifulSoup and the requests module.

The requests.get() call sends a GET request to the URL, and the HTML of the response is stored as text in the variable web_page.
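The download step can also be wrapped in a small function that fails loudly when the server returns an error instead of a page. This is only a sketch: the function name fetch_html, the optional session parameter, and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP error statuses."""
    # Accept an optional requests.Session; fall back to the module-level API.
    client = session or requests
    response = client.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# web_page = fetch_html("https://www.timeout.com/film/best-movies-of-all-time")
```

Checking the status before parsing avoids silently scraping an error page, and the timeout keeps the script from hanging if the server never answers.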

Then a BeautifulSoup object is created to parse the HTML document, using Python's built-in "html.parser".
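To see what parsing gives us without downloading anything, here is a minimal sketch that parses a small HTML string. The sample_html content is invented for this example and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body><h3>First heading</h3><p>Some text.</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title.get_text()
first_heading = soup.find("h3").get_text()

print(page_title)      # Sample page
print(first_heading)   # First heading
```

Once the document is parsed, tags can be looked up by name, which is what makes the next step possible.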

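We have to find the HTML tag that holds the titles of the 100 movies. Inspecting a title in the Elements panel shows that the tag is h3 and its class is "_h3_cuogz_1". A class name like this looks auto-generated, so it may change when the site is updated.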
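The sketch below shows how selecting by tag and class behaves on a small, made-up HTML fragment. The placeholder titles are invented; only the tag name and class come from the script above. It also shows an equivalent CSS selector via select(), as an alternative to find_all().

```python
from bs4 import BeautifulSoup

# Invented fragment: two headings with the title class and one without it.
sample_html = """
<h3 class="_h3_cuogz_1">Placeholder Movie A (1999)</h3>
<h3 class="_h3_cuogz_1">Placeholder Movie B (2005)</h3>
<h3 class="other">Not a movie title</h3>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because "class" is a reserved word in Python.
headings = soup.find_all("h3", class_="_h3_cuogz_1")
titles = [heading.get_text(strip=True) for heading in headings]

# The same selection written as a CSS selector.
same_titles = [h.get_text(strip=True) for h in soup.select("h3._h3_cuogz_1")]

print(titles)
```

Only the two headings carrying the class are returned; the third h3 is ignored because its class does not match.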

Finally, we get the titles of the movies and save them in a text file.
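The saving step can be checked by writing a few titles and reading them back. This is a sketch under stated assumptions: save_titles and load_titles are hypothetical helper names, the titles are placeholders, and a temporary folder is used so that no real file is left behind.

```python
import tempfile
from pathlib import Path


def save_titles(titles, path):
    """Write one title per line, using UTF-8 so non-ASCII characters survive."""
    with open(path, mode="w", encoding="utf-8") as file:
        for title in titles:
            file.write(f"{title}\n")


def load_titles(path):
    """Read the titles back as a list of strings, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


# Round trip inside a temporary folder with placeholder data.
with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "movies_title.txt"
    save_titles(["Placeholder Movie A (1999)", "Placeholder Movie B (2005)"], target)
    restored = load_titles(target)

print(restored)
```

Reading the file back is a quick way to confirm that every title landed on its own line.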

The link to the code is on GitHub. 

Conclusion:

Web scraping is the act of extracting data from a website. The extracted data can be used for various purposes, including analysis, machine learning, and research. The basic tools most developers use for web scraping in Python are the requests module and the BeautifulSoup library.