We’re continuing our web scraping tutorial series with another post, this time on how to scrape data from the IMDb website. The data and the Python notebook are available for download at the bottom of the page.
The Internet Movie Database (IMDB) is one of the most popular websites in the world. It has a vast database of movies, actors, and directors, as well as trivia and reviews on each movie.
Because it's so popular, many people use it to find information about their favorite actors and movies. However, IMDB's website is not easy to navigate and doesn't always provide the most relevant information on a particular actor or movie.
If you want to build a movie recommendation engine that recommends movies according to your taste, you'll need data sets of different movies of different genres. Scraping IMDB makes it possible to extract all of this data in an automated fashion, which can then be analyzed by computer programs. This allows you to get precisely what you want without spending hours looking through pages of unorganized information.
This blog has a few key differences from the previous tutorial. They are listed below.
We will be scraping data from the IMDB website.
The data we scrape will be stored in JSON format.
We will use XPath expressions instead of CSS selectors to locate the elements on the HTML page.
Goal:
We will extract the top 250 movies from IMDb using Python with BeautifulSoup, lxml, and a few other libraries.
The Source:
We will extract the data from the IMDB top 250 list. We chose this example for a couple of reasons.
The first reason is that it is a simple website, so scraping the data is straightforward. Even people who are still learning Python should be able to build a web scraper for IMDb.
The second reason is to introduce readers to JSON, a format that many people use. Most tutorials focus on extracting data into CSV or Excel, and we wanted to give JSON a try.
The Attributes:
We will be extracting the following data attributes from the individual pages.
The movie URL - The URL gets us to the target page
Rank - The rank of the movie in the top list
The movie name - The unique name of the movie
Movie Year - The year the movie was released
Genre - The genre of the movie, which can be a single genre or a list of genres
Director Name - The name of the director
Rating - The IMDb rating of the movie
Actors List - The list of the movie's cast
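As a rough sketch of what one scraped record could look like, the snippet below builds a dictionary with the attributes listed above and serialises it with the standard json module. The key names match the ones used later in this tutorial; every value is a made-up placeholder, not scraped data.

```python
import json

# Placeholder values for illustration only; nothing here was scraped.
example_record = {
    "rank": 1,
    "movie_name": "Example Movie",
    "movie_url": "https://www.imdb.com/title/tt0000000/",
    "movie_year": "2000",
    "genre": ["Drama", "Crime"],
    "director_name": "Example Director",
    "rating": "9.0",
    "actors": ["Actor One", "Actor Two"],
}

# Lists such as genre and actors map directly onto JSON arrays,
# which is one reason JSON suits this data better than a flat CSV row.
serialised = json.dumps(example_record, indent=4)
round_tripped = json.loads(serialised)
print(serialised)
```

Loading the string back with json.loads returns the same dictionary, with the genre and actors lists intact.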
Importing required Libraries
Here is the Python code to import the required libraries. We import the requests library, BeautifulSoup, the etree module from lxml, and the time, random, json, and unidecode libraries. If any of these are not installed, install them first.
We will explain the use of each library in due course.
import requests
from bs4 import BeautifulSoup
from lxml import etree as et
import time
import random
import json
from unidecode import unidecode
Any web scraper needs to know where to start the scraping process. We usually refer to this as the start URL. We also need to define a User-Agent header and an empty list to store the movie URLs.
Requests
Requests is a Python library for sending HTTP requests and processing the responses. You can use it for simple calls, such as retrieving a page from a website or sending data to a server over HTTP.
start_url = "https://www.imdb.com/chart/top"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
movie_urls = []
The IMDb page looks like this. The first step is to find the URLs to each of the 250 movies listed on the page.
We will use Chrome developer tools to find the link on the page. Instead of a CSS selector, we will use XPath to locate the element. Open the developer tools and hover over the first movie name.
You can see that the movie name is enclosed in an ‘a’ tag, which sits inside a table cell. The table holds one such link for each of the 250 movies. We can use the following XPath to extract the links.
//td[@class="titleColumn"]/a/@href
The expression selects every ‘td’ cell whose class is “titleColumn”, steps into the ‘a’ tag inside it, and returns the value of its href attribute, which is the link. In the code below, response.content holds the raw bytes of the page returned by the server, which we pass to BeautifulSoup.
response = requests.get(start_url, headers=header)
soup = BeautifulSoup(response.content, 'html.parser')
dom = et.HTML(str(soup))
movie_urls_list = dom.xpath('//td[@class="titleColumn"]/a/@href')
We've also used the BeautifulSoup and etree libraries here. BeautifulSoup is a Python library for parsing HTML and XML documents, useful for everything from quick, simple tasks to complex data mining. etree is the lxml module for parsing and generating XML and HTML trees. It follows the API of the standard library's ElementTree package and adds features such as full XPath support.
First, we get a response using the requests library. Next, we build a BeautifulSoup object from the response content using the HTML parser. et.HTML then converts the page into an element tree, which makes programmatic navigation with XPath simple. Running the XPath on this tree gives us a list of movie URLs.
Upon inspection, we find that the entries in movie_urls_list are not in the format we need: they lack the IMDb domain name and carry a long query string.
We therefore prepend the IMDb domain to each entry. On further inspection, everything after the question mark (“?”) can be removed and the link still leads to the same page, so only the shortened URL is added to the movie_urls list. The code below achieves this. Experiment with the data, and you’ll see.
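The XPath lookup described above can be tried offline on a small hand-written fragment. The sketch below is not the tutorial's lxml code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the href is read with .get() instead of a trailing /@href. The sample markup and its hrefs are made-up placeholders that imitate the table layout.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the table layout; hrefs are placeholders.
SAMPLE = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/?ref=chart">First</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/?ref=chart">Second</a></td></tr>
  <tr><td class="ratingColumn"><a href="/ignored/">Other</a></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree understands the [@attr='value'] predicate but not /@href,
# so we select the 'a' elements and read the attribute from each one.
links = [a.get("href") for a in root.findall(".//td[@class='titleColumn']/a")]
print(links)
```

Only the two links inside cells with the titleColumn class are returned; the link in the other cell is skipped by the predicate.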
for i in movie_urls_list:
    long_url = "https://www.imdb.com" + i
    short_url = long_url.split("?")[0]
    movie_urls.append(short_url)
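The same clean-up can also be sketched with the standard library's urllib.parse instead of string concatenation and split. The helper name clean_movie_url and the sample href are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"

def clean_movie_url(href, base=BASE_URL):
    """Join a relative href to the base URL and drop the query string and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Placeholder href in the same shape as the entries in movie_urls_list.
cleaned = clean_movie_url("/title/tt0000001/?ref_=chart")
print(cleaned)
```

Compared with splitting on “?”, this variant also removes any fragment and leaves hrefs without a query string unchanged.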
How to add a time delay between requests using Python
It is always a good idea to add a time delay between successive requests so that we do not burden the target website's server. Aggressive scraping can even expose you to legal claims such as trespass to chattels.
We will use a simple time delay function for this. It picks a random whole number between 2 and 5 and pauses for that many seconds before the next request is sent. See the function below.
def time_delay():
    time.sleep(random.randint(2, 5))
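If you want to tune the pause or try the scraper without actually waiting, a configurable variant might look like the sketch below. The name polite_delay and its parameters are assumptions made for this example; it draws a fractional delay with random.uniform, whereas the function above draws whole seconds.

```python
import random
import time

def polite_delay(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Dry run: pass a no-op in place of time.sleep so nothing actually waits.
waited = polite_delay(sleep=lambda seconds: None)
print("would have waited {:.2f} seconds".format(waited))
```

In the real scraper you would call polite_delay() with no arguments so that time.sleep is used.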
How to write Scraped data to a JSON file
We need to use the json library to write the scraped data into a JSON file. We first create a file that contains an empty list, then define a function that appends a record to it, and we call that function every time a new movie is scraped.
with open("data_v1.json", "w") as f:
    json.dump([], f)

def write_to_json(new_data, filename='data_v1.json'):
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file, indent=4)
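We use json.dump to write the data to the file. The file starts out as an empty list, and each call to write_to_json loads that list, appends the new movie dictionary, and writes the whole list back.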
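The load, append, and write-back cycle can be tried in isolation with the sketch below. The function name append_record is hypothetical; it is a slightly more defensive variant of the function above that opens the file as UTF-8 and truncates after writing, and the records are placeholders.

```python
import json
import os
import tempfile

def append_record(new_data, filename):
    """Append one record to a JSON file whose top-level value is a list."""
    with open(filename, "r+", encoding="utf-8") as file:
        records = json.load(file)
        records.append(new_data)
        file.seek(0)
        json.dump(records, file, indent=4)
        file.truncate()  # drop leftover bytes if the new content is shorter

# Demonstration in a temporary directory with placeholder records.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([], f)
    append_record({"rank": 1, "movie_name": "Example A"}, path)
    append_record({"rank": 2, "movie_name": "Example B"}, path)
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)

print(saved)
```

Rewriting the whole list on every call is simple and fine for 250 records. For much larger jobs, collecting the records in memory and writing once at the end would avoid the repeated reads.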
Extracting data from the page URLs in the movie_urls list
In this step, we iterate through the movie_urls list one URL at a time. For each URL, we use XPath to find the attributes listed above. Once the attributes are extracted, they are collected into a dictionary and written to the JSON file using the write function. We call the time delay function to leave gaps between successive requests.
We also use the unidecode library. The function unidecode() takes Unicode text and tries to represent it in ASCII characters. The best way to understand this is to leave it out and inspect the data: you'll see some strange letters in the text. Using unidecode eliminates that problem.
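To see the kind of conversion involved without installing anything, the sketch below approximates it with the standard library's unicodedata module. This is a stand-in for illustration, not how unidecode works internally: it only strips accents that can be decomposed, and characters with no decomposition (such as "ø") are dropped here rather than transliterated. The function name strip_accents is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Rough ASCII approximation: decompose characters, then drop non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

converted = strip_accents("Amélie")
print(converted)  # Amelie
```

For real scraped names, unidecode remains the better choice because it covers many more scripts and characters.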
for rank, movie_url in enumerate(movie_urls, start=1):
    # Fetch and parse the individual movie page
    response = requests.get(movie_url, headers=header)
    soup = BeautifulSoup(response.content, 'html.parser')
    dom = et.HTML(str(soup))
    # Extract the attributes with XPath; [0] takes the first match
    movie_name = dom.xpath('//h1[@data-testid="hero-title-block__title"]/text()')[0]
    movie_year = dom.xpath('//a[@class="ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"]/text()')[0]
    genre = dom.xpath('//span[@class="ipc-chip__text"]/text()')
    director_name = dom.xpath('//a[@class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link"]/text()')[0]
    rating = dom.xpath('//span[@class="sc-7ab21ed2-1 jGRxWM"]/text()')[0]
    actors_list = dom.xpath('//a[@data-testid="title-cast-item__actor"]/text()')
    actors_list = [unidecode(i) for i in actors_list]
    write_to_json({'rank': rank,
                   'movie_name': movie_name,
                   'movie_url': movie_url,
                   'movie_year': movie_year,
                   'genre': genre,
                   'director_name': unidecode(director_name),
                   'rating': rating,
                   'actors': actors_list})
    time_delay()
    # Report progress as a percentage, rounded to two decimal places
    print("{}% data is written to json file".format(round((rank * 100) / len(movie_urls), 2)))
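The loop above indexes several XPath results with [0], which raises an IndexError whenever an expression matches nothing, for example if the page's class names change. A small guard such as the sketch below can keep the run going. The helper name first_or_default is hypothetical, and plain lists stand in for what dom.xpath(...) returns.

```python
def first_or_default(results, default=None):
    """Return the first item of an XPath result list, or default if it is empty."""
    return results[0] if results else default

# Plain lists stand in for the lists returned by dom.xpath(...).
found = first_or_default(["9.2"])
missing = first_or_default([], default="N/A")
print(found, missing)
```

In the loop, a line such as rating = first_or_default(dom.xpath(...)) would then store None instead of stopping the scraper when the rating is not found.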
Download the Python notebook with the source code here:
Download the data from here: Download IMDB data
Wrapping up
In this tutorial, we have learned how to scrape IMDB data using Python. We have looked at different tools and libraries that can be used. We have also seen how to use BeautifulSoup and lxml to scrape the data. This can be very useful if you want to get information on movies or TV shows from IMDB.
If you're trying to extract data at scale, the BeautifulSoup and lxml combination alone is unlikely to be enough. Handling the challenges of scale requires more tools and skill sets. We recommend using open-source libraries for low-volume, one-off web scraping. For large-volume data extraction, you need the expertise of people at Datahut.
We're scraping millions of pages every day at Datahut. Contact us today to discuss your web scraping needs. We can get you the data without any coding or hassles.