Top 5 Open Source Web Scraping Frameworks and Libraries

Web scraping is the process of extracting data from websites. The extracted data can then be transformed into formats such as XML, CSV, and JSON and analyzed or reused as needed.

In this post, we are going to discuss various open-source web scraping frameworks and libraries available in Python. 

1. Requests

Requests is an open-source HTTP library written by Kenneth Reitz. As the name suggests, its primary purpose is to request a web page or URL and return the response. The response could be HTML, CSV, a file, etc. To install it, run the following pip command:

pip install requests

Let’s see a couple of examples:

Fetching HTML

import requests

# Identify the client with a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python', headers=headers)
html = response.text

Fetching JSON

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python.json', headers=headers)
json_data = response.json()

The first example returns the markup of the specified URL, while the second returns the JSON data as a Python dict (note that json() is a method and must be called with parentheses). In both cases, custom headers are set by passing the headers named parameter.
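For illustration, here is how you might read post titles out of that dict. The data → children nesting reflects Reddit's listing format, so treat those keys as an assumption rather than a guarantee:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://www.reddit.com/r/python.json', headers=headers)
json_data = response.json()

# Each post sits under data -> children -> data (assumed Reddit listing layout)
for child in json_data['data']['children']:
    print(child['data']['title'])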

2. Beautiful Soup

Beautifulsoup4, or bs4, is an HTML/XML parser library written in Python. Its sole purpose is to access different elements in the DOM. Usually, developers use Requests to fetch the page markup and bs4 to parse the HTML document. To install Beautiful Soup, run the following command:

pip install beautifulsoup4

Once installed, you can import it as:

from bs4 import BeautifulSoup

Let's write a basic example. Continuing from the one above, we will first fetch the Python subreddit page and then print the links of all entries on the page.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python', headers=headers)

# HTML of the main page
html = response.text

# Create a BeautifulSoup object, passing the lxml parser as the second argument
soup = BeautifulSoup(html, 'lxml')

# Fetch all links
links = soup.select('.SQnoC3ObvgnGjWt90zD9Z')

# Iterate and print each href
for link in links:
    print(link['href'])

Before we go through the code, we need to look at the DOM we are going to access. Our goal is to fetch all entry links. If you inspect an entry in Chrome's developer tools, you will find that each title is an anchor tag carrying a generated class name.

You can see that the <a> tag has the class SQnoC3ObvgnGjWt90zD9Z. So all we have to do is use the select method, which takes a standard CSS selector as input and returns the matching elements. If you have used jQuery, you should feel right at home here. Since there can be multiple links, we iterate over the list to access each element individually. To read an attribute, you pass its name as a key; in our case, that is the href attribute.
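As a small extension of the snippet above, you can also read attributes defensively with get(), which returns None instead of raising a KeyError when an attribute is missing, and grab the visible text with get_text():

# Same soup object as above; get() avoids a KeyError if href is missing
for link in soup.select('.SQnoC3ObvgnGjWt90zD9Z'):
    print(link.get_text(), link.get('href'))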

3. PyQuery

PyQuery is another HTML/XML parser library that helps fetch data by accessing the DOM. If you have used the popular JavaScript library jQuery, you will find the syntax very familiar.

Install it by running the pip command:

pip install pyquery

We will be using the same example of fetching information from Reddit.

import requests
from pyquery import PyQuery as pq

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

r = requests.get('http://reddit.com/r/python', headers=headers)

if r.status_code == 200:
    html = r.text
    doc = pq(html)
    links = doc('.SQnoC3ObvgnGjWt90zD9Z')
    for link in links:
        print(link.attrib['href'])

After fetching the page content, we create an instance of the PyQuery object, which we imported under the alias pq. Once that is done, we can access any DOM element with CSS selectors; in our case, that is the SQnoC3ObvgnGjWt90zD9Z class.

Iterating over the result yields plain lxml elements, so to get an attribute you read it from the element's attrib dictionary; in our case, we access the href attribute.
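PyQuery also offers a more jQuery-like alternative: items() yields PyQuery-wrapped elements, and attr() reads an attribute by name. A short sketch using the same doc object as above:

# items() wraps each match in a PyQuery object, mirroring jQuery's chaining style
for link in doc('.SQnoC3ObvgnGjWt90zD9Z').items():
    print(link.attr('href'))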

4. lxml

lxml is another XML/HTML parsing library that you can use for scraping HTML documents. Beautiful Soup can also use lxml as its underlying HTML parser. Let's work through a similar use case:

import requests
from lxml.cssselect import CSSSelector
from lxml.html import fromstring

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}

r = requests.get('http://reddit.com/r/python', headers=headers)

if r.status_code == 200:
    html = r.text
    # Parse the markup into an lxml HTML element tree
    doc = fromstring(html)
    # Compile a CSS selector for the anchor class we want
    sel = CSSSelector('.SQnoC3ObvgnGjWt90zD9Z')
    for a_href in sel(doc):
        print(a_href.get('href'))

We use the same Requests library to fetch the HTML. Once the markup is available, we create a document object by calling fromstring with the markup as a parameter. We then create a CSSSelector object for the anchor class we want to access, and finally iterate over the matched elements to read the href attribute with get().
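As a side note, lxml's HTML elements also expose a cssselect() convenience method backed by the same cssselect package, so the explicit CSSSelector object is optional; a minimal sketch:

# Equivalent approach: call cssselect() directly on the parsed document
for a_href in doc.cssselect('.SQnoC3ObvgnGjWt90zD9Z'):
    print(a_href.get('href'))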

5. Selenium

Selenium is a library that automates browser activity. It simulates the actions you would normally perform while visiting a page, for instance filling in forms and clicking buttons. Since it also provides methods to access a page and its DOM, you can easily use it for scraping as well.

from selenium import webdriver

# Launch a Firefox instance controlled by Selenium (requires geckodriver on the PATH)
driver = webdriver.Firefox()

driver.get('http://reddit.com/r/python')

links = driver.find_elements_by_css_selector('.SQnoC3ObvgnGjWt90zD9Z')
for l in links:
    print(l.get_attribute('href'))

driver.quit()

After importing the library, create an instance of the browser you want to automate; in our case, it is Firefox. Once the instance is created, you can open any URL and then select the required elements by calling find_elements_by_css_selector. After that, you iterate over the links and read the href attribute with get_attribute.
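Because the page may be rendered by JavaScript after the initial load, elements are not always present right away. A common pattern is an explicit wait; here is a sketch using the same (assumed) Reddit class name as above:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get('http://reddit.com/r/python')
    # Wait up to 10 seconds for at least one post link to appear in the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '.SQnoC3ObvgnGjWt90zD9Z'))
    )
    for link in driver.find_elements_by_css_selector('.SQnoC3ObvgnGjWt90zD9Z'):
        print(link.get_attribute('href'))
finally:
    driver.quit()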

Conclusion

In this post, we've covered the different libraries available to fetch a web page and parse HTML.

If you are scraping static HTML pages, Beautiful Soup is the easiest library to use. If you are coming from a jQuery background, PyQuery should be a good start. lxml is the lightest-weight library, but unlike the previous two it can be harder for beginners to pick up. Selenium, on the other hand, is the best choice when a page is rendered via JavaScript, but try to avoid it when you can, as it is the heaviest library of them all.

Have we missed any open-source web scraping frameworks or libraries that ought to have made the top 5 list above? Let us know in the comments below.
