Top 5 Open Source Web Scraping Frameworks and Libraries

Web scraping is the process of extracting data from websites. The extracted data can then be transformed into formats like XML, CSV, and JSON and analyzed as your needs dictate. In this post, we discuss several open-source web scraping frameworks and libraries available in Python.

Among the plethora of web scrapers available, there are some good open-source frameworks and libraries whose source code users can read, extend, and build upon. Individuals and organizations can leverage these frameworks to scrape in a fast, simple yet extensive way.

Also Read: How to Bypass Anti-Scraping Tools on Websites

Top 5 Open Source Web Scraping Frameworks and Libraries

1. Requests

Requests is an open-source HTTP library written by Kenneth Reitz. As the name suggests, its primary job is to send a request to a web page or URL and return the response. The response could be HTML, JSON, a file, etc. To install it, run the following pip command:

pip install requests

Let’s see a couple of examples:

Fetching HTML

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python', headers=headers)
html = response.text

Fetching JSON

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python.json', headers=headers)
json_data = response.json()

The first example returns the markup of the specified URL, while the second returns the JSON data parsed into a Python dict. In both cases, we set a custom User-Agent by passing the named headers parameter to requests.get.
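To make the dict handling concrete, here is an offline sketch using the standard json module, which performs the same parsing step that response.json() runs on the response body. The key names below are illustrative, not Reddit's actual schema.

```python
import json

# A toy payload shaped like a listing; the key names are illustrative,
# not Reddit's real JSON schema.
payload = '{"kind": "Listing", "data": {"children": [{"title": "Hello world"}]}}'

# response.json() applies this same parsing to the response body
data = json.loads(payload)

# Once parsed, you navigate it like any nested Python dict
print(data['data']['children'][0]['title'])
```
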

2. BeautifulSoup

BeautifulSoup (packaged as beautifulsoup4, imported as bs4) is an HTML/XML parser library written in Python. Its sole purpose is to access different elements in the DOM. Usually, developers use Requests to fetch the markup of a page and bs4 to parse the HTML document. To install BeautifulSoup, run the following command:

pip install beautifulsoup4

Once installed, you can import it as:

from bs4 import BeautifulSoup

Let's write a basic example. Continuing the one we covered above, we will first access the Python subreddit and then print the links of all entries on the page.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get('https://www.reddit.com/r/python', headers=headers)

# HTML of the main page
html = response.text

# Create a bs4 object, passing the HTML parser to use as the second argument
soup = BeautifulSoup(html, 'lxml')

# Fetch all links
links = soup.select('.SQnoC3ObvgnGjWt90zD9Z')

# Iterate and print each href
for link in links:
    print(link['href'])

Before we go through the code, we need to look at the DOM we are going to access. Our goal is to fetch all links. If you inspect an element in Chrome DevTools, you will find something like the markup below:

You can see that the <a> tag has a class named SQnoC3ObvgnGjWt90zD9Z. So all we have to do is use the select method, which takes a standard CSS selector as input and returns the matching elements. If you have used jQuery, you should feel right at home here. Since there can be multiple links, you have to iterate over the list to access each one. To read an attribute, you pass its name as a key; in our case, it is the href attribute.
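To see the select call in isolation, here is a self-contained sketch with inline markup; the class name mirrors Reddit's auto-generated one, and the built-in html.parser is used so no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Inline markup standing in for the downloaded Reddit page
html = '''
<div>
  <a class="SQnoC3ObvgnGjWt90zD9Z" href="/r/python/post1">Post 1</a>
  <a class="SQnoC3ObvgnGjWt90zD9Z" href="/r/python/post2">Post 2</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# select returns every element matching the CSS class selector
for link in soup.select('.SQnoC3ObvgnGjWt90zD9Z'):
    print(link['href'])
```
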

3. PyQuery

PyQuery is another HTML/XML parser library that fetches data by accessing the DOM. If you have used the popular JavaScript library jQuery, you will find the syntax very familiar.

Install it by running the pip command:

pip install pyquery

We will be using the same example of fetching information from Reddit.

import requests
from pyquery import PyQuery as pq

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

r = requests.get('http://reddit.com/r/python', headers=headers)

if r.status_code == 200:
    html = r.text
    doc = pq(html)
    links = doc('.SQnoC3ObvgnGjWt90zD9Z')
    for link in links:
        print(link.attrib['href'])

After fetching the page content, you create an instance of the PyQuery object, imported here under the alias pq. Once that is done, you can access any DOM element with the help of CSS selectors; in our case, it was the SQnoC3ObvgnGjWt90zD9Z class.

To get an attribute, you read the element's attrib dictionary (note that iterating a PyQuery selection yields raw lxml elements, and attrib is a dict on those elements, not a method). In our case, we are accessing the href attribute.

4. lxml

lxml is another XML/HTML parser library that you can use to scrape HTML documents. As mentioned earlier, BeautifulSoup can also use lxml as its underlying HTML parser. Let's work on a similar use case:

import requests
from lxml.cssselect import CSSSelector
from lxml.html import fromstring

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}

r = requests.get('http://reddit.com/r/python', headers=headers)

if r.status_code == 200:
    html = r.text
    doc = fromstring(html)
    sel = CSSSelector('.SQnoC3ObvgnGjWt90zD9Z')
    for a_href in sel(doc):
        print(a_href.get('href'))

We use the same Requests library to access the HTML. Once the HTML is available, we create a document object by calling fromstring with the markup as a parameter, then build a CSSSelector for the anchor class we want to match. Finally, we iterate over the matching elements and fetch the href attribute.
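lxml also supports XPath directly, which can replace the CSSSelector step entirely. A self-contained sketch with inline markup so it runs without a network request:

```python
from lxml.html import fromstring

# Inline markup standing in for the downloaded page
html = '<div><a class="SQnoC3ObvgnGjWt90zD9Z" href="/r/python/post1">Post</a></div>'
doc = fromstring(html)

# This XPath expression matches the same anchors the CSS selector did,
# returning the href values directly
for href in doc.xpath('//a[@class="SQnoC3ObvgnGjWt90zD9Z"]/@href'):
    print(href)
```
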

5. Selenium

Selenium is an automation library that drives browser activity. It simulates the actions you would usually perform while visiting a page, for instance filling forms and clicking buttons. Since it also provides methods to access a page and its DOM, you can easily use it for scraping as well.

from selenium import webdriver

driver = webdriver.Firefox()

driver.get('https://www.reddit.com/r/python')

links = driver.find_elements_by_css_selector('.SQnoC3ObvgnGjWt90zD9Z')

for link in links:
    print(link.get_attribute('href'))

driver.quit()

After importing the library, you create an instance of the browser you want to automate; in our case, it is Firefox. Once the instance is created, you can navigate to any URL and then select the required class by calling find_elements_by_css_selector. After that, you can iterate over the links and read the href attribute.

Also Read: How to Build a Web Crawler in Python from Scratch

Conclusion

In this post, we've listed the top 5 open-source libraries for accessing a web page and parsing its HTML.

If you are going to scrape static HTML pages, BeautifulSoup is the easiest library to use. If you come from a jQuery background, PyQuery should be a good start. lxml is the lightest of the libraries but, unlike the previous two, it is harder for beginners to pick up. Selenium, on the other hand, is best where a page is rendered via JavaScript; avoid it when you can, as it is the heaviest library of them all.

Have we missed any open source web scraping frameworks or libraries that ought to have made the top 5 list above? Let us know in the comments below.

About the author

Bhagyeshwari Chauhan

Content creator and Digital Marketing Strategist at Datahut