Web scraping is the process of extracting data from websites. The extracted data can then be transformed into formats such as XML, CSV, or JSON and analyzed to perform other tasks as needed. In this post, we are going to discuss various open-source web scraping frameworks and libraries available in Python.
Among the plethora of web scraping tools available, there are some good open-source frameworks and libraries whose source code you can inspect, extend, and build on. Individuals and organizations can leverage them to scrape in a fast, simple, yet extensive way.
Top 5 Open Source Web Scraping Frameworks and Libraries
1. Requests
Requests is an open-source HTTP library written by Kenneth Reitz. As the name suggests, its primary objective is to request a web page or URL and return the response. The response could be HTML, JSON, a file, etc. To install it, run the following pip command:
pip install requests
Let’s see a couple of examples:
Fetching HTML
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://www.reddit.com/r/python', headers=headers)
html = response.text
Fetching JSON
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://www.reddit.com/r/python.json', headers=headers)
json_data = response.json()
The first example returns the markup of the specified URL, while the second one parses the JSON response into a Python dict. In both cases, custom headers are set by passing the headers named parameter.
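As a small extra safeguard (not shown in the snippets above), you may also want to check the response before using it. This is a minimal sketch, assuming the same subreddit URL, that sets a timeout and lets requests raise an exception on HTTP errors:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
# timeout avoids hanging forever; raise_for_status() turns 4xx/5xx responses into exceptions
response = requests.get('https://www.reddit.com/r/python', headers=headers, timeout=10)
response.raise_for_status()
html = response.text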
2. BeautifulSoup
Beautifulsoup4, or bs4, is an HTML/XML parser library written in Python. The sole purpose of this library is to access different elements in the DOM. Usually, developers use requests to fetch the markup of the page and bs4 to parse the HTML document. To install Beautiful Soup, run the following command:
pip install beautifulsoup4
Once installed, you can import it as:
from bs4 import BeautifulSoup
Let's write a basic example. Continuing from the one we covered above, we will first access the Python subreddit and then print the links of all entries on the page.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
response = requests.get('https://www.reddit.com/r/python', headers=headers)
# HTML of the main page
html = response.text
# Creating a bs4 object and passing the HTML parser as the second parameter
soup = BeautifulSoup(html, 'lxml')
# Fetch all links
links = soup.select('.SQnoC3ObvgnGjWt90zD9Z')
# Iterate and print href
for link in links:
    print(link['href'])
Before we go through the code, we need to look at the DOM we are going to access. Our goal is to fetch all the links. If you inspect one of the entries in Chrome, you will see that the <a> element carrying each link has the class SQnoC3ObvgnGjWt90zD9Z. So all we have to do is use the select method, which takes a standard CSS selector as input and returns the matching elements. If you have used jQuery, you should feel right at home here. Since there can be multiple links, we iterate over the list to access each element individually. To access an attribute, you simply pass its name as a key; in our case, it is the href attribute.
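As a small illustrative sketch (reusing the soup object and the same class name from the example above), you can also read the link text alongside the href:
for link in soup.select('.SQnoC3ObvgnGjWt90zD9Z'):
    # link.get('href') behaves like link['href'] but returns None if the attribute is missing
    print(link.get('href'), link.get_text(strip=True))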
3. PyQuery
PyQuery is another HTML/XML parsing library that helps fetch data by accessing the DOM. If you have used the popular JavaScript library jQuery, you should find the syntax very familiar here.
Install it by running the pip command:
pip install pyquery
We will be using the same example of fetching information from Reddit.
import requests
from pyquery import PyQuery as pq

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
r = requests.get('http://reddit.com/r/python', headers=headers)
if r.status_code == 200:
    html = r.text
    doc = pq(html)
    # Select all link elements by their CSS class
    links = doc('.SQnoC3ObvgnGjWt90zD9Z')
    for link in links:
        # Iterating a PyQuery result yields raw lxml elements, so use .attrib
        print(link.attrib['href'])
After fetching the page content, you create an instance of the PyQuery object, which we have aliased as pq. Once done, you can access any DOM element with the help of CSS selectors; in our case, it is the SQnoC3ObvgnGjWt90zD9Z class.
Iterating over the result yields raw lxml elements, so you read an attribute from each element's attrib dictionary; in our case, we are accessing the href attribute.
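If you would rather stay in PyQuery's jQuery-style API instead of touching the raw lxml elements, a minimal alternative sketch (reusing the doc object and class name from above) looks like this:
# .items() yields PyQuery-wrapped elements, so the jQuery-like .attr() is available
for link in doc('.SQnoC3ObvgnGjWt90zD9Z').items():
    print(link.attr('href'))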
4. lxml
lxml is another XML/HTML parsing library that you can use for scraping HTML documents. Beautiful Soup also has the option to use lxml as its underlying HTML parser. Let's work on a similar use case:
import requests
from lxml.cssselect import CSSSelector
from lxml.html import fromstring

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
r = requests.get('http://reddit.com/r/python', headers=headers)
if r.status_code == 200:
    html = r.text
    # Parse the markup into an lxml element tree
    doc = fromstring(html)
    # Compile a CSS selector for the link class
    sel = CSSSelector('.SQnoC3ObvgnGjWt90zD9Z')
    for a_href in sel(doc):
        print(a_href.get('href'))
We use the same requests library to fetch the HTML. Once the HTML is available, you create a document object by calling fromstring with the markup as a parameter. You then create a CSSSelector object by passing the CSS class of the anchors we need to access. Once done, we iterate over the matching elements to fetch the href attribute.
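As a side note, the element tree returned by fromstring also exposes cssselect and XPath directly, so a minimal alternative sketch (same assumed class name, with html fetched via requests as above) can skip the explicit CSSSelector object:
from lxml.html import fromstring

doc = fromstring(html)
# cssselect() compiles and applies the selector in one step (requires the cssselect package)
for a in doc.cssselect('.SQnoC3ObvgnGjWt90zD9Z'):
    print(a.get('href'))
# An equivalent XPath query that returns the href values directly
print(doc.xpath('//a[contains(@class, "SQnoC3ObvgnGjWt90zD9Z")]/@href'))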
5. Selenium
Selenium is an automation library that drives real browsers. It simulates actions you usually perform while visiting a page, for instance filling forms, clicking buttons, etc. Since it also provides methods to access a page and its DOM, you can easily use it for scraping as well.
from selenium import webdriver

# Create an instance of the browser to automate (Firefox requires geckodriver on the PATH)
driver = webdriver.Firefox()
driver.get('http://reddit.com/r/python')
# Select all link elements by their CSS class
links = driver.find_elements_by_css_selector('.SQnoC3ObvgnGjWt90zD9Z')
for l in links:
    print(l.get_attribute('href'))
driver.quit()
After importing the library, you create an instance of the browser you want to automate; in our case, it is Firefox. Once the instance is created, you can load any URL and then find the required elements by calling find_elements_by_css_selector. After that, you iterate over the links and read the href attribute with get_attribute.
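Because launching a visible browser window is slow, you can also run the same flow headlessly. This is a minimal sketch, assuming a Selenium 3-style setup with geckodriver available on the PATH:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
driver.get('http://reddit.com/r/python')
links = driver.find_elements_by_css_selector('.SQnoC3ObvgnGjWt90zD9Z')
for l in links:
    print(l.get_attribute('href'))
driver.quit()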
Conclusion
In this post, we've listed the top 5 open-source libraries for accessing a web page and parsing HTML.
If you are scraping static HTML pages, Beautiful Soup is the easiest library to use. If you are coming from a jQuery background, PyQuery should be a good start. lxml is the lightest library, but unlike the previous two it is harder for beginners to pick up. Selenium, on the other hand, is best where a page is rendered via JavaScript; try to avoid it where you can, as it is the heaviest option of all since it drives a full browser.
Have we missed any open-source web scraping frameworks or libraries that ought to have made the top 5 list above? Let us know in the comments below.