Beginner’s guide to Web Scraping with Python lxml

Web Scraping with Python is a popular subject among data science enthusiasts. This guide is aimed at beginners who want to learn Web Scraping with the Python lxml library.
What is lxml?

 

 
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python programming language. It is a Pythonic binding for two C libraries, libxml2 and libxslt. lxml is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.
With the continued growth of both Python and XML, there is a plethora of packages out there that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the Python lxml package has two big advantages:
  • Performance: Reading and writing even fairly large XML files takes an almost imperceptible amount of time.
  • Ease of programming: The Python lxml library has an easy syntax and is more adaptable than other packages.
lxml is similar in many ways to two earlier packages, which can be thought of as its parent packages:
  • ElementTree: This is used to create and parse a tree structure of XML nodes.
  • xml.etree.ElementTree: This is now an official part of the Python standard library. There was a C-language version called cElementTree, which may be even faster than lxml for some applications. Since Python 3.3 the C accelerator is used automatically, and the separate cElementTree module was removed in Python 3.9.
However, lxml is preferred by many Python developers because it provides a number of additional features that make life easier. In particular, it supports XPath, which makes it considerably easier to manage more complex XML structures.
The Python lxml library can be used either to create an XML/HTML structure using elements, or to parse an XML/HTML structure and retrieve information from it. It can be used to get information from different web services and web resources, as these are implemented in XML/HTML format. The objective of this tutorial is to throw light on how lxml helps us get and process information from different web resources.
How to install lxml?
 
lxml can be installed as a Python package using pip, the package manager for Python. Run the following command to install it on your system:
pip install lxml
pip also installs any dependencies that lxml needs.
lxml can also be installed as a system package using binary installers, depending on your operating system. I prefer the pip method, because many systems do not offer an equally clean way to install the package through binary installers.
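Whichever method you use, a quick way to confirm that the install worked is to import the package and print its version. This is only a small sanity check; the version numbers you see will depend on your system.

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.
print("lxml version:", etree.LXML_VERSION)
print("libxml2 version:", etree.LIBXML_VERSION)
```

If the import fails with a ModuleNotFoundError, lxml was most likely installed for a different Python interpreter than the one you are running.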
How to use lxml?
 
Python is a very easy language to learn, but libraries written in Python are not always as easy. It is hard to get a clear picture of what a library does from a description alone, and a practical implementation brings us closer to understanding what the library is actually doing. Let us pick a few examples and use lxml in practical scenarios. A successful implementation of Web Scraping with Python takes time and practice.
As discussed earlier, we can use Python lxml to create as well as parse XML/HTML structures.
In a first and very basic example, let’s create an HTML web page structure using Python lxml and define some elements and their attributes. So, let us begin!
lxml has many modules, and one of them is etree, which is responsible for creating elements and building structures from these elements.
First, let’s import the required module in Python. I generally prefer to use the IPython command shell to execute Python programs because it gives an extensive and clear command prompt for exploring Python features.

After importing the etree module, we can use the Element API to create multiple elements. In general, elements can be called nodes as well.
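As a minimal sketch of these first two steps, the snippet below imports etree and creates a single element. The tag name "html" is just the one we need for the page we are about to build.

```python
from lxml import etree

# Create a standalone element; it will become the root of our tree.
root = etree.Element("html")

print(root.tag)              # the element's name
print(etree.tostring(root))  # the serialized element, as bytes
```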

XML/HTML pages are designed on a parent-child paradigm, where elements can play the role of parents and children for other element nodes. To create a parent-child relationship using Python lxml, we can use the SubElement function of the etree module.
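Here is a small sketch of the parent-child idea. Each SubElement call creates a new element and attaches it to the parent passed as the first argument. The variable names are my own choice.

```python
from lxml import etree

root = etree.Element("html")

# Each call creates a child and appends it to the given parent.
head = etree.SubElement(root, "head")
body = etree.SubElement(root, "body")
title = etree.SubElement(head, "title")

# An element behaves like a list of its children.
print([child.tag for child in root])
# getparent() walks one step back up the tree.
print(title.getparent().tag)
```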

Element nodes have multiple properties. For example, the text property sets a text value for a node, which is the information shown to the end user. We can also set attributes for any node in the tree structure. Putting these pieces together, we can build an HTML tree structure with lxml and its etree module, which can also be saved as an HTML web page.
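The following sketch puts text and attributes together and serializes the result. The page content, the style value, and the id are invented for illustration. Note that tostring() uses XML serialization by default, which is fine for a simple page like this one.

```python
from lxml import etree

root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "My first page"

body = etree.SubElement(root, "body")
# Attributes can be passed as keyword arguments when creating an element ...
heading = etree.SubElement(body, "h1", style="color:blue")
heading.text = "Hello, lxml!"
# ... or set afterwards with set().
para = etree.SubElement(body, "p")
para.text = "This page was built with lxml.etree."
para.set("id", "intro")

# encoding="unicode" returns a str instead of bytes.
page = etree.tostring(root, pretty_print=True, encoding="unicode")
print(page)
```

To save the page, write the string to a file, for example with open("page.html", "w").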

 

Now, let’s take another example in which we shall see how to parse an HTML tree structure. This process is a part of scraping content from the web, so you can follow it if you want to scrape data from the web and process the data further.
In this example, let us use the requests Python module, which is used to send HTTP requests to web URLs. The requests module offers better speed and readability than the built-in urllib2 module (urllib.request in Python 3), so it is the better choice. Along with requests, the html module from lxml is used to parse the response of the request.
First, let’s import the required modules.

Using the requests module, let’s send a GET request to the cnn.com website to retrieve the top news stories. The web server's reply comes back as a <Response [200]> object. We store this in a page variable and then use the html module to parse it and save the results in a tree. The Response object has multiple properties such as response headers, content, and cookies. We can use the Python dir() function to see all of these object properties. Here, I am using page.content instead of page.text because html.fromstring can take bytes as input and work out the encoding itself, whereas page.text provides the content already decoded to text using the encoding that requests inferred from the server's response.
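A sketch of this step might look as follows. The fetch_tree helper, the timeout value, and the sample markup are my own assumptions, not part of the original example. The download is wrapped in a function so that the parsing step can be tried offline on a small piece of HTML first.

```python
import requests
from lxml import html


def fetch_tree(url):
    """Download a page and parse it into an lxml HTML tree (hypothetical helper)."""
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail loudly on 4xx/5xx responses
    # page.content is bytes, so lxml can detect the encoding itself.
    return html.fromstring(page.content)


# Parsing works the same way on any bytes, so it can be tried without a network:
sample = b"<html><head><title>Top stories</title></head><body><p>Hi</p></body></html>"
tree = html.fromstring(sample)
print(tree.tag)
print(tree.findtext(".//title"))

# With a network connection, the real page would be loaded like this:
# tree = fetch_tree("https://www.cnn.com")
```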
 

The html module also provides multiple functions to access the parsed object. For example, to iterate over the children of the html element, we can use iterchildren(). The tree now contains the whole HTML file in a nice tree structure, which we can go over in two different ways: XPath and CSSSelect. In this example, we will focus on the former.
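To get a feel for the parsed tree before turning to XPath, here is a short sketch that walks the direct children of the root element. The markup is a made-up stand-in for a downloaded page.

```python
from lxml import html

tree = html.fromstring(
    b"<html><head><title>T</title></head><body><h1>A</h1><p>B</p></body></html>"
)

# iterchildren() yields only the direct children of an element.
for child in tree.iterchildren():
    # len(child) is the number of children that element has in turn.
    print(child.tag, len(child))
```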
XPath is a way of locating information in structured documents such as HTML or XML documents. XPath uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.
The most useful path expressions are listed below:
Expression   Description
nodename     Selects all nodes with the name “nodename”
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
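Each of these expressions can be tried with the xpath() method that lxml elements provide. The tiny document below, with its library, shelf, and item tags, is invented purely for illustration.

```python
from lxml import etree

doc = etree.fromstring(
    "<library><shelf id='s1'><item>first</item><item>second</item></shelf></library>"
)

print([e.tag for e in doc.xpath("shelf")])     # nodename: children named "shelf"
print([e.tag for e in doc.xpath("/library")])  # /  : start from the root
items = doc.xpath("//item")                    # // : anywhere in the document
print([i.text for i in items])

first = items[0]
print(first.xpath(".")[0].tag)                 # .  : the current node
print(first.xpath("..")[0].tag)                # .. : the parent node
print(doc.xpath("//shelf/@id"))                # @  : attribute values
```

Note that xpath() always returns a list, even when only one node matches.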

 

Following are some path expressions and their results:

 

 

Path Expression   Result
bookstore         Selects all nodes with the name “bookstore”
/bookstore        Selects the root element bookstore
                  Note: If the path starts with a slash ( / ) it always represents an absolute path to an element!
bookstore/book    Selects all book elements that are children of bookstore
//book            Selects all book elements no matter where they are in the document
bookstore//book   Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang           Selects all attributes that are named lang
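The bookstore expressions can be checked against a small sample document. The XML below is made up for this sketch. One detail to keep in mind: in lxml, a relative expression is evaluated from the element you call xpath() on, so when we start at the bookstore element itself, "bookstore/book" becomes simply "book", and "bookstore//book" is written here as the absolute path "/bookstore//book".

```python
from lxml import etree

xml = """
<bookstore>
  <book><title lang="en">Book One</title></book>
  <book><title lang="fr">Book Two</title></book>
  <section><book><title lang="en">Book Three</title></book></section>
</bookstore>
"""
root = etree.fromstring(xml)

print(root.xpath("/bookstore")[0].tag)      # the root element
print(len(root.xpath("book")))              # book elements that are direct children
print(len(root.xpath("//book")))            # book elements anywhere in the document
print(len(root.xpath("/bookstore//book")))  # book descendants of bookstore
print(root.xpath("//@lang"))                # every lang attribute value
```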
Let’s get back to our scraping example. So far we have downloaded the HTML web page and built a tree structure from it. We are using XPath to select nodes from this tree structure. As we want to get the top stories, we have to analyse the web page to find the tags that store this information. Upon analysis we can see that the h3 tag with the data-analytic attribute contains this information. Selecting this node allows us to fetch the text of the news stories and the web links to the complete stories.
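The selection step could be sketched like this. Since the real page changes over time, the snippet runs on stand-in markup that I wrote to match the description above (an h3 tag carrying a data-analytic attribute, wrapping a link). The a/span nesting is an assumption, so the expressions will need adjusting to whatever the live page actually contains.

```python
from lxml import html

# Stand-in markup; the structure of the real page is an assumption here.
sample = b"""
<html><body>
  <h3 data-analytic="top-1"><a href="/news/story-one"><span>Story one</span></a></h3>
  <h3 data-analytic="top-2"><a href="/news/story-two"><span>Story two</span></a></h3>
  <h3><a href="/other"><span>Not a top story</span></a></h3>
</body></html>
"""
tree = html.fromstring(sample)

# [@data-analytic] keeps only the h3 elements that have this attribute.
titles = tree.xpath("//h3[@data-analytic]/a/span/text()")
links = tree.xpath("//h3[@data-analytic]/a/@href")

print(titles)
print(links)
```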

To give a better representation to this scraped data, I am zipping the news stories and links together and storing them in a list, which can later be printed or stored in a database for further processing.
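A minimal sketch of the zipping step is shown below. The two lists are placeholders standing in for the XPath results, and joining relative links onto a base URL with urljoin is my own addition, assuming the page uses relative links.

```python
from urllib.parse import urljoin

# Placeholder data standing in for the scraped titles and links.
titles = ["Story one", "Story two"]
links = ["/news/story-one", "/news/story-two"]
base_url = "https://www.cnn.com"

# zip() pairs each title with the link at the same position.
stories = [
    (title.strip(), urljoin(base_url, link))
    for title, link in zip(titles, links)
]

for title, url in stories:
    print(title, "->", url)
```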

Ta da! We have successfully covered scraping using Python lxml and requests. We have the data stored in memory as a list. Now we can do all sorts of cool stuff with it: analyze it using Python, or save it in a file and share it with the world.
We have covered most of the stuff related to Web Scraping with the Python lxml module and also understood how we can combine it with other Python modules to do some impressive work. Below are a few references which can be helpful in knowing more about it.
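As one example of saving the results, here is a small sketch that writes the list to a CSV file using the standard csv module. The file name, the column headers, and the placeholder rows are assumptions for illustration. The file goes into the system's temporary directory so that the example does not clutter your working folder.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder rows standing in for the scraped (title, url) pairs.
stories = [
    ("Story one", "https://www.cnn.com/news/story-one"),
    ("Story two", "https://www.cnn.com/news/story-two"),
]

out_path = Path(tempfile.gettempdir()) / "top_stories.csv"

# newline="" lets the csv module control line endings itself.
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])  # header row
    writer.writerows(stories)

print("Saved", len(stories), "stories to", out_path)
```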
Do share this if you enjoyed reading this blog post on Web Scraping with Python. Write a web scraper on your own and share your experience with us.
References
  1.  lxml – XML and HTML with Python
  2.  lxml.etree Tutorial
  3.  Parsing XML and HTML using lxml

About Datahut

Datahut helps companies get structured data feeds from websites.