How to Scrape Amazon Reviews using Scrapy

Scraping Amazon Reviews using Scrapy in Python

Are you looking for a way to scrape Amazon reviews but do not know where to begin? In that case, you may find this blog very useful. In this blog, we will discuss scraping Amazon reviews using Scrapy in Python. Web scraping is a simple means of collecting data from different websites, and Scrapy is a web crawling framework written in Python. Web scraping lets users collect data for their own requirements, for example online merchandising, price monitoring and driving marketing decisions.

In case you are wondering whether this process is even legal or not, you can find the answer to this query here

Before digging into scraping Amazon for product reviews, let us first look at a few use-cases for scraping Amazon reviews in the first place.

Why the need for scraping Amazon reviews?

  1. Sentiment analysis of product reviews
    Sentiment analysis can be performed on the reviews scraped from products on Amazon. Such a study helps in identifying users’ emotions towards a particular product. This can help sellers, and even prospective buyers, understand the public sentiment related to the product.
  2. Optimising dropshipping sales
    Dropshipping is a business model that allows a company to operate without an inventory or a warehouse for storing its products. You can use web scraping to get product pricing and user opinions, understand the needs of the customer and keep up with trends.
  3. Web scraping for online reputation monitoring
    It is difficult for large-scale companies to monitor the reputation of their products. Web scraping can help in extracting relevant review data, which can act as input to different analysis tools that measure users’ sentiment towards the organisation.

What is Scrapy?

Scrapy is a web crawling framework in which a developer writes code, called spiders, that defines how a particular site (or a group of websites) will be scraped. Its most significant feature is that it is built on Twisted, an asynchronous networking library, which makes spider performance very good.

Let us now have a look at the basic pipeline for scraping Amazon reviews.

Scraping Amazon reviews Pipeline

I always feel that it is essential to have a holistic idea of the work before you start doing it, which in our case is scraping Amazon reviews. Hence, before we begin the coded implementation with Scrapy, let us take a high-level look at the complete pipeline. In this section, we will look at the different stages involved, along with a short description of each. This will give you an overall idea of the task we are going to do using Python in the later section.

  1. Analysing HTML structure of the webpage

    Scraping is about finding patterns in web pages and extracting them. Before starting to write a scraper, we need to understand the HTML structure of the target web page and identify patterns in it. The patterns can be related to the repetitive use of classes, ids and other HTML elements.

  2. Scrapy parser implementation in Python

    After analysing the structure of the target web page, we work on the coded implementation in Python. The Scrapy parser’s responsibility is to visit the targeted web page and extract the information according to the specified rules.

  3. Collection and Storage of Information

    The parser can dump the results in any format you wish, be it CSV or JSON. This is the final output file in which your scraped data resides.
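
To make the storage stage concrete, here is a minimal sketch that writes a couple of records to both CSV and JSON using only the Python standard library. The records, the file names and the helper functions `save_as_csv` and `save_as_json` are hypothetical and only for illustration; Scrapy can produce the same kind of files on its own, as shown later with the `-o` option.

```python
import csv
import json

# Hypothetical scraped records; the keys mirror the fields extracted later in this blog.
reviews = [
    {"stars": "5.0 out of 5 stars", "comment": "Great laptop"},
    {"stars": "3.0 out of 5 stars", "comment": "Battery could be better"},
]


def save_as_csv(records, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["stars", "comment"])
        writer.writeheader()
        writer.writerows(records)


def save_as_json(records, path):
    """Write the same records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


save_as_csv(reviews, "sample_reviews.csv")
save_as_json(reviews, "sample_reviews.json")
```

CSV is convenient for spreadsheets and pandas, while JSON keeps nested data intact if you later decide to scrape more fields.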

Python code implementation for scraping Amazon reviews

Installing Scrapy 

We will start by installing Scrapy on our system. There are two cases here. If you are using conda, you can install Scrapy from conda-forge using the following command


conda install -c conda-forge scrapy

In case you are not using conda, you can use pip and directly install it in your system using the below command


pip install scrapy

Next, we create a Scrapy project. A Scrapy project collects the different components of the crawler into a single folder. To create a Scrapy project, use the following command


scrapy startproject amazon_reviews_scraping

Once you have created the project, you will find two things inside it. One is a folder that contains your Scrapy code, and the other is the Scrapy configuration file (scrapy.cfg). The configuration file helps in running and deploying the Scrapy project on a server.

Scrapy config file

Once we have the project in place, we need to create a spider. A spider is a chunk of Python code which determines how a web page will be scraped. It is the main component that crawls different web pages and extracts content from them. In our case, this will be the code chunk that performs the task of visiting Amazon and scraping Amazon reviews. To create a spider, you can use the following command


scrapy genspider amazon_reviews your-link-here

The spider gets created within a spiders folder inside the project directory. Once you go into the Scrapy project, you will see a directory structure like the one below

Scrapy project directory structure

Scrapy files description

Let us understand the Scrapy project structure and its supporting files in a bit more detail. The main files inside the Scrapy project directory include

  1. items.py
    Items are containers that will be loaded with the scraped data.
  2. middlewares.py
    The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug in custom functionality to process the responses that are sent to spiders and to handle the requests and items that are generated by spiders.
  3. pipelines.py
    After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class.
  4. settings.py
    It allows one to customise the behaviour of all Scrapy components, including the core, extensions, pipelines and the spiders themselves.
  5. spiders folder
    The spiders folder is a directory which contains all spiders/crawlers as Python classes. Whenever one runs a spider, Scrapy looks into this directory and tries to find the spider with the name provided by the user. Spiders define how a certain site or a group of sites will be scraped, including how to perform the crawl and how to extract data from their pages.
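
To illustrate how items and pipelines fit together, here is a small sketch. It assumes a recent Scrapy version, which accepts dataclass objects as items; the names `ReviewItem` and `CleanWhitespacePipeline` are hypothetical and are not part of the project built in this blog. The pipeline is a plain Python class with a `process_item` method, so the snippet runs even without Scrapy installed.

```python
from dataclasses import dataclass


@dataclass
class ReviewItem:
    """Container for one review, as it might be declared in items.py."""
    stars: str
    comment: str


class CleanWhitespacePipeline:
    """Hypothetical pipeline component, as it might appear in pipelines.py."""

    def process_item(self, item, spider):
        # Collapse runs of spaces and newlines left over from the HTML layout
        item.stars = " ".join(item.stars.split())
        item.comment = " ".join(item.comment.split())
        return item


# Scrapy would call process_item itself; here we call it by hand to show the effect.
raw = ReviewItem(stars="  5.0 out of 5 stars\n", comment="\n  Great   laptop \n")
clean = CleanWhitespacePipeline().process_item(raw, spider=None)
print(clean)
```

In a real project, a pipeline like this only runs once it is listed in the ITEM_PIPELINES setting in settings.py. If the spider yields plain dictionaries instead of items, the pipeline would read and write `item["stars"]` and `item["comment"]` instead of attributes.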

For more detailed information on Scrapy components, you can refer to this link

Analysing HTML structure of the webpage

Before we actually start writing the spider implementation in Python for scraping Amazon reviews, we need to identify patterns in the target web page. Below is the page we are trying to scrape, which contains different reviews of the MacBook Air on Amazon.

Amazon reviews web page

We start by opening the web page using the inspect-element feature in the browser. There you can see the HTML code of the web page. After a little bit of exploration, I found the following HTML structure which renders the reviews on the web page

HTML code snippet for Amazon reviews

On the reviews page, there is a division with the id cm_cr-review_list. This division contains multiple sub-divisions within which the review content resides. We are planning to extract both the star rating and the review comment from the web page. We need to go one level deeper into one of these sub-divisions to prepare a scheme for fetching both the star rating and the review comment.

Detailed HTML code snippet of reviews

Upon further inspection, we can see that every review sub-division is further divided into multiple blocks. One of these blocks contains the required star rating, and another contains the text of the review. Looking more closely, we can see that the star rating division is identified by the class attribute “review-rating” and the review text by the class “review-text”. All we need to do now is pick these patterns up using our Scrapy parser.
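
The idea of picking up elements by class can be tried offline before touching Scrapy. The sketch below uses Python’s standard-library html.parser rather than Scrapy’s selectors, so it is a stand-in for the technique and not the parser used later. The markup in `SAMPLE_HTML` is a hypothetical, heavily simplified imitation of the structure described above, and `ClassTextCollector` and `texts_with_class` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup; the real Amazon page is far larger.
SAMPLE_HTML = """
<div id="cm_cr-review_list">
  <div class="review">
    <i class="review-rating"><span>5.0 out of 5 stars</span></i>
    <span class="review-text"><span>Great laptop</span></span>
  </div>
  <div class="review">
    <i class="review-rating"><span>3.0 out of 5 stars</span></i>
    <span class="review-text"><span>Battery could be better</span></span>
  </div>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class.

    Assumes every opened tag is closed (no bare <br> or <img> inside a match).
    """

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def texts_with_class(html, target_class):
    """Return the stripped text of each element that has target_class."""
    collector = ClassTextCollector(target_class)
    collector.feed(html)
    return [text.strip() for text in collector.results]


stars = texts_with_class(SAMPLE_HTML, "review-rating")
comments = texts_with_class(SAMPLE_HTML, "review-text")
print(list(zip(stars, comments)))
```

Scrapy’s CSS selectors do the same job in a single call, which is why the spider itself stays short.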

Defining Scrapy Parser in Python

Now that we have our spider template ready and have analysed the pattern in the target web page, we can start writing the logic for extracting reviews from Amazon. We begin by extending the Spider class and listing the URLs we plan on scraping. The variable start_urls contains the list of URLs to be crawled by the spider.

Basic Scrapy spider template

Then we need to define a parse function, which is called whenever our spider visits a new page. In the parse function, we identify the patterns in the targeted page structure. The spider then looks for these patterns and extracts them from the web page.

Below is a code sample of Scrapy parser for scraping Amazon reviews

# -*- coding: utf-8 -*-

# Importing Scrapy library
import scrapy


# Creating a new class to implement the spider
class AmazonReviewsSpider(scrapy.Spider):

    # Spider name
    name = 'amazon_reviews'

    # Domain names to scrape
    allowed_domains = ['amazon.in']

    # Base URL for the MacBook Air reviews (one string, split over lines for readability)
    myBaseUrl = (
        "https://www.amazon.in/Apple-MacBook-Air-13-3-inch-MQD32HN/product-"
        "reviews/B073Q5R6VR/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
        "&pageNumber="
    )
    start_urls = []

    # Creating list of URLs to be scraped by appending the page number at the end of the base URL
    for i in range(1, 121):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')

        # Collecting product star ratings
        star_rating = data.css('.review-rating')

        # Collecting user reviews
        comments = data.css('.review-text')

        # Combining the results: zip pairs each rating with the comment at the same position
        for review, comment in zip(star_rating, comments):
            yield {
                'stars': ''.join(review.xpath('.//text()').extract()),
                'comment': ''.join(comment.xpath('.//text()').extract()),
            }
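
The page URLs in the spider are built by appending a page number to a base URL. Note that range(1, 121) produces the numbers 1 to 120, because the stop value is excluded. The same step can be sketched as a small helper; `build_page_urls` and the example URL are hypothetical and only illustrate the pattern.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per review page by appending the page number."""
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


# Hypothetical base URL used only for illustration
example_base = "https://www.example.com/product-reviews/ABC123?pageNumber="
urls = build_page_urls(example_base, 1, 3)
print(urls)
```

Keeping the page range as parameters makes it easy to test the spider on a handful of pages before crawling all of them.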

Storing Scraped Results

Finally, we have successfully built our spider. The only task left is to run it. We can run the spider using the runspider command. It takes as input the spider file to run and the output file in which to store the collected results. In the case below, the spider file is amazon_reviews.py and the output file is reviews.csv

scrapy runspider amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py -o reviews.csv

EDA on Amazon reviews

In this section, we will try to do some exploratory data analysis on the data obtained after scraping Amazon reviews. We will be counting the overall rating of the product along with the most common words used for the product. Using pandas, we can read the CSV containing the scraped data.

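import pandas as pd
import matplotlib.pyplot as plt

# Read the scraped reviews into a DataFrame
dataset = pd.read_csv("reviews.csv")

# Count how many reviews gave each star rating and plot the counts
summarised_results = dataset["stars"].value_counts()
plt.bar(summarised_results.keys(), summarised_results.values)
plt.show()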

The above code summarises all the ratings and finds their total count. After that, it plots a bar chart to visualise the findings. We have used the matplotlib library here to visualise the results.

Distribution of star ratings
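
The scraped star ratings are text, so computing an average needs a numeric value first. The sketch below assumes the rating text starts with the number, for example "4.0 out of 5 stars"; that wording, the sample values and the helper `parse_rating` are assumptions for illustration and should be checked against your own reviews.csv.

```python
import re
from collections import Counter

# Hypothetical values for the "stars" column; the exact wording is an assumption.
star_texts = [
    "5.0 out of 5 stars",
    "4.0 out of 5 stars",
    "5.0 out of 5 stars",
    "1.0 out of 5 stars",
]


def parse_rating(text):
    """Return the leading number of a rating string as a float, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


ratings = [parse_rating(text) for text in star_texts]
valid = [rating for rating in ratings if rating is not None]

# Number of reviews per rating, and the overall average
distribution = Counter(valid)
average = sum(valid) / len(valid)
print(distribution, average)
```

With pandas, the same conversion could be applied to the whole column before plotting, so that the bars are ordered by rating rather than by count.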

Let us now try to visualise some of the keywords that are present in the scraped reviews. We can visualise these keywords using a word cloud. A word cloud works on the principle that the most frequent words in the text are drawn larger and bolder than the rest. The code snippet below can help you make a word cloud in Python

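from wordcloud import WordCloud

# Uses dataset and plt from the previous snippet
def visualise_word_map():
    # Join all comments into one lower-cased string
    words = " "
    for msg in dataset["comment"]:
        msg = str(msg).lower()
        words = words + msg + " "
    wordcloud = WordCloud(width=3000, height=2500, background_color='white').generate(words)
    # Draw the word cloud as an image on a 14 x 7 inch figure, without axes
    plt.figure(figsize=(14, 7))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()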
The image below is a word cloud generated by the above code snippet. Words like laptop, apple, product and Amazon appear in larger and bolder fonts, which indicates that they are used frequently. This makes sense because we scraped the MacBook Air’s user reviews from Amazon. You can also see words like amazing, good, awesome and excellent, indicating that many of the users liked the product.

Word cloud for scraped Amazon reviews
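
A word cloud is a picture of word frequencies, so it can help to look at the counts directly. Here is a minimal sketch using only the standard library; the sample comments, the tiny stop-word list and the helper `word_frequencies` are hypothetical and stand in for the "comment" column of reviews.csv.

```python
import re
from collections import Counter

# Hypothetical comments standing in for the "comment" column of reviews.csv
comments = [
    "Amazing laptop, good battery",
    "Good product, amazing display",
    "The laptop is good",
]

# A tiny illustrative stop-word list; real analyses use a much longer one
STOP_WORDS = {"the", "is", "a", "an", "and"}


def word_frequencies(texts):
    """Count lower-cased words across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


frequencies = word_frequencies(comments)
print(frequencies.most_common(3))
```

Counts like these can also be passed to the wordcloud library’s generate_from_frequencies method if you want control over which words are drawn.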

Datahut as your reliable scraping partner

There are a lot of tools that can help you scrape data yourself. However, if you need professional assistance, companies like Datahut can help you. We have a well-structured and transparent process for the same. We have helped enterprises across various industrial verticals. From assistance to the recruitment industry to retail solutions, Datahut has designed sophisticated solutions for most of these use-cases.

You should join the bandwagon of using data-scraping in your operations before it is too late. It will help you boost the performance of your organisation. Furthermore, it will help you derive insights that you might not know currently. This will enable informed decision-making in your business processes.

Conclusion

Using Scrapy, we were able to devise a method for scraping Amazon reviews using Python. There can be some roadblocks, however, as Amazon tends to block IPs that send requests too frequently. This can be a hindrance to your work. In such cases, make sure you rotate your IPs periodically and make less frequent requests to Amazon’s servers to avoid being blocked. You can read more about it here. Additionally, you can use proxy servers, which protect your home IP from being blocked while scraping Amazon reviews. With Datahut as your web-scraping partner, you will never worry about such issues.
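
Making less frequent requests can be configured in the project’s settings.py. The fragment below is a sketch of settings that slow a crawl down. The setting names are standard Scrapy settings, but the values are illustrative assumptions and not tuned for Amazon. Rotating IPs or proxies is not covered here, since it needs extra middleware or an external service.

```python
# Possible additions to settings.py; the values are illustrative, not tuned.

# Wait this many seconds between requests to the same site
DOWNLOAD_DELAY = 5

# Vary the actual wait around DOWNLOAD_DELAY so requests are less regular
RANDOMIZE_DOWNLOAD_DELAY = True

# Send at most one request at a time to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to how quickly the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

A slower crawl takes longer to finish, but it is gentler on the server and less likely to get your IP blocked.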
