Blocket is a popular Swedish marketplace where individuals and businesses buy and sell a wide range of goods. From a research and analysis point of view, Blocket's data can provide valuable insights into consumer preferences and industry trends.
In this blog, we will learn how to scrape Blocket to extract mobile phone data from the website. This data will help us track the popularity of different mobile brands, identify pricing patterns, and gain a deeper understanding of consumer behavior in the used mobile phone market.
Why Scrape Blocket?
Scraping data from Blocket opens a gateway to a wealth of insights. Here's how delving into Blocket's data through web scraping can benefit retailers:
Market Intelligence: By scraping Blocket's data, you gain access to real-world transactions and interactions. This enables you to grasp what products are in demand, what prices they fetch, and the overall pulse of the market.
Consumer Preferences: The data extracted from Blocket offers a direct window into the choices of consumers. You can uncover what types of products gain traction, which brands are more popular, and the features that attract buyers.
Pricing Patterns: Understanding how prices fluctuate on Blocket can aid businesses in setting competitive prices. By gauging the relationship between prices and factors like condition, brand, and features, you can make informed pricing decisions.
Trend Spotting: Blocket's data holds the power to reveal emerging trends. Whether it's certain brands gaining momentum or new product categories surfacing, this information can guide your strategic planning.
Business Strategy: For sellers, scraping Blocket provides a strategic edge by offering insights into successful selling tactics. Learning from other listings' strengths and weaknesses can inform your own sales approach.
Competitive Analysis: Scraping helps you monitor your competitors. By observing how others position their products, set prices, and engage with customers, you can adapt and refine your own strategies.
Data-Driven Decisions: In today's data-driven world, making decisions without reliable information can be risky. Scraping Blocket equips you with the data needed to back up your choices and minimize uncertainties.
Let's dive into the scraping process.
The Attributes
To begin the scraping process, we first need to identify the attributes that need to be extracted. The following attributes are extracted for each mobile phone ad on the website (an illustrative record follows the list).
product_url: It is the unique address of the mobile ad on the Blocket website.
product_name: It specifies the model of the mobile.
price: It is the selling price of the mobile.
description: It is a short detail about the device.
seller_name: It provides the name of the individual or organization selling the device.
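To make the target concrete, here is what one scraped record could look like. The values below are purely illustrative, not real listing data:
# Hypothetical example of a single scraped record (illustrative values only)
record = [
    "https://www.blocket.se/annons/...",   # product_url (truncated)
    "iPhone 12 64GB",                      # product_name
    "3 500 kr",                            # price
    "Lightly used, original box included", # description
    "Anna",                                # seller_name
]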
Required Libraries
The first step in any scraping process is to import the required libraries. We will be scraping Blocket using Selenium, a tool for automating web browsers. The following libraries are imported for the scraping process.
Selenium WebDriver is a tool for web automation. It allows a user to automate browser actions such as clicking a button, filling in fields, and navigating between pages.
ChromeDriverManager is a library that simplifies downloading and installing the Chrome driver, which Selenium requires to control the Chrome web browser.
BeautifulSoup is a Python library used for parsing and pulling data out of HTML and XML files.
The lxml library of Python is used for processing HTML and XML files. Its etree module implements the ElementTree API and is used here to turn the parsed HTML into a tree we can query with XPath.
The csv library is used to read and write tabular data in CSV format.
The time library provides time-related functions; here we use it to pause the program between requests.
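All of the third-party packages above can be installed from PyPI, for example with pip install selenium webdriver-manager beautifulsoup4 lxml; csv and time ship with Python's standard library.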
#importing required libraries
from bs4 import BeautifulSoup
from lxml import etree as et
from csv import writer
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
Scraping Process
After importing the required libraries, the next step is to initialize a few variables that we will use later in the program. The first variable is the pagination_url. On examining the Blocket website, we found that there are 40 pages of used mobile ads, each containing 40 products. We need to scrape every one of these pages, and for that purpose we use the pagination_url variable: it is a URL we identified to which we append the page number, forming the complete URL of a page. From each page we then scrape the URL of every mobile ad listed on it.
The scraped product links are relative URLs, so they are not valid on their own. Each is appended to a base_url to form a complete, valid URL. The number of pages, 40, is assigned to a variable named total_no_of_pages. We also initialize an empty list named product_list, in which we will store the URL of every product from each of the 40 pages.
#Initialization of variables
base_url = "https://www.blocket.se"
pagination_url = "https://www.blocket.se/annonser/hela_sverige/elektronik/telefoner_tillbehor/telefoner?cg=5061&page="
total_no_of_pages = 40
product_list = []
driver = webdriver.Chrome(ChromeDriverManager().install())
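A quick compatibility note on the last line above: webdriver.Chrome(ChromeDriverManager().install()) matches the Selenium 3 API. If you are on Selenium 4 or later, the driver path must instead be wrapped in a Service object; a minimal sketch, assuming selenium 4+ and webdriver-manager are installed:
# Selenium 4+ variant: the driver path is passed via a Service object
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))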
We need to open the web browser so that Selenium can interact with it and scrape the required details. For this, we create an instance of the Chrome web driver, letting ChromeDriverManager download and install a matching driver binary. This instance is assigned to a variable named driver. Next, we define a function named get_dom() which takes the URL of a page as input.
In this function, the Chrome driver first opens the URL, waits for the page to load, and then retrieves the page's source code via the driver.page_source attribute. This source, containing the HTML of the loaded page, is stored in a variable named page_content. We then create a BeautifulSoup object called product_soup by parsing the page source with the 'html.parser' parser, convert it to an ElementTree object using the et.HTML() method, and return the resulting DOM tree. This DOM is a hierarchical representation of the HTML structure of the page, which we can query with XPath expressions.
#function to get DOM from given URL
def get_dom(url):
    driver.get(url)
    time.sleep(8)
    page_content = driver.page_source
    product_soup = BeautifulSoup(page_content, 'html.parser')
    current_dom = et.HTML(str(product_soup))
    return current_dom
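The fixed time.sleep(8) is simple, but it wastes time on fast pages and can still be too short on slow ones. A more robust variant uses Selenium's explicit waits. The sketch below reuses the driver and imports defined above; waiting for the first <a> element is an assumption, so adjust the locator to an element you know appears once the page has rendered:
# Sketch: explicit wait instead of a fixed sleep (locator is an assumption)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_dom_with_wait(url, timeout=15):
    driver.get(url)
    # Block until at least one link is present, up to `timeout` seconds
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "a"))
    )
    page_content = driver.page_source
    product_soup = BeautifulSoup(page_content, 'html.parser')
    return et.HTML(str(product_soup))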
Extraction Process
As mentioned earlier, there are 40 pages of used mobile ads in total, each containing 40 ads. First, we navigate through each page and extract the links of all the products, using a for loop. On each iteration of the loop, the page number is concatenated with the pagination_url to form the URL of that page.
Next, we call the get_dom() function with this URL as the parameter, extract the links of all the products on that page, and store them in a list named page_product_list. This list is then added to product_list, which we initialized as an empty list at the beginning. When the for loop terminates, product_list will contain the links of all the products on all the pages.
# Extracting product links
for page_no in range(1, total_no_of_pages + 1):
    page_url = pagination_url + str(page_no)
    dom = get_dom(page_url)
    page_product_list = dom.xpath('//a[@class="Link-sc-6wulv7-0 styled__StyledTitleLink-sc-1kpvi4z-8 iBZYJF kFdxX"]/@href')
    product_list += page_product_list
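A caveat: the long class names in the XPath above are auto-generated by Blocket's CSS-in-JS tooling and tend to change whenever the site's front end is redeployed. If the query stops matching, a partial class match is often more resilient. The fragment below is taken from the class name at the time of writing, and the assumption that this part stays stable is ours:
# Sketch: partial class match, less brittle than the full generated name
page_product_list = dom.xpath('//a[contains(@class, "StyledTitleLink")]/@href')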
Now that we have the links of all the products, we can extract the required details by iterating through this list. On each iteration, the following functions are called, each of which extracts one particular detail.
# Extracting product name
def get_product_name(product_dom):
    try:
        product_name = product_dom.xpath('//h1[@class="TextHeadline3__TextHeadline3Wrapper-sc-10e1s2p-0 bcGlQz Hero__StyledSubject-sc-1mjgwl-4 keRXlo"]/text()')[0]
    except Exception:
        product_name = "Not available"
    return product_name

# Extracting product price
def get_price(product_dom):
    try:
        price = product_dom.xpath('//div[@class="TextHeadline2__TextHeadline2Wrapper-sc-1itsg3n-0 jufxzk Price__StyledPrice-sc-crp2x0-0 bqLhbX"]/text()')[0]
    except Exception:
        price = "Not available"
    return price

# Extracting product description
def get_desc(product_dom):
    try:
        desc = product_dom.xpath('//div[@class="TextBody__TextBodyWrapper-sc-cuv1ht-0 jigUjJ BodyCard__DescriptionPart-sc-15r463q-1 ixAjFo"]/text()')[0].replace('\n', ',')
    except Exception:
        desc = "Not available"
    return desc

# Extracting seller name
def get_seller(product_dom):
    try:
        seller = product_dom.xpath('//div[@class="styled__PrivateAdvertiser-sc-1f8y0be-4 JxCyo"]/div/text()')[0]
    except Exception:
        seller = "Not available"
    return seller
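As a design note, the four functions above repeat the same try/except pattern and differ only in their XPath. They could be collapsed into a single generic helper; a minimal sketch (extract_first is a hypothetical name, and any XPath passed to it must still match Blocket's current markup):
# Sketch: one generic extractor instead of four near-identical functions
def extract_first(product_dom, xpath, default="Not available"):
    try:
        return product_dom.xpath(xpath)[0]
    except Exception:
        return default
It would be called like product_name = extract_first(product_dom, '//h1[contains(@class, "StyledSubject")]/text()').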
Writing Data to a CSV File
Extracting the data is not enough; we need to store it somewhere so that we can use it for other purposes, such as analysis. Now we will see how to store the extracted data in a CSV file.
The following code opens a CSV file named 'blocket_mobile_data.csv' in write mode. Then we initialize a writer object named theWriter. The column names are stored in a list named heading and written to the CSV file using the writerow() function.
Then we iterate through each element of the list product_list. The list elements are relative product links, so each one is concatenated with the base_url to form a complete, valid URL. The get_dom() function is called with this URL as a parameter, and the returned DOM is stored in a variable named product_dom. Each attribute of the product is then extracted by calling the functions defined earlier, passing product_dom as the argument, and the extracted attributes are written to the CSV file as one row. After extracting each attribute, we call the sleep() method of the time library, which pauses the program for a few seconds; this helps avoid getting blocked during scraping. Once the loop has completed, we call driver.quit(), which closes the web browser that the Selenium web driver opened.
# Writing to a CSV file
with open('blocket_mobile_data.csv', 'w', newline='', encoding='utf-8') as f:
    theWriter = writer(f)
    heading = ['product_url', 'product_name', 'price', 'description', 'seller_name']
    theWriter.writerow(heading)
    for product in product_list:
        product_link = base_url + product
        product_dom = get_dom(product_link)
        product_name = get_product_name(product_dom)
        time.sleep(3)
        product_price = get_price(product_dom)
        time.sleep(3)
        desc = get_desc(product_dom)
        time.sleep(3)
        seller = get_seller(product_dom)
        time.sleep(3)
        record = [product_link, product_name, product_price, desc, seller]
        theWriter.writerow(record)

# Closing the web browser
driver.quit()
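As an optional sanity check, assuming the pandas package is installed, you can load the finished file and inspect the first few rows:
# Optional: quick look at the scraped data (requires pandas)
import pandas as pd

df = pd.read_csv('blocket_mobile_data.csv')
print(df.shape)   # up to 1600 rows: 40 pages x 40 ads, minus any missing
print(df.head())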
Wrapping up
In a nutshell, scraping data from Blocket's used mobile phone section arms both buyers and sellers with vital insights. Buyers can make smarter choices, picking the right mobile device, while sellers can better understand demand and pricing.
Ready to supercharge your e-commerce data game? Turn to DataHut's web scraping services and convert raw data into smart decisions. Contact us today and kickstart your journey towards smarter insights and success.