The Home Depot is a well-known American home improvement retailer. It offers a wide range of products for home improvement and construction, including smart devices. Smart devices have become an integral part of our lives, from smart thermostats that control our home's temperature to smart speakers that answer our questions. Home Depot's website holds a wealth of data that can help us understand consumer interests, the latest trends in smart technology, and pricing dynamics.
In this blog, we will learn how to scrape smart device data from Home Depot's website. This data can help us understand how much interest people have in smart technology.
Scraping Home Depot: The Attributes
Before we dive into the scraping process, let's define the attributes we want to extract:
product_url: The unique web address of a smart device on the Home Depot website.
product_name: The name and model of the smart device.
mrp: The price of the product as listed on its page.
rating: The user rating or review score of the smart device.
no_of_reviews: The total number of reviews for the smart device.
description: A brief description of the smart device's features.
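As a minimal sketch, the six attributes could be modelled as a small record type whose field names follow the CSV column names used later in this post. The class name ProductRecord and the placeholder URL are hypothetical, and the 'Not available' default mirrors the fallback value the extraction functions return.

```python
from dataclasses import dataclass, fields, astuple


@dataclass
class ProductRecord:
    """One scraped product; a hypothetical container for the six attributes."""
    product_url: str
    product_name: str = "Not available"
    mrp: str = "Not available"
    rating: str = "Not available"
    no_of_reviews: str = "Not available"
    description: str = "Not available"


# Column headings derived from the field names, in declaration order
HEADINGS = [f.name for f in fields(ProductRecord)]

# Placeholder URL for illustration only
record = ProductRecord(product_url="https://www.homedepot.com/p/example")
print(HEADINGS)
print(list(astuple(record)))
```

Keeping the headings and the row values in one place like this makes it harder for the CSV columns to drift out of order.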
Importing the Required Libraries
To begin our web scraping project, we first import the libraries we need:
webdriver from Selenium: a tool for web automation. It lets us automate browser actions such as clicking a button, filling in fields, and navigating to different websites.
The By class from selenium.webdriver.common.by: used to locate elements on a web page using strategies such as ID, class name, or XPath.
The writer function from the csv library: returns a writer object that writes tabular data to a file in CSV format.
The sleep function from the time library: pauses the execution of a program for a specified number of seconds.
# Importing the required libraries
from selenium import webdriver
from time import sleep
from csv import writer
from selenium.webdriver.common.by import By
Initialization Process
After importing the necessary libraries, we need to initialize our web scraping project. First, we create an instance of the Chrome web driver using the ChromeDriver executable path. The driver establishes a connection with the web browser, which here is Google Chrome. Once it is initialized, a Chrome browser window opens, and we maximize it using the maximize_window() function.
We then initialize two variables. The first is page_url, which is set to the first page of the search results. The second is product_links, an empty list that will store the links of all the products.
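# Specify the full path to the ChromeDriver executable
from selenium.webdriver.chrome.service import Service
chrome_driver_path = r"C:\Users\Dell\Downloads\chromedriver-win64\chromedriver-win64\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object;
# the older executable_path argument was removed in Selenium 4.10
driver = webdriver.Chrome(service=Service(chrome_driver_path))
driver.maximize_window()
page_url = 'https://www.homedepot.com/b/Smart-Home-Smart-Devices/N-5yc1vZ2fkp3e0'
product_links = []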
Getting the Product Links
Home Depot's website loads products dynamically as the user scrolls down. To capture the products, we scroll down the page and wait for the content to load. Once the products are loaded, we locate the product elements on the page using XPath and collect them with the find_elements() function, which returns them as a list. To get the actual product link from each element, we call the get_attribute() method on it, extract the ‘href’ attribute, and append the link to the product_links list we created earlier.
After all the product links on the page are extracted, we locate the ‘next’ page button using XPath and click it using the click() function. This opens the next page. We read its URL from the driver's current_url property and update page_url with it. The same extraction process then runs on this page to collect its product links. On the last page there is no ‘next’ button, so the lookup raises an exception, which the program catches in order to exit the while loop.
while True:
    driver.get(page_url)
    sleep(5)
    # Scrolling the page
    driver.execute_script("window.scrollTo(0, 1000)")
    sleep(5)
    # Extracting product links
    page_product_links = driver.find_elements(By.XPATH, '//div[@class="product-pod--ef6xv"]/a')
    for product in page_product_links:
        product_link = product.get_attribute('href')
        product_links.append(product_link)
    # Locating and clicking the next button
    try:
        next_button = driver.find_elements(By.XPATH, '//li[@class="hd-pagination__item hd-pagination__button"]')[-1]
        next_button.click()
        page_url = driver.current_url
    except Exception:
        # No next button found: stop paginating
        break
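The loop above scrolls a fixed 1000 pixels, which may or may not be enough to load every product on a page. One common alternative, sketched here and not taken from the original code, is to keep scrolling until the page height stops growing. The function name scroll_to_bottom and its pause and max_rounds parameters are hypothetical; the only driver method it relies on is execute_script(). Whether jumping straight to the bottom triggers every lazily loaded section on Home Depot's pages is untested here.

```python
from time import sleep


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        sleep(pause)  # give lazily loaded products time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Height unchanged after a scroll: assume nothing more will load
            return round_no
        last_height = new_height
    return max_rounds
```

In the loop above, a call such as scroll_to_bottom(driver) could stand in for the single window.scrollTo(0, 1000) step. The max_rounds cap keeps an endlessly growing page from stalling the scraper.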
Defining Functions
Next, we define functions to extract each attribute:
# Extracting product name
def get_product_name():
    try:
        product_name = driver.find_element(By.XPATH, '//h1[@class="sui-h4-bold sui-line-clamp-unset"]').text
    except Exception:
        product_name = 'Not available'
    return product_name

# Extracting product price
def get_mrp():
    try:
        mrp = driver.find_elements(By.XPATH, '//div[@class="price-format__large price-format__main-price"]/span')
        mrp = mrp[1].text
    except Exception:
        mrp = 'Not available'
    return mrp

# Extracting product rating
def get_rating():
    try:
        rating = driver.find_elements(By.XPATH, '//div[@class="ratings-reviews__accordion-subheader"]/span')[0].text
    except Exception:
        rating = 'Not available'
    return rating

# Extracting number of reviews
def get_reviews():
    try:
        reviews = driver.find_elements(By.XPATH, '//div[@class="ratings-reviews__accordion-subheader"]/span')[1].text
    except Exception:
        reviews = 'Not available'
    return reviews

# Extracting product description
def get_desc():
    try:
        desc = driver.find_element(By.XPATH, '//ul[@class="sui-text-base sui-list-disc list list--type-square"]').text
    except Exception:
        desc = 'Not available'
    return desc
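All five functions share the same try/except shape: attempt a lookup and fall back to 'Not available' on any error. As a minimal sketch, that pattern could be factored into one helper. The name safe_text is hypothetical, and the lambdas below are stand-ins for real Selenium lookups so the example runs without a browser.

```python
def safe_text(extract, default="Not available"):
    """Call `extract` (a zero-argument callable) and return its result,
    falling back to `default` if it raises any exception."""
    try:
        return extract()
    except Exception:
        return default


# Stand-in extractors; with Selenium, the callable would wrap a lookup such as
# lambda: driver.find_element(By.XPATH, '//h1[@class="sui-h4-bold sui-line-clamp-unset"]').text
print(safe_text(lambda: "Smart Plug"))  # extractor succeeds
print(safe_text(lambda: [][1]))         # IndexError, so the default is returned
```

The trade-off is that catching every exception also hides genuine bugs such as a mistyped XPath, so logging the exception inside the helper may be worthwhile.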
Writing to a CSV File
Now that we have defined our extraction functions, let's write the data to a CSV file.
First, we open a file named “homedepot_data.csv” in write mode and create a writer object named theWriter. The column headings of the CSV file are stored in a list and then written to the file using the writerow() function.
Next, we extract the information about each product. We iterate through each link in product_links, load the page with the driver's get() function, and call the functions defined earlier to extract the required attributes. The attribute values are stored in a list and then written to the CSV file using the writerow() function. After the process is complete, we call quit(), which closes the web browser that was opened by the Selenium web driver.
Note that the sleep() function is called between the different function calls. These pauses slow down the requests to reduce the chance of being blocked by the website.
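The pauses described above use fixed durations. As a hedged sketch, they could be wrapped in a helper that adds a small random extra delay so that requests are not perfectly evenly spaced. Whether this makes any difference to how a given site treats the scraper is an assumption, and the name polite_sleep and its parameters are hypothetical.

```python
import random
from time import sleep


def polite_sleep(base=3.0, jitter=2.0, sleeper=sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used. `sleeper` can be swapped out in tests.
    """
    delay = base + random.uniform(0, jitter)
    sleeper(delay)
    return delay
```

A call such as polite_sleep(3, 2) could then replace a plain sleep(3) between page loads.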
# Writing to a CSV File
with open('homedepot_data.csv', 'w', newline='', encoding='utf-8') as f:
    theWriter = writer(f)
    heading = ['product_url', 'product_name', 'mrp', 'rating', 'no_of_reviews', 'description']
    theWriter.writerow(heading)
    for product in product_links:
        driver.get(product)
        sleep(5)
        driver.execute_script("window.scrollTo(0, 1000)")
        sleep(8)
        product_name = get_product_name()
        sleep(3)
        mrp = get_mrp()
        sleep(3)
        rating = get_rating()
        sleep(3)
        no_of_reviews = get_reviews()
        sleep(3)
        desc = get_desc()
        sleep(3)
        record = [product, product_name, mrp, rating, no_of_reviews, desc]
        theWriter.writerow(record)

# Close the browser once every product has been written
driver.quit()
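Once the file is written, a quick check of how many rows came back and how many fields fell through to 'Not available' can reveal whether any XPath has stopped matching. The following is a minimal sketch using only the standard library; the function name summarize_csv is hypothetical.

```python
import csv
from collections import Counter


def summarize_csv(path, missing="Not available"):
    """Return (row_count, Counter mapping column name to missing-value count)."""
    missing_counts = Counter()
    row_count = 0
    with open(path, newline="", encoding="utf-8") as f:
        # DictReader uses the first row (the headings) as the keys
        for row in csv.DictReader(f):
            row_count += 1
            for column, value in row.items():
                if value == missing:
                    missing_counts[column] += 1
    return row_count, missing_counts
```

Calling summarize_csv('homedepot_data.csv') after a run returns the number of products scraped and, per column, how many values are missing. A column that is missing in every row suggests its XPath needs updating.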
Wrapping up
Web scraping is a powerful technique for gathering data from websites like Home Depot, where valuable insights into consumer preferences and the latest trends in smart devices can be found. In this blog, we learned how to scrape data related to smart devices from Home Depot's website using Python and Selenium. The same process can be adapted to explore other product categories and websites, enabling you to make data-driven decisions in today’s fast-evolving world.