How to Automate Lowe’s Product Data Extraction Using Web Scraping
- Ambily Biju
- Apr 2
- 34 min read

Introduction
Did you know that most e-commerce businesses use web scraping to track competitor pricing, monitor product availability, and analyze customer sentiment? In today's data-driven economy, businesses depend on automated data collection to gain a competitive edge. Instead of gathering details manually, companies can use web scraping to extract vast amounts of data quickly and efficiently.
This project focuses on automating the extraction of product data from Lowe's, one of the largest home improvement retailers. Using Python, the scraper navigates Lowe's website, collects essential product details, including titles, prices, ratings, and reviews, and stores the data in an SQLite database. The structured data can be used for market analysis, price comparisons, and inventory tracking, helping businesses make data-driven decisions.
This documentation is designed for e-commerce managers, data analysts, and developers who want to leverage web scraping for business intelligence. Whether you're monitoring pricing trends, analyzing customer feedback, or researching market demand, this project provides a reliable and scalable approach to collecting valuable product insights.
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites using scripts or programs. Instead of manually copying and pasting information, a web scraper accesses web pages, retrieves relevant data, and organizes it for further analysis. This technique is widely used in various industries, particularly in e-commerce, finance, and research, where large-scale data collection is necessary.
Businesses use web scraping for multiple purposes, such as tracking competitor prices, analyzing customer reviews, monitoring stock availability, and gathering product specifications for database management. By automating data extraction, companies can save time, reduce errors, and gain real-time insights into market trends.
In this project, web scraping enables the automated collection of product data from Lowe’s website. Using Playwright, the scraper efficiently navigates through the website, interacts with dynamic elements, and extracts structured data while handling challenges like changing page structures, missing data, and connection failures.
Overview of the Project
The initial stage consists of collecting the URLs of the product pages from the Lowe's category pages. The scraper steps through different sections of the website, detects HTML anchor (<a>) tags that contain product links, and saves them to an SQLite database. It maintains data integrity by skipping duplicate URLs and keeping the collected links available for the later stages of the project.
After acquiring the product URLs, the scraper visits every product page and extracts information such as the product name, price, rating, total reviews, and other details. It collects information only through static HTML selectors, so only the necessary data is targeted. The captured information is then saved in an SQLite database for structured, easy access in the future.
To increase reliability and performance, the project includes a number of vital capabilities. It captures dynamic content with Playwright, guaranteeing that JavaScript-rendered elements load correctly. It also features error handling for common problems such as pages failing to load, lost connections, and incomplete data. Moreover, the scraper records its progress and any failures, which helps in reviewing the completed work and diagnosing errors.
This project aids companies and researchers by automating the data extraction process from Lowe’s.
An Overview of Libraries for Seamless Data Extraction
Playwright
Playwright is a powerful automation library that lets users control web browsers entirely through code. It exposes a rich browser API and is extremely well suited to scraping sites that rely heavily on JavaScript rendering. For example, it makes it easy to interact with web pages: scrolling through long listings, clicking buttons, filling out forms, or taking screenshots on demand.
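As a quick illustration, here is a minimal sketch of driving a browser with Playwright's async API; the URL and screenshot filename are placeholders rather than part of this project.
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")      # placeholder URL
        print(await page.title())                   # title of the rendered page
        await page.screenshot(path="example.png")   # capture the page as an image
        await browser.close()

asyncio.run(main())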
Asyncio
Asyncio is a standard Python library that supports asynchronous programming, letting you write concurrent code with the async and await syntax. In web scraping, asyncio is important because it allows many requests to be in flight at once, which makes the process far more efficient: the scraper can do other work while waiting for responses from web servers. This reduces the total time needed to scrape a set of URLs and makes the scraping process more responsive.
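The sketch below uses a hypothetical fetch coroutine that simply sleeps to stand in for a network call; the point is how asyncio.gather lets several waits overlap instead of running one after another.
import asyncio

async def fetch(name, delay):
    # Simulates waiting on a web server response
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # The three simulated requests wait concurrently, so the total
    # runtime is roughly 3 seconds instead of 1 + 2 + 3 = 6 seconds
    results = await asyncio.gather(
        fetch("page-1", 1),
        fetch("page-2", 2),
        fetch("page-3", 3),
    )
    print(results)

asyncio.run(main())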
SQLite3
SQLite3 is a lightweight disk-based database that can be easily set up and used within Python applications. It is very useful for storing structured data, like the URLs and product details fetched during the scraping process. By using SQLite3, scrapers can create a persistent database that avoids duplicate entries and organizes data efficiently. In this way, it becomes easy to query and retrieve scraped information for further analysis or reporting.
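A minimal sketch of the idea, using an illustrative example.db file and a links table that are not the ones used later in this project:
import sqlite3

conn = sqlite3.connect("example.db")   # illustrative database file
conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT UNIQUE)")
conn.executemany(
    "INSERT OR IGNORE INTO links (url) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/a",), ("https://example.com/b",)],
)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
print(count)   # 2 -- the duplicate URL was ignored
conn.close()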
Random
Random is part of Python's standard library and generates pseudo-random numbers, letting a program pick random values or delays. In web scraping, it is used to insert a random pause between successive requests to the website. With randomized sleep times, the scraper's traffic looks more like human browsing than an automated bot, so it is less likely to be flagged by a site's anti-bot mechanisms. This keeps the connection to the target website stable and helps the scraper stay within the site's usage policies.
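A small sketch of the pattern used throughout this project, pausing for a random interval between requests:
import asyncio
import random

async def polite_pause():
    # Pick a different delay each time so requests are not evenly spaced
    delay = random.uniform(2, 5)
    print(f"Sleeping for {delay:.2f} seconds before the next request...")
    await asyncio.sleep(delay)

asyncio.run(polite_pause())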
Beautiful Soup
Beautiful Soup is a Python library for parsing HTML and XML documents. It builds a clean, well-structured parse tree of the page source, which makes searching the document straightforward. In this project, BeautifulSoup is used to pull specific elements out of the fetched HTML, such as product titles, prices, descriptions, or specifications. Its friendly syntax lets scrapers query and manipulate HTML content by tag, attribute, and text, enabling accurate and reliable data extraction.
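A tiny sketch of the idea; the markup below is invented for illustration, whereas in this project the HTML comes from pages rendered by Playwright.
from bs4 import BeautifulSoup

html = '<div class="product"><h3><a href="/pd/smart-camera/100">Smart Camera</a></h3><span class="price">$49.98</span></div>'
soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("h3 a")
print(link.get_text(strip=True))                           # Smart Camera
print(link["href"])                                        # /pd/smart-camera/100
print(soup.select_one("span.price").get_text(strip=True))  # $49.98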
Why SQLite Outperforms CSV for Web Scraping Projects
Given its simplicity, reliability, and efficiency, SQLite is a strong choice for storing scraped data. It is self-contained, serverless, and requires no administration. Because it writes data to disk, it suits tasks like this one, where a large number of URLs must be saved quickly. Its lightweight design also keeps complex queries fast, whereas CSV files become cumbersome and slow as the dataset grows.
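For instance, once the product_links table created in Step 1 has been populated, a single query answers a question that would otherwise mean loading and filtering an entire CSV file; this is just a sketch of that idea.
import sqlite3

conn = sqlite3.connect("lowes_webscraping.db")
# One line of SQL replaces manual filtering over a CSV export
pending = conn.execute(
    "SELECT COUNT(*) FROM product_links WHERE status = 0"
).fetchone()[0]
print(f"{pending} product URLs still waiting to be scraped")
conn.close()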
STEP 1: Product Link Scraping
Importing Libraries
import asyncio
from playwright.async_api import async_playwright
import random
import sqlite3
This code imports the essential libraries for link scraping: asyncio for asynchronous execution, async_playwright from Playwright for browser automation, random for adding randomized delays, and sqlite3 for database interaction.
Defining Base URLs for Categories
# Base URLs for each category
BASE_URLS = {
"security_cameras": "https://www.lowes.com/pl/home-security/security-surveillance-cameras/security-cameras/4294546211?offset={}",
"smart_doorbells_locks": "https://www.lowes.com/pl/smart-home/smart-home-security/smart-doorbells-locks/37721669146465?offset={}",
"smart_home_bundles": "https://www.lowes.com/pl/smart-home/smart-devices/smart-home-bundles/2311714614847?offset={}",
"smart_light_bulbs": "https://www.lowes.com/pl/smart-home/smart-lighting/smart-light-bulbs/37721669146457?offset={}",
"smart_speakers_displays": "https://www.lowes.com/pl/smart-home/smart-devices/smart-speakers-displays/2011455432077?offset={}",
"home_alarms_sensors": "https://www.lowes.com/pl/home-security/home-alarms-sensors/1217527669?offset={}"
}
This section of the code defines a dictionary named BASE_URLS that stores the base URL for each product category at Lowe's, such as "security cameras" and "smart doorbells and locks." Each URL contains an offset placeholder (offset={}) to handle pagination while scraping multiple pages of products in a category. Later in the code, the base URL is formatted with each offset value to navigate through the product list for that category, as the quick example below shows. This configuration allows page URLs to be built dynamically, which makes the scraper easy to scale across different product categories.
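For example, formatting the security cameras base URL with a few offsets produces the page URLs the scraper will visit.
# Quick check of how the offset placeholder expands into concrete page URLs
base_url = ("https://www.lowes.com/pl/home-security/security-surveillance-cameras/"
            "security-cameras/4294546211?offset={}")
for offset in [0, 24, 48]:
    print(base_url.format(offset))
# ...security-cameras/4294546211?offset=0
# ...security-cameras/4294546211?offset=24
# ...security-cameras/4294546211?offset=48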
Setting Offset Values for Pagination
# Offset values for pagination in each category
OFFSET_VALUES = {
"security_cameras": [0, 24, 48, 72, 96],
"smart_doorbells_locks": [0, 24, 48, 72, 96, 120, 144],
"smart_home_bundles": [0, 24, 48, 72, 96],
"smart_light_bulbs": [0, 24, 48],
"smart_speakers_displays": [0], # Only one page for this category
"home_alarms_sensors": [0, 24, 48]
}
This section defines a dictionary, OFFSET_VALUES, that holds the pagination offsets for each product category. The Lowe's website displays a limited number of products per page, and these offset values are used to navigate through multiple pages. For example, the "security cameras" category has products spread across five pages, with each page displaying 24 products, hence the offset values of [0, 24, 48, 72, 96]. Other categories, like "smart speakers and displays," have only one page, so the offset value is [0]. These values will be appended to the URLs in the BASE_URLS dictionary so that product data is scraped from every page of each category and no data is missed.
Base URL for Constructing Complete Product Links
# Base URL to prepend to relative URLs
BASE_URL_PREFIX = "https://www.lowes.com"
Here is the definition of the constant BASE_URL_PREFIX that is the base URL for the Lowe's website. Sometimes, relative URLs (i.e., without the full domain) are scraped for product links. These relative URLs are prepended using BASE_URL_PREFIX so that all extracted links are complete and valid. By appending the relative paths to this base URL, the scraper can generate full product URLs that can be stored in the database or used for further data extraction.
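For instance, a relative href scraped from a listing page (the path below is made up for illustration) becomes a complete product URL like this:
relative_href = "/pd/Example-Smart-Camera/5001234567"   # illustrative relative link
print(BASE_URL_PREFIX + relative_href)
# https://www.lowes.com/pd/Example-Smart-Camera/5001234567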
Setting Up the SQLite Database
# Database setup
def setup_database(db_path):
"""
Set up the SQLite database for storing scraped product URLs.
This function creates a connection to an SQLite database using the provided
file path and initializes a table called `product_links` if it doesn't
already exist. The table contains two columns:
- `product_url`: Stores unique product URLs.
- `status`: An integer that represents the processing status of each URL,
which is initialized to 0 by default.
Args:
db_path (str): The file path of the SQLite database.
Returns:
sqlite3.Connection: A connection object to interact with the SQLite database.
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Create the table with product_url and status (default 0)
cursor.execute('''
CREATE TABLE IF NOT EXISTS product_links (
product_url TEXT UNIQUE,
status INTEGER DEFAULT 0
)
''')
conn.commit()
return conn
The setup_database function initializes the SQLite database used to store the scraped product URLs. It connects to the database at the provided file path (db_path) and creates a table named product_links if it does not already exist. The table contains two columns: product_url, which holds unique product links, and status, which tracks the processing state of each URL and defaults to 0. The function returns a connection object for interacting with the SQLite database, which guarantees that scraped data can be stored efficiently and accessed for further use.
Inserting URLs into the Database
# Insert URLs into the SQLite table, ignoring duplicates
def insert_urls_to_db(conn, urls):
"""
Insert a list of product URLs into the SQLite database, ignoring duplicates.
This function inserts each URL from the provided list into the `product_links`
table. If a URL already exists in the table, it will be ignored to avoid
duplicate entries.
Args:
conn (sqlite3.Connection): The connection object for the SQLite database.
urls (list of str): A list of product URLs to be inserted into the database.
Returns:
None
"""
cursor = conn.cursor()
cursor.executemany('''
INSERT OR IGNORE INTO product_links (product_url) VALUES (?)
''', [(url,) for url in urls])
conn.commit()
The insert_urls_to_db function inserts a list of product URLs into the SQLite database while ignoring duplicates. It takes two arguments: conn, the SQLite connection object used to interact with the database, and urls, the list of product URLs to insert. A cursor executes an INSERT OR IGNORE SQL statement that attempts to insert every URL into the product_links table; the OR IGNORE clause ensures that any URL already present in the table is skipped, preventing duplication. Finally, the changes are committed with conn.commit(). The function returns nothing but handles the insertion of unique URLs into the database efficiently.
Scraping Product URLs from a Webpage
# Scrape URLs from a given page
async def scrape_page(page, url, all_hrefs):
"""
Scrape product URLs from a given page and append them to the provided list.
This function navigates to the specified URL using a Playwright `page` object,
scrolls to the bottom of the page to load all dynamic content, and extracts
URLs from `<a>` tags within `<h3>` elements. The URLs are then formatted
with the base URL if they are relative links and appended to the provided list.
Args:
page (playwright.async_api.Page): The Playwright page object used for navigation
and scraping.
url (str): The URL of the page to be scraped.
all_hrefs (list of str): A list to store the scraped product URLs.
Returns:
None
Raises:
Exception: Logs any error encountered during scraping and continues the process.
"""
try:
await page.goto(url, timeout=60000)
await page.wait_for_load_state('load')
# Scroll to the bottom to load all content
previous_height = await page.evaluate("document.body.scrollHeight")
while True:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2)
current_height = await page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break
previous_height = current_height
await asyncio.sleep(5) # Ensuring all content is fully loaded
# Locate <a> tags within <h3> and extract href
h3_links = await page.locator("h3 a").all()
hrefs = [await link.get_attribute('href') for link in h3_links if await link.get_attribute('href') is not None]
# Prepend the base URL to relative URLs
complete_urls = [BASE_URL_PREFIX + href if not href.startswith('http') else href for href in hrefs]
all_hrefs.extend(complete_urls)
print(f"Scraped {len(complete_urls)} URLs from {url}")
except Exception as e:
print(f"Error scraping {url}: {e}")
The scrape_page function scrapes product URLs from a given webpage using Playwright's asynchronous API. It takes a Playwright page object, a target url, and a list all_hrefs in which the scraped URLs are collected. The function navigates to the page, waits for it to load, and then scrolls to the bottom by executing JavaScript so that dynamically loaded content appears; it keeps scrolling and re-checking the page height until the height stops changing, which means no further content is loading. It then locates every <a> tag inside an <h3> tag and reads its href attribute, which is the product URL. Any relative URL is prefixed with BASE_URL_PREFIX to make it absolute. All of the resulting URLs are appended to the all_hrefs list, ready for further processing. Errors are caught and logged rather than stopping execution, which keeps the scraping run robust, and the function prints the number of URLs scraped from the page to help track progress.
Scraping URLs for Specific Categories and Inserting Them into the Database
# Process each category and insert URLs to the database
async def process_category(browser, conn, category, base_url, offsets):
"""
Scrape product URLs for a specific category and insert them into the database.
This function handles scraping product URLs for a specific category by iterating
through pages using the provided offset values. For each page, it formats the
base URL with the current offset, opens the page in a new Playwright browser
tab, scrapes the URLs, and then closes the tab. After scraping all pages,
the URLs are inserted into the SQLite database. A random delay is introduced
between each page request to avoid overloading the server.
Args:
browser (playwright.async_api.Browser): The Playwright browser instance used for
opening new pages.
conn (sqlite3.Connection): The SQLite database connection object.
category (str): The name of the category being scraped (for logging purposes).
base_url (str): The base URL of the category with a placeholder for pagination
offset.
offsets (list of int): A list of offset values for pagination to iterate over
pages.
Returns:
None
"""
all_hrefs = [] # To store scraped URLs for this category
for offset in offsets:
url = base_url.format(offset)
page = await browser.new_page()
await scrape_page(page, url, all_hrefs)
await page.close()
# Random delay before next request
delay = random.uniform(2, 5)
print(f"Waiting for {delay:.2f} seconds before the next request...")
await asyncio.sleep(delay)
# Insert URLs to the database after the category scraping is done
insert_urls_to_db(conn, all_hrefs)
The process_category function is responsible for scraping product URLs for a given category and inserting them into the SQLite database. It takes a Playwright browser instance, an SQLite connection conn, the category name, a base_url containing the pagination placeholder, and offsets, the list of offset values to iterate over. For each offset, it formats base_url with the current value, opens a new tab in the Playwright browser, and calls scrape_page to fetch the URLs from that page and append them to the all_hrefs list. After scraping each page, it closes the tab and waits for a random delay of 2 to 5 seconds to avoid overloading the server. Once every page for the category has been processed, all collected URLs are written to the database with insert_urls_to_db, making them available for later use. This structure makes it straightforward to loop through many pages for categories of any size.
Running the Main Scraping Process for Multiple Categories
# Main function to run the scraping process
async def run():
"""
Main function to execute the web scraping process for multiple categories.
This function orchestrates the entire scraping process by:
1. Setting up the SQLite database to store product URLs.
2. Launching a Playwright browser instance.
3. Iterating through predefined categories and scraping URLs from multiple pages
using offset values for pagination.
4. Inserting the scraped URLs into the database.
5. Closing the browser and database connection after the scraping process is complete.
Args:
None
Returns:
None
"""
db_path = 'lowes_webscraping.db'
# Set up database connection
conn = setup_database(db_path)
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
# Scrape each category
for category, base_url in BASE_URLS.items():
print(f"Scraping category: {category}")
await process_category(browser, conn, category, base_url, OFFSET_VALUES[category])
await browser.close()
# Close the database connection
conn.close()
# Run the asynchronous function
asyncio.run(run())
The run function is the orchestrator of the whole web scraping process. It first initializes the connection to an SQLite database through the setup_database function with a database file called lowes_webscraping.db. The database will store all product URLs scraped. It launches a Playwright browser instance, which can be used for automated interaction with web pages.
The function prints the category name for every defined category in the BASE_URLS dictionary and then calls process_category to scrape URLs from multiple pages. The process_category function handles navigation through every category's pages by predefined pagination offsets, scraping URLs, and inserting them into the database. After all categories have been processed, the browser is closed, and the database connection is terminated to ensure proper resource management.
This design allows the categories to be scraped in sequence, stores all of their URLs efficiently in a single database, and manages resources such as the browser and the database connection correctly throughout.
STEP 2: Detailed Product Data Scraping From Product Links
Importing Libraries
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import random
import sqlite3
This code imports the essential libraries for product data scraping: asyncio for asynchronous execution, async_playwright from Playwright for browser automation, BeautifulSoup from the bs4 library for parsing HTML content, random for adding randomized delays, and sqlite3 for database interaction.
Initializing the SQLite Database and Tables
# Initialize SQLite database and tables
def init_db():
"""
Initializes the SQLite database and creates necessary tables.
This function establishes a connection to the SQLite database and creates two tables:
`final_product_data` for storing successfully scraped product details, and
`failed_urls` for logging URLs that could not be scraped, along with associated error
messages.
The `final_product_data` table includes the following columns:
- id: Primary key, auto-incremented.
- product_url: The URL of the product.
- title: The title of the product.
- price: The price of the product.
- review_count: The number of reviews for the product.
- rating: The rating of the product.
- description: A brief description of the product.
- specification: Additional specifications of the product.
- features: Key features of the product.
The `failed_urls` table includes the following columns:
- id: Primary key, auto-incremented.
- product_url: The URL of the product that failed to scrape.
- error_message: The error message describing the reason for the failure.
Returns:
None
"""
conn = sqlite3.connect('lowes_webscraping.db')
c = conn.cursor()
c.execute('''
CREATE TABLE IF NOT EXISTS final_product_data (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_url TEXT,
title TEXT,
price TEXT,
review_count TEXT,
rating TEXT,
description TEXT,
specification TEXT,
features TEXT
)
''')
c.execute('''
CREATE TABLE IF NOT EXISTS failed_urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_url TEXT,
error_message TEXT
)
''')
conn.commit()
conn.close()
The init_db function initializes the SQLite database used to store the product data scraped from the website. It opens a connection to lowes_webscraping.db and creates the two required tables: final_product_data and failed_urls. The final_product_data table holds the successfully scraped details, including the product URL, title, price, review count, rating, description, specifications, and key features, with an auto-incrementing primary key for each record. The failed_urls table stores any URL that could not be scraped together with an error message explaining the cause of the failure, so that problems encountered during the run are recorded for later analysis. Both tables are created only if they do not already exist, which means the function can be run multiple times without errors from re-creating existing tables. Finally, it commits the changes and closes the connection, completing the database setup.
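If you want to confirm the setup worked, an optional check like the one below lists the tables in the database file; it is not part of the scraper itself.
import sqlite3

conn = sqlite3.connect('lowes_webscraping.db')
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)   # expect final_product_data and failed_urls, plus product_links from Step 1
conn.close()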
Inserting Scraped Data into the Database
# Function to insert scraped data into the database
def insert_data(data):
"""
Inserts scraped product data into the SQLite database.
This function connects to the SQLite database and inserts a new record into the
`final_product_data` table. The record includes details about a product obtained
from web scraping.
Parameters:
data (dict): A dictionary containing the product details to be inserted, which
must include:
- product_url (str): The URL of the product.
- title (str): The title of the product.
- price (str): The price of the product.
- review_count (str): The number of reviews for the product.
- rating (str): The rating of the product.
- description (str): A brief description of the product.
- specification (str): Additional specifications of the product.
- features (str): Key features of the product.
Returns:
None: This function does not return a value. It commits the data to the database
and closes the connection.
Raises:
sqlite3.Error: If an error occurs while interacting with the database.
"""
conn = sqlite3.connect('lowes_webscraping.db')
c = conn.cursor()
c.execute('''
INSERT INTO final_product_data (product_url, title, price, review_count, rating, description, specification, features)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
''', (data['product_url'], data['title'], data['price'], data['review_count'], data['rating'],
data['description'], data['specification'], data['features']))
conn.commit()
conn.close()
The insert_data function writes the scraped product data into the SQLite database. When executed, it opens a connection to lowes_webscraping.db and inserts a new record into the final_product_data table. The data is passed in as a dictionary that must contain the product URL, title, price, review count, rating, description, specifications, and key features. These values are added using a parameterized SQL INSERT statement. After executing the insertion, the function commits the changes to save the new record and then closes the connection. Any sqlite3.Error raised while interacting with the database propagates to the caller, so errors can be handled appropriately during insertion. This ensures that the scraped product information is stored in a form that can be retrieved later for analysis.
Logging Failed URLs into the Database
# Function to insert failed URLs into the database
def insert_failed_url(url, error_message):
"""
Inserts failed URL data into the SQLite database.
This function connects to the SQLite database and logs a failed URL along with
an associated error message into the `failed_urls` table. This can be useful for
debugging and tracking issues encountered during the web scraping process.
Parameters:
url (str): The URL of the product that failed to be scraped.
error_message (str): A message describing the error that occurred while
attempting to scrape the product.
Returns:
None: This function does not return a value. It commits the data to the
database and closes the connection.
Raises:
sqlite3.Error: If an error occurs while interacting with the database.
"""
conn = sqlite3.connect('lowes_webscraping.db')
c = conn.cursor()
c.execute('''
INSERT INTO failed_urls (product_url, error_message)
VALUES (?, ?)
''', (url, error_message))
conn.commit()
conn.close()
The insert_failed_url function logs failed URLs and their associated error messages to the SQLite database, which is important for debugging and for tracking problems that occur during web scraping. When called, it opens a connection to lowes_webscraping.db and inserts a new row into the failed_urls table. It takes two parameters: url, the URL of the product that could not be scraped, and error_message, a description of the problem encountered while trying to scrape it. The function uses a parameterized SQL INSERT statement to add this data safely, preventing SQL injection. After running the insertion, it commits the changes so the logged failure is persisted and then closes the connection. As with the other database interactions, an sqlite3.Error is raised if something goes wrong, allowing proper error handling. Keeping this record of failures aids debugging and increases the reliability of the scraping workflow.
Updating URL Status After Successful Scraping
# Update URL status after successful scraping
def update_url_status(url):
"""
Updates the scraping status of a product URL in the SQLite database.
This function connects to the SQLite database and updates the `status`
of a specific product URL in the `product_links` table to indicate
that the scraping for this URL has been completed successfully.
This is useful for tracking the progress of scraped URLs.
Parameters:
url (str): The URL of the product whose status is to be updated.
Returns:
None: This function does not return a value. It commits the
changes to the database and closes the connection.
Raises:
sqlite3.Error: If an error occurs while interacting with the database.
"""
conn = sqlite3.connect('lowes_webscraping.db')
c = conn.cursor()
c.execute('''
UPDATE product_links SET status = 1 WHERE product_url = ?
''', (url,))
conn.commit()
conn.close()
The update_url_status function updates the status of a product URL in the product_links table of the SQLite database to indicate that scraping for that URL completed successfully. This is important for tracking progress and keeping an accurate record of which products have been processed. When called, the function opens a connection to lowes_webscraping.db and executes a SQL UPDATE statement. It takes a single parameter, url, identifying the product page whose status should be updated, and uses a parameterized query to guard against SQL injection. After setting the status field to 1 (meaning success), it commits the transaction to save the change and closes the connection. If any of these database interactions fail, an sqlite3.Error is raised so the error can be handled properly. Overall, this function provides visibility into the scraping process and helps maintain data integrity across the collection workflow.
Fetching Unscraped URLs from the Database
# Function to fetch unscraped URLs from the database (status=0)
def get_unscraped_urls():
"""
Fetches URLs from the database that have not yet been scraped.
This function connects to the SQLite database and retrieves all
product URLs from the `product_links` table where the `status`
is set to 0, indicating that these URLs have not been scraped yet.
Returns:
list: A list of product URLs that have not been scraped.
Raises:
sqlite3.Error: If an error occurs while interacting with the database.
"""
conn = sqlite3.connect('lowes_webscraping.db')
c = conn.cursor()
c.execute('''
SELECT product_url FROM product_links WHERE status = 0
''')
urls = [row[0] for row in c.fetchall()]
conn.close()
return urls
The get_unscraped_urls function retrieves every product URL in the SQLite database that has not yet been processed by the scraping workflow. It connects to lowes_webscraping.db and runs a SQL SELECT against the product_links table, selecting all URLs whose status equals 0, meaning they have not been scraped yet. A cursor executes the query and fetches all returned rows, and a list comprehension takes the first element of each row, which is the product URL. After building the list of unscraped URLs, the function closes the database connection to free resources and returns the list to the caller. If anything goes wrong while interacting with the database, an sqlite3.Error is raised so the problem can be handled correctly. This function is critical to managing the scraping process because it lets the scraper efficiently identify which URLs still need processing.
Extracting Features from HTML Content
# Process HTML content for extracting features
def process_html_content(html_content):
"""
Extracts features from the provided HTML content using BeautifulSoup.
This function parses the given HTML content to find all tables
with the class 'TableWrapper-sc-ys35zb-0'. It extracts the key-value
pairs from the table's rows and constructs a string representation
of the features. The dictionary name is taken from the table's
header (thead), and the key-value pairs are formatted as
'key:value'.
Args:
html_content (str): The HTML content to be processed.
Returns:
str: A string representation of the extracted features from the tables.
If no tables are found, an empty string is returned.
Raises:
AttributeError: If the structure of the HTML content does not
contain the expected table elements.
"""
soup = BeautifulSoup(html_content, 'html.parser')
tables = soup.find_all('table', class_='TableWrapper-sc-ys35zb-0')
features = []
for table in tables:
thead = table.find('thead')
dictionary_name = thead.get_text(strip=True) if thead else 'No dictionary name found'
table_features = f"{dictionary_name}:{{"
tbody = table.find('tbody')
if tbody:
key_value_pairs = []
for row in tbody.find_all('tr'):
cells = row.find_all('td')
if len(cells) > 1:
key_value_pairs.append(f"{cells[0].get_text(strip=True)}:{cells[1].get_text(strip=True)}")
table_features += ','.join(key_value_pairs) + "}"
features.append(table_features)
return ' '.join(features)
The process_html_content function extracts product features from raw HTML using BeautifulSoup. It first builds a BeautifulSoup object to parse the content, then looks for every table element with the class TableWrapper-sc-ys35zb-0, since the feature data appears inside these tables in a structured form. For each table it tries to locate the header (thead) to derive a dictionary name describing that table's contents; if no header is found, it falls back to 'No dictionary name found'.
Next, the function checks the table for a body (tbody), which is where the key-value pairs live. It loops through the rows of the tbody and, for each row with more than one cell, pairs the text of the first and second cells. The pairs are formatted as key:value and collected in a list; once all pairs for a table have been gathered, they are concatenated into a string of the form dictionary_name:{key:value,key:value}.
Lastly, the function joins the feature strings from all tables into a single space-delimited string. If no tables are found, it returns an empty string. An AttributeError may be raised if the HTML structure deviates from what the parser expects, which lets the caller handle parsing failures. This function is critical because it converts unstructured HTML into a structured form that is easier to store and analyze.
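To see the output format, the snippet below feeds the function a tiny hand-written table that mimics the expected markup; the class name matches the selector above, but the specification values are invented purely for illustration.
# A small check of process_html_content using made-up specification values
sample_html = """
<table class="TableWrapper-sc-ys35zb-0">
  <thead><tr><th>General</th></tr></thead>
  <tbody>
    <tr><td>Color</td><td>White</td></tr>
    <tr><td>Wireless</td><td>Yes</td></tr>
  </tbody>
</table>
"""
print(process_html_content(sample_html))
# General:{Color:White,Wireless:Yes}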
Scrolling Through a Web Page for Dynamic Content Loading
# Scroll through the product page to load all content
async def scroll_page(page):
"""
Scrolls through a webpage to load all dynamic content.
This asynchronous function simulates scrolling down the page
until no new content is loaded. It keeps scrolling until the
height of the document remains constant, indicating that all
content has been loaded.
Args:
page : The page object representing the
webpage to scroll.
Raises:
Exception: Raises an exception if there is an issue
with page evaluation or scrolling.
Usage:
await scroll_page(page)
"""
previous_height = await page.evaluate("document.body.scrollHeight")
while True:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2)
current_height = await page.evaluate("document.body.scrollHeight")
if current_height == previous_height:
break
previous_height = current_height
The scroll_page function is an asynchronous method that is designed to simulate scrolling across a webpage to load all content dynamically.
When invoked, it first reads the initial height of the document body using page.evaluate("document.body.scrollHeight"); this serves as the comparison point for deciding whether more content is still being loaded. The function then enters a loop in which it scrolls to the bottom of the page with window.scrollTo(0, document.body.scrollHeight) and pauses for 2 seconds with await asyncio.sleep(2) to give new content time to load.
After each scroll, the function checks the current height of the document body. If the height has not changed since the last check, no new content was loaded and the loop exits. If it has changed, previous_height is updated to the new value and scrolling continues. This repeats until all dynamic content has been loaded.
Because the function is asynchronous, scrolling does not block other tasks, which makes it efficient for web scraping workflows, especially on pages that rely on JavaScript to load more content. It is a standard part of scraping pipelines where complete content loading is a prerequisite for accurate data collection.
Extracting the Product Title from a Web Page
# Extract product title
async def extract_title(page):
"""
Extracts the product title from a webpage.
This asynchronous function retrieves the product title
from the specified HTML element on the page. It searches
for the element using a predefined CSS selector. If the
title element is found, its text content is returned;
otherwise, an empty string is returned.
Args:
page : The page object representing
the webpage to extract the title from.
Returns:
str: The extracted product title if found, or an empty string
if the title element does not exist.
Usage:
title = await extract_title(page)
"""
title_element = await page.query_selector('h1.styles__H1-sc-11vpuyu-0.krJSUv')
if title_element:
return (await title_element.text_content()).strip()
return ''
The extract_title function is an asynchronous function that fetches the title of a product from a given webpage. On invocation, it accepts one argument: page, the webpage object from which the title will be extracted.
The function first locates the product title element using the predefined CSS selector h1.styles__H1-sc-11vpuyu-0.krJSUv, which targets the <h1> element containing the product title. It uses the page object's query_selector() method to find this element and, once found, reads its text with text_content().
Any leading or trailing whitespace is removed with strip() before the title is returned. If the title element is not found on the page, the function returns an empty string.
This function matters in the scraping workflow because the title is usually the key piece of information for identifying and categorizing a product. Wrapping the extraction logic in an asynchronous function keeps it efficient, since other tasks can run in parallel while the title extraction awaits the page.
Extracting the Review Count from a Product Web Page
# Extract review count
async def extract_review_count(page):
"""
Extracts the review count from a product webpage.
This asynchronous function retrieves the review count
from the specified HTML element on the page. It searches
for the element using a predefined CSS selector. If the
review count element is found, its text content is returned;
otherwise, an empty string is returned.
Args:
page: The page object representing
the webpage to extract the review count from.
Returns:
str: The extracted review count if found, or an empty string
if the review count element does not exist.
Usage:
review_count = await extract_review_count(page)
"""
reviews_element = await page.query_selector('div[data-testid="trigger"] span.sc-dhKdcB.kPpRJe')
if reviews_element:
return (await reviews_element.text_content()).strip()
return ''
The extract_review_count function is an asynchronous function that fetches the review count from a given product webpage. It has a single parameter, page, the page object from which the review count will be extracted.
Inside the function, the review count element is located with the CSS selector div[data-testid="trigger"] span.sc-dhKdcB.kPpRJe, which targets a <span> inside a <div> carrying a data-testid attribute. The function uses the page object's query_selector method to find this element.
If the review count element exists, its text is read with text_content() and any leading or trailing whitespace is removed with strip(), so the returned value is clean and well formatted. If the element does not exist, the function returns an empty string.
This function is valuable in the scraping workflow because it collects a key product feedback metric used for analysis, comparison, and understanding customer sentiment. Because it is asynchronous, other operations can continue in parallel while the review count extraction awaits the page.
Extracting Product Rating from a Web Page
# Extract rating
async def extract_rating(page):
"""
Extracts the rating from a product webpage.
This asynchronous function retrieves the rating of a product
from the specified HTML element on the page. It searches
for the element using a predefined CSS selector. If the rating
element is found, its 'aria-label' attribute is returned;
otherwise, an empty string is returned.
Args:
page : The page object representing
the webpage to extract the rating from.
Returns:
str: The extracted rating if found, or an empty string
if the rating element does not exist.
Usage:
rating = await extract_rating(page)
"""
rating_element = await page.query_selector('div.sc-kAyceB.ijRZHV span[role="img"]')
if rating_element:
return (await rating_element.get_attribute('aria-label')).strip()
return ''
The extract_rating function is an asynchronous function that pulls the rating of a product from a webpage. It requires one parameter: page, the webpage object from which the rating is extracted. The function first locates the rating element using a predefined CSS selector that targets a <span> with role="img" inside a <div> carrying specific class names, which is where the rating information lives. If the rating element is found, the function reads the value of its aria-label attribute, which typically contains text describing the rating, for example "4.5 out of 5 stars". Any leading or trailing whitespace is stripped before the value is returned. If no rating element is found on the page, the function returns an empty string. This functionality is essential for analyzing product performance based on user ratings, and it supports better data collection by providing deeper insight into customer satisfaction and product quality.
Extracting Product Price from a Web Page
# Extract price
async def extract_price(page):
"""
Extracts the price from a product webpage.
This asynchronous function retrieves the price of a product
from the specified HTML element on the page. It looks for the
main price container and extracts both the dollar and cent parts
of the price. If both parts are found, they are combined into
a single string representing the final price; otherwise, an empty
string is returned.
Args:
page : The page object representing
the webpage to extract the price from.
Returns:
str: The extracted price if found, or an empty string
if the price element does not exist.
Usage:
price = await extract_price(page)
"""
# Select the correct div that contains the price
price_element = await page.query_selector('div.main-price.medium.split.split-left')
if price_element:
# Extract the dollar part and cent part of the price
dollar_part = await price_element.query_selector('span.item-price-dollar')
cent_part = await price_element.query_selector('span.PriceUIstyles__Cent-sc-14j12uk-0.bktBXX.item-price-cent')
if dollar_part and cent_part:
# Combine dollar and cent parts to form the final price
return f"{(await dollar_part.text_content()).strip()}{(await cent_part.text_content()).strip()}"
return ''
The extract_price function is an asynchronous function that retrieves the price of a product from a given page. It accepts a single argument, the page object from which the price is read. The function first selects the div element that holds the main price information using a predefined CSS selector. Inside that container, it looks for the two parts of the price, the dollar portion and the cent portion, identified by their respective CSS classes. If both parts are found, the function reads their text content, strips any surrounding whitespace, and combines them into a single string representing the full price. If either the dollar or the cent portion is missing, the function returns an empty string, meaning the price could not be extracted. Accurate price extraction is essential for building pricing data that can feed further product analysis and competitive market research.
Extracting Product Description from a Web Page
# Extract description
async def extract_description(page):
"""
Extracts the product description from a product webpage.
This asynchronous function retrieves the product description
from the specified HTML element on the page. It locates the
description within the appropriate div structure and returns
the text content, stripped of any leading or trailing whitespace.
If the description element is not found, an empty string is returned.
Args:
page : The page object representing
the webpage from which to extract
the description.
Returns:
str: The extracted product description if found, or an
empty string if the description element does not exist.
Usage:
description = await extract_description(page)
"""
# Select the correct div that contains the description
description_element = await page.query_selector('span.accordion-content.opened div.sc-esYiGF.bIlmEA.overviewWrapper div.romance')
if description_element:
# Return the text content of the description, stripped of any leading/trailing whitespace
return (await description_element.text_content()).strip()
return ''
The extract_description function is an asynchronous function intended to fetch the product description from a given webpage. It takes one argument, page, the page object used to fetch the description. The function starts by locating the description element with a predefined CSS selector that points to the div.romance block nested inside the opened accordion <span>. If the description element exists, the function returns its text content after stripping any leading and trailing whitespace to ensure a clean output. If the description element does not exist, the function returns an empty string, signaling that the description could not be extracted. This is essential for capturing complete product information and improving the quality of the data gathered for further analysis or display.
Extracting Product Specifications from a Web Page
# Extract specifications
async def extract_specifications(page):
"""
Extracts product specifications from a product webpage.
This asynchronous function retrieves the product specifications
from the specified HTML element on the page. It locates the
specifications within a list structure and returns the text
content of each bullet point, joined into a single string
with line breaks. If the specifications element is not found,
an empty string is returned.
Args:
page : The page object representing
the webpage from which to extract
the specifications.
Returns:
str: The extracted product specifications if found, or an
empty string if the specifications element does not exist.
Usage:
specifications = await extract_specifications(page)
"""
# Select the correct div that contains the specifications
specs_element = await page.query_selector('span.accordion-content.opened div.specs ul.bullets')
if specs_element:
# Get all the bullet points (li) elements inside the specs list
bullet_points = await specs_element.query_selector_all('li p')
if bullet_points:
# Extract the text content of each bullet point and return them as a joined string
specs = [await li.text_content() for li in bullet_points]
return '\n'.join(specs)
return ''
The extract_specifications function is asynchronous and collects a product's specifications from its webpage. It accepts a single argument, the page object for the product. The function first looks for the expected structure: an unordered list (<ul>) with the class bullets, nested inside the specifications <div> within the opened accordion <span>. If the specifications element is found, the function gathers the paragraph elements inside each <li> bullet point and reads the text content of each one. The individual strings are then joined into a single string with line breaks between them, which keeps the result easy to read. If the element cannot be located, the function returns an empty string, meaning no specification data was available. This approach retrieves detailed product specifications efficiently and improves the overall data collection process for later analysis or presentation.
Extracting All Product Details from a Web Page
# Extract all product details
async def extract_product_details(page, data):
"""
Extracts all relevant product details from a product webpage.
This asynchronous function aggregates multiple data extraction
functions to retrieve essential product information from the
specified HTML page. The extracted details are stored in the
provided `data` dictionary.
Args:
page: The page object representing
the webpage from which to extract
the product details.
data (dict): A dictionary that will hold the extracted product
details. The function updates this dictionary
with the following keys:
- 'title'
- 'review_count'
- 'rating'
- 'price'
- 'description'
- 'specification'
Returns:
None: The function modifies the `data` dictionary in place
and does not return any value.
"""
data['title'] = await extract_title(page)
data['review_count'] = await extract_review_count(page)
data['rating'] = await extract_rating(page)
data['price'] = await extract_price(page)
data['description'] = await extract_description(page)
data['specification'] = await extract_specifications(page)
The extract_product_details function gathers complete product details from a webpage by delegating to the individual extraction functions. It is asynchronous and takes two parameters: page, the object holding the product page, and data, the dictionary in which the details are stored.
The function makes a series of asynchronous calls to the specific extraction functions extract_title, extract_review_count, extract_rating, extract_price, extract_description, and extract_specifications. Each one retrieves a single piece of information, such as the product title, review count, rating, price, description, or specifications, and the result is stored in the data dictionary under the corresponding key.
By updating the dictionary in place, the function collects all product details into one structure for easy access and further processing. It simplifies the data collection workflow by ensuring every needed field is fetched from the page, and because it returns nothing and mutates the dictionary directly, it slots neatly into larger scraping pipelines.
Scraping Individual Product Pages
# Scrape individual product pages
async def scrape_page(page, url):
"""
Scrapes an individual product page for details and stores the data in a database.
This asynchronous function navigates to the specified product URL,
extracts relevant product details such as title, price, review count,
rating, description, specifications, and features, and then saves
the information to a database. If any errors occur during the
scraping process, the function logs the URL and error message to
a separate database table for failed URLs.
Args:
page : The page object representing the
browser context to navigate and
extract data from.
url (str): The URL of the product page to scrape.
Returns:
None: The function modifies the data dictionary in place, inserts
the scraped data into the database, and updates the URL
status in the database. It does not return any value.
Raises:
Exception: If an error occurs during the scraping process, it
is caught and logged.
"""
data = {
'product_url': url,
'title': '',
'price': '',
'review_count': '',
'rating': '',
'description': '',
'specification': '',
'features': ''
}
try:
await page.goto(url, timeout=60000)
await page.wait_for_load_state('load')
await asyncio.sleep(2)
# Scroll to the bottom
await scroll_page(page)
# Extract product details
await extract_product_details(page, data)
# Extract HTML content for additional features
html_content = await page.content()
data['features'] = process_html_content(html_content)
# Insert data into the database
insert_data(data)
update_url_status(url)
except Exception as e:
print(f"Error scraping {url}: {e}")
insert_failed_url(url, str(e))
The scrape_page function scrapes the details of a single product page and writes them to the database. It is asynchronous and takes two parameters: page, the browser page used for navigation and data extraction, and url, the specific product page to scrape.
When called, the function creates a dictionary named data that will hold everything known about the product: its URL, title, price, review count, rating, description, specifications, and features. It then navigates to the given URL using the page object's goto method with a 60-second timeout so the page has enough time to load, waits for the load state, and adds a short delay to let things settle. Next it calls scroll_page to scroll to the bottom of the page, which is usually needed to trigger the lazy loading of product details.
The function then calls extract_product_details, passing both the page and the data dictionary, to retrieve and store the product information, and it reads the full HTML content of the page so that process_html_content can extract any additional feature tables.
Once all the data has been collected, the record is inserted into the database via insert_data, and the URL's status is updated with update_url_status. If an exception occurs along the way, for example because of a network problem or a change in the page layout, the error is caught, printed, and recorded against the URL in a separate table via insert_failed_url. This error handling lets the scraping run continue even when individual pages fail, making the process more resilient and the resulting dataset more complete.
Main Function to Control the Scraping Process
# Main function to control the scraping process
async def run():
"""
Controls the overall web scraping process, initializing the database,
fetching unscraped URLs, and launching the browser to scrape each page.
This asynchronous function is responsible for coordinating the scraping
workflow by performing the following steps:
1. Initializes the SQLite database and creates the necessary tables.
2. Retrieves a list of product URLs that have not been scraped yet.
3. Launches a Chromium browser instance using Playwright.
4. Iterates over each URL, creating a new page for scraping.
5. Calls the `scrape_page` function to extract data from each product page.
6. Closes the page after scraping and waits for a random delay before
proceeding to the next URL, simulating human-like browsing behavior.
7. Closes the browser once all URLs have been processed.
Returns:
None: This function does not return any value, as its primary purpose
is to manage the scraping workflow.
Raises:
Exception: Any exceptions occurring during the scraping process
will be raised and should be handled appropriately in
the calling context.
"""
init_db()
urls = get_unscraped_urls()
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
for url in urls:
page = await browser.new_page()
await scrape_page(page, url)
await page.close()
delay = random.uniform(2, 4)
print(f"Waiting for {delay:.2f} seconds before navigating to the next URL...")
await asyncio.sleep(delay)
await browser.close()
# Run the asynchronous function
asyncio.run(run())
The main run function orchestrates the entire web scraping process: it performs the necessary initialization, retrieves the pending URLs, and scrapes each product page. Execution is asynchronous. First, init_db() creates the SQLite database file and the tables needed to store the scraped information. The function then fetches the list of product URLs that have not yet been scraped by calling get_unscraped_urls().
Next, it uses Playwright's async_playwright() context manager to launch a Chromium browser. The headless parameter is set to False so the browser window stays visible during scraping, which helps when debugging the process. The function then loops over each of the URLs fetched earlier: for every URL, it opens a new page with browser.new_page() and calls scrape_page to extract the relevant data from that product page.
After each URL has been scraped, the page is closed with await page.close(). The function also introduces a random delay between 2 and 4 seconds using random.uniform(2, 4) to mimic human-like browsing behavior before moving on to the next URL. This pause helps avoid detection by the website's anti-scraping measures.
Once all URLs have been processed, the browser is closed with await browser.close(). The function returns no value, since its purpose is to manage the overall scraping run. Any exception raised during execution propagates to the caller, where it should be handled appropriately, keeping the scraper robust and reliable. Finally, asyncio.run(run()) starts an asynchronous event loop and runs the scraping task.
Libraries and Versions
This project relies on a few key libraries for web scraping and data processing. The third-party versions used are BeautifulSoup4 (v4.12.3) for parsing HTML content and Playwright (v1.46.0) for automating browser interactions; asyncio, sqlite3, and random ship with Python's standard library. These versions ensure smooth integration and functionality throughout the scraping workflow.
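A typical way to install the pinned third-party versions is shown below; Playwright also needs its Chromium browser binaries downloaded once before the scraper can run.
pip install beautifulsoup4==4.12.3 playwright==1.46.0
playwright install chromium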
Conclusion
The goal of the project has been accomplished: product data extraction from Lowe's is fully automated with Playwright in Python, in keeping with the original scope of data automation. Dynamic content, pagination, and errors are handled well, providing a robust solution for business and research purposes. The collected information lends itself perfectly to price monitoring, market research, and inventory management, adding web scraping to the value chain of decision making.
Connect with Datahut for top-notch web scraping services that bring you the valuable insights you need hassle-free.