Web scraping is a technique for extracting data automatically from web pages. It involves writing a program that sends requests to web pages, downloads their contents, and parses them to obtain specific pieces of information. Web scraping makes it possible to collect enormous amounts of data far faster than gathering it manually.
For this project, we focus our scraping on H&M's web pages, specifically the women's clothing section, and we are interested in product names, prices, descriptions, and more. This kind of data is a valuable source of information for market analysis, price monitoring, or even building a product database.
Structured Workflow - H&M Web Scraping Project:
Extracting URLs: We first scrape the URLs of the products in H&M's women's clothing categories. This is the basis of our data gathering: it gives us a full list of products to analyse.
Comprehensive Data Collection: With the product URLs in hand, we work through them systematically to collect detailed data about each product. This includes product names, prices, descriptions, sizes, colours, and any other relevant attributes.
This structured workflow helps us gather data efficiently and maintain data quality throughout the process. With these steps completed, we have a solid foundation for further analysis and insights into H&M's women's apparel.
We will use the following tools in order to perform the web scraping task:
Playwright: This is a browser automation library that lets us control web browsers programmatically. It can handle dynamic content and pages rendered with JavaScript, which H&M, like most modern e-commerce sites, relies on. We use it to move around the H&M website, interact with buttons and forms as needed, handle scrolling, and wait for content to load. Playwright also supports multiple browser engines, which gives us flexibility in our scraping approach.
Beautiful Soup: This is a Python library for parsing HTML and XML documents. After Playwright has loaded a page, we use Beautiful Soup to extract data from the HTML content. Beautiful Soup offers an intuitive way of searching, navigating, and modifying the parse tree of an HTML document, and it copes well with badly formatted or deeply nested HTML, which we will encounter in various places on the H&M website. We will use it to find and extract product information from HTML tags, classes, and IDs.
SQLite Database: We will use SQLite to store the scraped data. It is a lightweight, file-based database that is well suited to projects like this, where we need to save structured data without the overhead of a full database server. SQLite lets us create tables to store our scraped data, such as product details, in a structured format. We can insert new records as we scrape and later query the database for analysis or export it. Because SQLite is self-contained and serverless, it is easy to use from our Python script without any further configuration.
After we have harvested data from the web, cleaning it becomes essential to ensure the quality and usability of the information. Tools such as OpenRefine and Python can be used to standardise formats, remove duplicate entries, and correct inconsistencies. For more involved cleaning, Python together with libraries such as pandas comes in handy: for example, removing residual HTML tags, converting data types, and applying more advanced text manipulation to product descriptions. This clean-up step is a vital part of our web scraping pipeline, since it guarantees that the information gathered from H&M's website is consistent and reliable, ready for the analysis or application that follows.
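To make this concrete, here is a minimal pandas sketch of the kind of clean-up described above. It assumes the scraped records have already been written to the product_data table defined later in this post, and that prices arrive as strings; the column names and the price format are taken from that schema rather than from any guaranteed H&M output.
import sqlite3
import pandas as pd

# Load the scraped records (assumes the product_data table from later in this post).
conn = sqlite3.connect('product_urls.db')
df = pd.read_sql_query('SELECT * FROM product_data', conn)
conn.close()

# Drop duplicate rows that share the same product URL.
df = df.drop_duplicates(subset='url')

# Strip any residual HTML tags from the description text.
df['description'] = df['description'].str.replace(r'<[^>]+>', '', regex=True)

# Convert price strings such as 'Rs. 1,499.00' into numeric values;
# unparseable entries (e.g. 'N/A') become NaN.
df['original_price'] = pd.to_numeric(
    df['original_price'].str.replace(r'[^\d.]', '', regex=True),
    errors='coerce'
)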
SCRAPING PRODUCT URLS
This is a Python script that efficiently scrapes product URLs from multiple H&M category pages. The script uses asynchronous programming with asyncio and browser automation with the Playwright framework, which makes it possible to handle several operations at once. It also includes SQLite database functionality for holding and managing the scraped URLs, ensuring data persistence without duplication across multiple scraping sessions.
The main features of this scraper are its ability to handle dynamically loaded content by clicking "Load More" buttons, its scalability for processing category pages in bulk, and its robust error handling. Thanks to its modular structure, with the URL-scraping functions separated from the database operations, the code is easy to adapt to similar e-commerce scraping tasks. It is a solid foundation for larger web scraping projects, especially those targeting e-commerce websites structured like H&M's.
Importing Libraries
import asyncio
import sqlite3
from playwright.async_api import async_playwright
The first import, 'asyncio', brings in Python's asynchronous programming library. Asyncio is crucial for writing concurrent code with the async/await syntax. In web scraping, it allows many I/O-bound operations to run at once, such as sending several HTTP requests at the same time. This can speed up our process considerably, especially when scraping more than one page or website. Asyncio provides an event loop that manages and schedules these asynchronous tasks, so concurrent operations execute efficiently.
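As a toy illustration of the async/await model (independent of H&M or Playwright), the snippet below runs two simulated I/O waits concurrently, so the total time is roughly that of the slowest task rather than the sum of both:
import asyncio

async def fetch(name, delay):
    # Simulate an I/O-bound wait, such as a network request.
    await asyncio.sleep(delay)
    return f"{name} done"

async def demo():
    # Both simulated requests run concurrently: about 2 seconds total, not 3.
    results = await asyncio.gather(fetch("page-1", 2), fetch("page-2", 1))
    print(results)

asyncio.run(demo())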
The next import, 'sqlite3', gives us access to SQLite, the lightweight, file-based relational database we will use to store the scraped URLs.
The last import is 'async_playwright' from the 'playwright.async_api' module. This brings in a very powerful tool for web browser automation. Playwright allows us to control a browser programmatically and asynchronously, which is particularly convenient for web scraping tasks: we can navigate pages, interact with elements, and retrieve data, all within asyncio's concurrent execution model. This is especially handy on interactive, JavaScript-heavy sites, which Playwright handles with ease. It also offers features such as auto-waiting for elements to become ready, which makes the scraping code more robust and reliable.
Together, these three imports create a powerful environment for web scraping: asynchronous operations make the scraping efficient, Playwright copes with a modern, JavaScript-heavy website, and the scraped data is stored locally in a database for easy access and manipulation. This setup is particularly valuable for large-scale web scraping projects, where performance and data management matter most. asyncio, sqlite3, and Playwright together provide a solid foundation on which to build a sophisticated, efficient web scraping system.
Database Initialization
# Initialise the database connection and create the table for product URLs
def initialize_database(db_name='product_urls.db'):
"""
Initialises an SQLite database and creates a table to store product URLs if it doesn't exist.
Args:
db_name (str): The name of the SQLite database file (default is 'product_urls.db').
The table 'product_urls' will have two columns:
- id (INTEGER): A unique identifier for each URL (auto-incremented primary key).
- url (TEXT): The URL of the product page (unique).
"""
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS product_urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE
)
''')
conn.commit()
conn.close()
This part of the code defines the initialize_database function, which sets up the SQLite database used to store the product URLs. The function starts by connecting to an SQLite database; SQLite creates the database file if it does not yet exist. By default it is named 'product_urls.db', but you can pass a different name when calling the function.
Once connected, the function creates a cursor object. A cursor is similar to a pointer for database operations: it lets us execute SQL commands against the database and retrieve the results.
Next comes a SQL statement that creates the table 'product_urls' if it is not already present. The table stores unique product URLs and has two columns: 'id' and 'url'. The 'id' column is an auto-incrementing integer primary key, which ensures each record has a distinct identifier. The 'url' column holds text data and is declared unique so that duplicate URLs cannot be inserted.
After executing the SQL command to create the table, the function commits the changes to the database. This is important because it saves all the modifications made over the course of the connection. Finally, the function closes the database connection. This is good database management practice: it frees up system resources and makes sure all changes were properly saved.
Calling this function at the start of our scraping process ensures that we have a properly structured database ready to store all the product URLs the project will scrape. It also keeps the URLs organised in a way that is easy to handle and retrieve once the scraping is under way.
Save Scraped URLs to the Database
# Save the scraped product URLs to the SQLite database
def save_to_database(product_urls, db_name='product_urls.db'):
"""
Saves a list of product URLs to the SQLite database. Duplicate entries are ignored.
Args:
product_urls (list): A list of product URLs (strings) to save to the database.
db_name (str): The name of the SQLite database file (default is 'product_urls.db').
This function uses the 'INSERT OR IGNORE' SQL statement to avoid inserting duplicate URLs.
"""
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
# Insert the URLs into the database, avoiding duplicates
cursor.executemany('''
INSERT OR IGNORE INTO product_urls (url) VALUES (?)
''', [(url,) for url in product_urls])
conn.commit()
conn.close()
This section defines the save_to_database function, which saves the scraped product URLs into SQLite. It takes two arguments: product_urls, a list of URLs to be stored, and db_name, the name of the database file, which defaults to 'product_urls.db' if not supplied. Like initialize_database, the function first opens a connection to the SQLite database and creates a cursor object; these steps are required before we can work with the database.
The core of the function is the SQL command itself: an 'INSERT OR IGNORE' statement, a handy SQLite feature. It attempts to insert every URL into the 'product_urls' table, and if a URL already exists (remember, we declared the 'url' column unique), that insertion is simply skipped without raising an error. This deals with duplicate URLs efficiently, without having to check for their presence beforehand.
The insertion is carried out with the executemany method, a fast way to insert multiple records in one go. The function prepares the data as a list of single-value tuples, one URL each, matching the single placeholder in the query. After performing the insertions, the function commits them so the changes are persisted in the database, and then closes the connection, following good database management practice.
With this function, we can easily save batches of scraped URLs to our database. Duplicate URLs are handled gracefully, so the database never accumulates redundant entries, and we can run the scraping process several times, perhaps spread over days or weeks, without duplicating data.
Load Previously Scraped URLs
# Load previously scraped URLs from the SQLite database
def load_scraped_urls(db_name='product_urls.db'):
"""
Loads all previously scraped URLs from the SQLite database.
Args:
db_name (str): The name of the SQLite database file (default is 'product_urls.db').
Returns:
set: A set of URLs that have already been scraped, to avoid duplicates during the scraping process.
"""
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
cursor.execute('SELECT url FROM product_urls')
scraped_urls = {row[0] for row in cursor.fetchall()} # Fetch all URLs as a set for quick lookups
conn.close()
return scraped_urls
The load_scraped_urls function retrieves all previously scraped URLs saved in the SQLite database. As described above, it is important for keeping our web scraping process efficient and maintaining its integrity. Like the previous functions, it starts by opening the SQLite database specified by db_name and creating a cursor object.
The core operation is a simple SQL SELECT that retrieves every URL stored in the 'product_urls' table, efficiently collecting all the product URLs scraped in previous runs of the script.
After running the query, the function uses a set comprehension to build a set of the retrieved URLs. The choice of a set matters: Python sets are optimised for fast membership testing, so checking whether a URL is already in the set is very quick, even with thousands of URLs.
The function then closes the database connection to free up resources, which is good database management practice.
Finally, it returns the set of scraped URLs. This set plays an important role in our scraping workflow: with quick access to everything scraped previously, we can easily check whether a newly encountered URL has already been processed, avoiding redundant scraping of the same pages and focusing our effort on new, unseen product pages.
Calling this function at the start of the scraping process ensures continuity across scraping sessions. The scraper will not waste time and resources on URLs it has already processed, which streamlines the overall operation and is considerate of the target website's resources. This is especially useful in large-scale scraping projects that run over long periods or across multiple sessions.
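Taken together, the three database helpers can be exercised like this; the product URLs below are placeholders, not real H&M pages:
# Quick end-to-end check of the database helpers (placeholder URLs).
initialize_database()

save_to_database([
    'https://www2.hm.com/en_in/productpage.0000000001.html',
    'https://www2.hm.com/en_in/productpage.0000000002.html',
])

already_seen = load_scraped_urls()
print(f"URLs stored so far: {len(already_seen)}")

# Saving the same URLs again is harmless: INSERT OR IGNORE skips the duplicates.
save_to_database(list(already_seen))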
Scraping Product URLs from a Category Page
# Scrape product URLs from a category page
async def scrape_product_urls(category_url, scraped_urls):
"""
Scrapes product URLs from a given category page of the H&M website.
Args:
category_url (str): The URL of the category page to scrape.
scraped_urls (set): A set of URLs that have already been scraped to avoid duplicates.
Returns:
list: A list of new product URLs found on the category page.
This function launches a browser using Playwright, navigates to the category page, and
iteratively scrapes product URLs. It handles pagination by clicking the "Load More" button until all product URLs are collected.
"""
async with async_playwright() as p:
# Launch a browser instance
browser = await p.chromium.launch(headless=False)
page = await browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36")
await page.goto(category_url, timeout=90000) # Navigate to the category page
product_urls = []
while True:
# Extract product URLs from the current page by locating product divs and anchor tags
product_divs = await page.query_selector_all('.image-container')
for div in product_divs:
link = await div.query_selector('a.item-link')
if link:
href = await link.get_attribute('href')
full_url = "https://www2.hm.com" + href
if full_url not in scraped_urls:
product_urls.append(full_url)
scraped_urls.add(full_url) # Keep track of scraped URLs
print(f"Scraped URL count: {len(scraped_urls)}")
# Attempt to click the "Load More" button to load additional products
try:
load_more_button = await page.query_selector('button.button.js-load-more')
if load_more_button:
await load_more_button.click(timeout=90000)
await page.wait_for_load_state('networkidle') # Wait for the network to idle
await asyncio.sleep(2) # Give some time for the new content to load
else:
break # No more pages to load, exit the loop
except Exception as e:
print(f"Error clicking 'Load More' button: {e}")
break # Stop if there's an error with the "Load More" button
await browser.close()
return product_urls
scrape_product_urls is an asynchronous function that scrapes the URLs of the products listed on an H&M category page. Behind the scenes it uses Playwright, a powerful browser automation library.
It takes two arguments: the URL of the category page to scrape and a set of already scraped URLs, used to avoid duplicates. The function returns a list of new product URLs found on the category page. At runtime it sets up a Playwright session and launches a Chromium browser, then opens a new page with a specific user agent string that emulates a real browser, which may help avoid detection by the website.
The function then navigates to the supplied category URL and enters a loop to scrape the product URLs. It iterates over the product divs on the page using CSS selectors and fetches the href attribute of the anchor tag inside each one. Full URLs are obtained by appending these paths to the base H&M URL. For each URL, the function checks whether it is already in the set of scraped URLs; if not, it appends the URL to the result list and adds it to the set.
It also handles pagination: it clicks the "Load More" button to load the next batch of products, repeating the process until there is no "Load More" button left or an error occurs while loading more products.
The function catches errors around the "Load More" click and waits for the page to finish loading the new content. It also prints the running count of scraped URLs, which is helpful for monitoring progress during long scraping sessions. Finally, it closes the browser and returns the list of newly scraped product URLs.
This piece of code demonstrates several useful web scraping techniques: handling dynamically loaded content, avoiding duplicate URLs, and simulating a real browser to reduce the chance of being detected.
Main Scraping Process
# Main function to manage the scraping process for multiple category pages
async def main():
"""
Main function to manage the scraping process across multiple category pages from the H&M website.
The function:
- Initialises the database to store product URLs.
- Loads previously scraped URLs to avoid duplicates.
- Iterates over multiple category URLs and scrapes new product URLs.
- Saves new URLs to the database after scraping each category.
"""
# List of category URLs to scrape
category_urls = [
'https://www2.hm.com/en_in/women/shop-by-product/tops.html',
'https://www2.hm.com/en_in/women/shop-by-product/dresses.html',
'https://www2.hm.com/en_in/women/shop-by-product/shirts-blouses.html',
'https://www2.hm.com/en_in/women/shop-by-product/jeans.html',
'https://www2.hm.com/en_in/women/shop-by-product/trousers.html',
'https://www2.hm.com/en_in/women/shop-by-product/swimwear.html',
'https://www2.hm.com/en_in/women/shop-by-product/skirts.html',
'https://www2.hm.com/en_in/women/shop-by-product/shorts.html',
'https://www2.hm.com/en_in/women/shop-by-product/basics.html',
'https://www2.hm.com/en_in/women/shop-by-product/merch-graphics.html',
'https://www2.hm.com/en_in/women/shop-by-product/nightwear.html',
'https://www2.hm.com/en_in/women/shop-by-product/lingerie.html',
'https://www2.hm.com/en_in/women/shop-by-product/blazers-waistcoats.html',
'https://www2.hm.com/en_in/women/shop-by-product/jumpsuits-playsuits.html',
'https://www2.hm.com/en_in/women/shop-by-product/loungewear.html',
'https://www2.hm.com/en_in/women/shop-by-product/knitwear.html',
'https://www2.hm.com/en_in/women/shop-by-product/hoodies-sweatshirts.html',
'https://www2.hm.com/en_in/women/shop-by-product/cardigans-jumpers.html',
'https://www2.hm.com/en_in/women/shop-by-product/jackets-coats.html',
'https://www2.hm.com/en_in/women/shop-by-product/sportswear.html',
'https://www2.hm.com/en_in/women/shop-by-product/socks-tights.html',
'https://www2.hm.com/en_in/women/shop-by-product/maternity-wear.html'
]
initialize_database() # Initialise the database (create table if not exists)
scraped_urls = load_scraped_urls() # Load previously scraped URLs to avoid duplicates
# Loop through each category and scrape product URLs
for category_url in category_urls:
print(f"Scraping category: {category_url}")
try:
# Scrape product URLs from the current category page
product_urls = await scrape_product_urls(category_url, scraped_urls)
save_to_database(product_urls) # Save the newly scraped URLs to the database
except Exception as e:
print(f"An error occurred: {e}")
break # Stop the loop if an error occurs
print(f"Total unique product URLs scraped: {len(scraped_urls)}")
if __name__ == "__main__":
asyncio.run(main())
This main function acts as the coordinator of our web scraping: it manages the whole process of extracting product URLs across the different category pages of H&M's website. Let's break down its functionality and purpose. At the top, the function defines the list of category URLs, covering the product categories in H&M's women's section and giving our scraping a wide scope; it also makes it easy to update or extend the targeted categories. The function then calls initialize_database() to set up the SQLite database, creating the product_urls table if it does not already exist so it is ready to store the scraped URLs.
Next it calls load_scraped_urls() to load all previously scraped URLs from the database. This keeps the scraper efficient across multiple sessions, as it avoids re-scraping URLs it has already processed. The main body of the function is a loop that goes through each category URL. For each category, it:
Prints a status message indicating which category is being scraped.
Calls scrape_product_urls to scrape product URLs from the current category page.
Saves the newly scraped URLs to the database via save_to_database.
This work is wrapped in a try-except block to catch any error that occurs during scraping. If there is an exception, it is printed to the console and the loop breaks so that the problem does not propagate further. Finally, the function prints the total number of unique product URLs scraped across all sessions. The script ends with the standard Python idiom that checks whether the script is being run directly (not imported as a module) and, if so, runs the main function with asyncio.run(). This also makes it easy to drop the script into other systems if needed.
This structure gives us a robust, scalable system that can handle large amounts of data across many categories. It is designed to be efficient (it avoids repeating work), error-resistant (with basic error handling), and informative (it reports progress), and it can easily be extended with more categories or adapted to changes in the website's structure.
SCRAPING PRODUCTS DATA
This is a Python script that scrapes detailed product information from H&M web pages, based on the product URLs scraped earlier. It uses asynchronous programming with asyncio and Playwright for efficient web automation, together with BeautifulSoup for HTML parsing. SQLite database functionality manages the scraped URLs and stores the detailed product information, ensuring data persistence and avoiding redundant scraping.
Other notable features of this scraper are user agent rotation to mimic various browsers, solid error handling, and comprehensive extraction of product information, including name, price, reviews, description, and care instructions. Its modular structure, with separate functions for each piece of data extraction and for the database operations, makes it easy to adapt to similar e-commerce scraping tasks. The script is an in-depth product data collector, well suited to large-scale e-commerce analysis projects.
Importing Libraries
import random
import asyncio
import sqlite3
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
The random module provides functions for generating random numbers and making random selections. In web scraping it can be used to introduce randomness in several ways: for instance, adding random delays between requests to emulate human behaviour or evade anti-scraping tools, or randomly choosing user agents or proxy servers when rotating between several of them to spread out requests.
asyncio, sqlite3, and async_playwright are imported for the same purposes as in the URL-scraping script: asynchronous execution, local data storage, and browser automation.
BeautifulSoup, imported from the bs4 package, is the HTML parsing library. After Playwright loads a page, we use BeautifulSoup to navigate the page's structure and retrieve specific elements; it is very good at finding items by tag, attribute, or content.
By combining these tools, we set up a powerful environment for a complex scraping task: asynchronous operations, dynamic web pages, HTML parsing, a touch of randomness for more organic-looking behaviour, and local database storage for the results. The setup is geared towards efficiency, scalability, and handling modern, JavaScript-heavy websites.
Loading User Agents
# Function to load user agents from a text file
def load_user_agents(file_path):
"""
Loads a list of user agents from a specified text file, with each line representing a different user agent string. This is useful for web scraping to rotate user agents and avoid potential blocking by the target website. User agents simulate different browsers or devices accessing the webpage.
Args:
file_path (str): Path to the text file containing user agent strings.
Returns:
list: A list where each element is a string representing a user agent.
"""
with open(file_path, 'r') as file:
return [line.strip() for line in file.readlines()]
The load_user_agents function reads user agents from a text file. It accepts one required string parameter, file_path, the path to the text file containing the list of user agents, with one user agent string per line. Inside the function, a with statement opens the file at file_path in read mode, ensuring the file is closed correctly even if an error occurs while reading. A list comprehension then reads each line of the file, strips leading and trailing whitespace with strip(), and places the resulting string into a list. The function returns this list of user agent strings, each one a different user agent that can be used during web scraping.
The point of using multiple user agents in web scraping is to make requests look as if they come from different browsers, or even different devices, accessing the site. That way the requests cannot easily be traced back to a single scraper, which helps avoid detection and eventual blocking by the target website.
Reading user agents from a file also makes it easy to update or customise the list without modifying the code, and it keeps the main script cleaner by separating the data (the user agent strings) from the logic.
In a larger scraping project, this function would typically be called near the beginning of the script to initialise a list of user agents. For each request, the program can then choose a random user agent from this list, making the scraping behaviour look more varied and improving its chances of collecting data without getting blocked.
# Function to select a random user agent
def get_random_user_agent(user_agents):
"""
Selects and returns a random user agent from a list of user agents.
This allows the scraper to mimic requests from different browsers or devices,
which can help avoid detection and blocking by the target website.
Args:
user_agents (list): A list of user agent strings from which to select.
Returns:
str: A randomly chosen user agent from the provided list.
"""
return random.choice(user_agents)
This function returns a random user agent from the list it is passed. It takes one parameter, user_agents, expected to be a list of user agent strings, typically the value returned by load_user_agents. Inside the function, random.choice() from Python's random module selects one item at random from the list, and the function returns that user agent string.
It is a simple wrapper whose purpose in web scraping is to provide a different user agent for each request. Choosing a random user agent varies the apparent identity of the scraper with every request, which has several benefits: it imitates how real users access the website from different browsers and devices.
It can also help avoid detection by the target website: if all requests used the same user agent, the website could more easily identify and block the scraper.
And it spreads the scraping activity across multiple simulated browser types, which is useful if particular content is served differently to different browsers.
This function is typically used together with load_user_agents: the developer loads the user agent list once, at the beginning of the script, and then calls get_random_user_agent before every request to obtain a fresh user agent string. This adds variability to the scraper's behaviour, making it better resemble a natural human user and improving its chances of passing unnoticed, or at least unobstructed.
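For reference, user_agents.txt is just a plain text file with one user agent string per line. The sketch below shows a hypothetical file and the load-then-pick pattern described above; the file contents and URLs are placeholders:
# Hypothetical contents of user_agents.txt (one user agent per line):
#
#   Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36
#   Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15

user_agents = load_user_agents('user_agents.txt')

for url in ['https://example.com/page-1', 'https://example.com/page-2']:
    user_agent = get_random_user_agent(user_agents)
    print(f"Would request {url} with: {user_agent}")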
Database Operations for URLs
# Function to load URLs from the product_urls table in the SQLite database
def load_urls_from_db(db_path):
"""
Connects to an SQLite database and retrieves a list of product URLs that have not yet been scraped, based on a 'scraped' flag set to 0 in the database. The 'scraped' field in the database table indicates whether a URL has been processed.
Args:
db_path (str): Path to the SQLite database file.
Returns:
list: A list of product URLs (as strings) that have not been scraped.
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Fetch all URLs where scraped = 0, meaning they haven't been processed yet
cursor.execute("SELECT url FROM product_urls WHERE scraped = 0")
urls = [row[0] for row in cursor.fetchall()]
conn.close()
return urls
This is one of the key functions in the web scraping pipeline. load_urls_from_db connects to the SQLite database and retrieves the list of product URLs still waiting to be scraped. It encapsulates the database interaction, giving the rest of the scraping system a clean abstraction to work with. When invoked, the function opens a connection to the SQLite database at db_path; making the path a parameter keeps the location flexible for different deployment scenarios or for testing. The function uses the sqlite3 module, a built-in Python library that provides a lightweight, disk-based database.
After connecting, the function creates a cursor object, the control structure used to traverse the records in the database and fetch the results of SQL queries. The heart of the function is the query SELECT url FROM product_urls WHERE scraped = 0, which fetches only the URLs that have not yet been processed by checking the scraped flag in the product_urls table: a value of 0 means the URL has not been scraped, while 1 (or any non-zero value) means it has. After running the query, a list comprehension converts the results into a list of URL strings, taking the first (and only) column of each returned row to produce a clean list of URLs ready to be processed. The function then closes the database connection, which is important for data integrity and for avoiding resource leaks, especially in long-running scraping operations. Finally, it returns the list of unscraped URLs, which other parts of the scraping system use to drive the actual web scraping.
load_urls_from_db is integral to the efficiency and reliability of the whole scraper: by retrieving only the URLs that have not yet been processed, the system avoids repeated scraping attempts and works through its load progressively and methodically. This matters particularly in large-scale web scraping, where managing the state of thousands of URLs is crucial both for performance and for complete data collection.
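One detail worth flagging: the product_urls table created by the first script defines only the id and url columns, while this query filters on a scraped flag. One way to reconcile the two scripts is a small one-off migration such as the sketch below; adding scraped INTEGER DEFAULT 0 directly to the original CREATE TABLE statement would work just as well.
import sqlite3

# One-off migration sketch: add the 'scraped' flag this script expects.
conn = sqlite3.connect('product_urls.db')
cursor = conn.cursor()
try:
    cursor.execute("ALTER TABLE product_urls ADD COLUMN scraped INTEGER DEFAULT 0")
except sqlite3.OperationalError:
    pass  # Column already exists; nothing to do.
conn.commit()
conn.close()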
# Function to mark a URL as scraped in the database
def mark_url_as_scraped(db_path, url):
"""
Marks a specific URL as 'scraped' by setting its 'scraped' flag to 1 in the database.
This ensures that the URL is not scraped again in the future.
Args:
db_path (str): Path to the SQLite database file.
url (str): The URL of the product page that has been scraped.
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Update the 'scraped' status of the URL to 1 (scraped)
cursor.execute("UPDATE product_urls SET scraped = 1 WHERE url = ?", (url,))
conn.commit()
conn.close()
The mark_url_as_scraped function is the part of the web scraping system that keeps track of which URLs have been processed: it records in the database that a particular URL has been successfully scraped. It accepts two parameters: db_path, the path to the SQLite database file, and url, the URL of the product page that was scraped. These parameters let the function reach the right database and update the right record.
It first connects to the SQLite database at db_path and creates a cursor object, the control structure for executing SQL commands. The core of the function is the SQL statement UPDATE product_urls SET scraped = 1 WHERE url = ?, which sets the scraped flag to 1 for the row whose URL matches. Using a parameterized query (with the ? placeholder) avoids SQL injection vulnerabilities and ensures that special characters in URLs are correctly escaped. After the update, the function calls conn.commit() to write the change to the database, which is crucial so that the update persists even if the program terminates unexpectedly. Finally, it closes the database connection to free up system resources. Opening a connection, doing the work, and closing it again for each URL does mean a lot of small database operations, but for a system handling thousands of URLs over several hours this pattern is simple and perfectly workable.
mark_url_as_scraped keeps track of which URLs have already been processed in the scraping workflow. Its importance lies in avoiding redundant scraping of the same URLs on subsequent runs, which improves efficiency and reduces unnecessary load on the target website. It also makes progress easy to track and allows the system to resume scraping from where it left off if it is interrupted.
Parsing Product Details
def get_product_name(soup):
"""
Extracts the product name from the parsed HTML of the product page.
This function specifically looks for a <div> element with a known class
where the product name is typically located.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
str: The name of the product, or 'N/A' if the product name cannot be found.
"""
product_name_div = soup.find('div', class_='ProductName-module--container__3Qbt1')
return product_name_div.find('h1').get_text(strip=True) if product_name_div else 'N/A'
The get_product_name function works on a BeautifulSoup object representing the HTML of a product page. Its job is to find and extract the product name: it looks for the particular div element that normally contains the name and, if it finds it, returns the text of the h1 tag inside it. If that div cannot be found, it returns 'N/A', which helps the scraper cope with pages that are structured differently and do not expose this piece of information. Keeping the logic for retrieving the product name in its own function keeps the scraping process organised and easier to maintain.
def get_prices(soup):
"""
Extracts the original price, discounted price (if any), and discount percentage from the product page. The function looks for specific HTML elements and classes where the pricing information is generally found. If a discount is available, it will return both the original and discounted price; otherwise, it will return only the original price.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
tuple: A tuple containing:
- original_price (str): The original price of the product, or 'N/A' if not found.
- discount_price (str): The discounted price of the product, or 'N/A' if no discount.
- discount_percentage (str): The discount percentage, or 'N/A' if no discount.
"""
container = soup.select_one('div.e26896')
discount_price, original_price, discount_percentage = 'N/A', 'N/A', 'N/A'
if container:
discount_p = container.select_one('p.edbe20.ac3d9e.bf9a4f')
if discount_p:
spans = container.find_all('span')
if len(spans) >= 2:
discount_price = spans[0].get_text(strip=True)
original_price = spans[1].get_text(strip=True)
discount_percentage = discount_p.get_text(strip=True)
else:
first_p = container.find('p')
first_span = container.find('span')
if first_p:
original_price = first_p.get_text(strip=True)
elif first_span:
original_price = first_span.get_text(strip=True)
return original_price, discount_price, discount_percentage
The get_prices function extracts pricing information from a product page. It takes a BeautifulSoup object holding the parsed HTML of the page and looks for the specific elements that usually carry the price. If a discount is present, it extracts the discounted price, the original price, and the discount percentage; otherwise it falls back to the original price alone. Any value it cannot find is returned as 'N/A'. The function returns three pieces of information: the original price, the discounted price (if any), and the discount percentage (if any). This flexibility lets the scraper extract price data correctly across the different pricing layouts a product page can have.
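To illustrate the two branches, here is a small test with hypothetical HTML fragments shaped like the markup the selectors expect (the class names come from the function above; the prices are made up):
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the two pricing layouts get_prices handles.
discounted_html = '''
<div class="e26896">
  <p class="edbe20 ac3d9e bf9a4f">-30%</p>
  <span>Rs. 699.00</span>
  <span>Rs. 999.00</span>
</div>
'''
full_price_html = '<div class="e26896"><p>Rs. 999.00</p></div>'

print(get_prices(BeautifulSoup(discounted_html, 'html.parser')))
# -> ('Rs. 999.00', 'Rs. 699.00', '-30%')
print(get_prices(BeautifulSoup(full_price_html, 'html.parser')))
# -> ('Rs. 999.00', 'N/A', 'N/A')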
def get_reviews(soup):
"""
Extracts the number of reviews from the product page, if available. The review count is typically located within a specific HTML element that contains the number of customer reviews for the product.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
str: The number of reviews as a string, or 'N/A' if no reviews are found.
"""
review_element = soup.select_one('hm-product-reviews-summary-w-c button.d1a171 span.d1cd7b.b475fe.e54fbe')
return review_element.get_text(strip=True) if review_element else 'N/A'
The get_reviews function extracts the number of reviews a product has. It takes a BeautifulSoup object containing the HTML of a product page and looks for the specific element in which the review count usually appears, returning that element's text. If it cannot find the review information, it returns 'N/A'. The review count is useful because it gives an idea of how popular a product is; extracting it allows the scraper to provide richer data about each product for comparison or analysis.
def get_description_fit(soup):
"""
Extracts the product description and fit information from the product page. This function searches for the specific section of the HTML where the product description is usually found, which often provides key details about the product's appearance, style, and fit.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
str: The product description, or 'N/A' if no description is available.
"""
description_div = soup.select_one("#js-product-accordion > div > div:nth-child(1)")
return description_div.get_text(strip=True) if description_div else 'N/A'
The get_description_fit function extracts the product description and fit information from a product page. It uses the BeautifulSoup object holding the parsed HTML of the page and searches for the section that usually describes what the product looks like, its style, and how it fits. If it finds this section, it returns all the text within it; otherwise it returns 'N/A'. Product descriptions carry a lot of information about an item, so extracting them lets the system provide a detailed description for each product, which is useful for building product catalogues or comparing items.
def get_material_and_fabric(soup):
"""
Extracts the material and fabric information from the product page. This typically includes details about the composition of the product, such as the type of fabric used, which is important for customer decisions on comfort and quality.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
str: A description of the material and fabric, or 'N/A' if not found.
"""
material_div = soup.select_one("#js-product-accordion > div > div:nth-child(2)")
return material_div.get_text(strip=True) if material_div else 'N/A'
The get_material_and_fabric function identifies what a product is made of. It works on a BeautifulSoup object containing the HTML of a product page and looks for the section where material and fabric information usually appears, such as the type of fabric used or the product's composition. If it finds this information, it extracts all the text from that section; if not, it returns 'N/A'. Material details matter for many products, especially clothing, so this lets the scraper collect data customers often care about, such as what a garment is made of.
def get_care_instructions(soup):
"""
Extracts the care instructions for the product from the product page. These instructions typically detail how to clean and maintain the product, such as washing and drying information, which is crucial for customers interested in product durability and ease of care.
Args:
soup (BeautifulSoup): The BeautifulSoup object created from the product page HTML.
Returns:
str: A string containing the care instructions, or 'N/A' if no instructions are available.
"""
care_ul = soup.select_one('div#section-careGuideAccordion ul.fe94e9')
return '\n'.join([li.get_text(strip=True) for li in care_ul.find_all('li')]) if care_ul else 'N/A'
The get_care_instructions function retrieves information about how to care for a product. It uses a BeautifulSoup object holding the HTML of the product's page and searches for the part where care instructions are most often found, in this case a bulleted list. If it finds such a list, it takes the text of each bullet point and joins them into a single string, with each instruction on its own line. If no care instructions are found, it returns 'N/A'. Care instructions tell customers how to clean and maintain a product, so extracting them adds another valuable detail to the scraped data.
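A quick check with a hypothetical care-guide fragment (the id and class come from the selector above; the instructions are invented) shows how the bullet points end up joined, one per line:
from bs4 import BeautifulSoup

# Hypothetical care guide markup matching the selector used in the function.
care_html = '''
<div id="section-careGuideAccordion">
  <ul class="fe94e9">
    <li>Machine wash at 40°C</li>
    <li>Do not tumble dry</li>
  </ul>
</div>
'''
print(get_care_instructions(BeautifulSoup(care_html, 'html.parser')))
# -> 'Machine wash at 40°C\nDo not tumble dry'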
Saving Scraped Data to Database
# Function to save scraped data to the SQLite database
def save_data_to_db(db_path, product_data):
"""
Saves the scraped product data into the SQLite database. If a product with the same URL already exists, it replaces the existing data. This ensures that each product URL is unique in the database, and allows for updates to the product details if the page is scraped again.
Args:
db_path (str): Path to the SQLite database file.
product_data (dict): Dictionary containing the scraped product data,
including fields like product name, price, reviews, etc.
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Create the product_data table if it doesn't exist
cursor.execute('''
CREATE TABLE IF NOT EXISTS product_data (
url TEXT PRIMARY KEY,
product_name TEXT,
original_price TEXT,
discount_price TEXT,
discount_percentage TEXT,
reviews TEXT,
description TEXT,
material_details TEXT,
care_instructions TEXT
)
''')
# Insert the scraped data into the product_data table, replacing if the URL already exists
cursor.execute('''
INSERT OR REPLACE INTO product_data (
url, product_name, original_price, discount_price, discount_percentage,
reviews, description, material_details, care_instructions
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (
product_data['url'],
product_data['product_name'],
product_data['original_price'],
product_data['discount_price'],
product_data['discount_percentage'],
product_data['reviews'],
product_data['description'],
product_data['material_details'],
product_data['care_instructions']
))
conn.commit()
conn.close()
The save_data_to_db function saves scraped product information to a SQLite database. It requires two arguments: the path for the database file and a dictionary with product information. This function begins with an attempt to open a SQLite database using the given path. This allows it to create a cursor object that is used to execute SQL commands. The first SQL command it executes creates the table product_data, if it doesn't exist. It's designed to hold information including URLs, product names, prices, reviews, and other characteristics.
After setting up the table, the function will start inserting scraped data into the database, using an "INSERT OR REPLACE" SQL command. Since this command will update existing information if the same product with the same URL is already stored in the database, it ensures the uniqueness of product URLs and makes it easy to update products in cases where they are scraped more than once.
The function saves all the changes to the database and closes down the connection in the final act. Thus, scraped data is saved on a permanent basis and resources of the database are freed.
The function follows several good practices. It uses parameterized queries, which prevent SQL injection attacks. The CREATE TABLE statement includes the IF NOT EXISTS clause, a defensive touch that avoids raising an error when the function is called repeatedly. And it manages database resources by committing the changes and closing the connection properly. Overall, this function stores and updates product details in SQLite effectively, which is especially useful for applications that track product changes over time or build comprehensive product catalogues from scraped data.
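For clarity, a call to save_data_to_db expects a dictionary with all nine keys; a hypothetical record might look like this (the values are invented, not scraped):
# Hypothetical record showing the dictionary shape save_data_to_db expects.
sample_product = {
    'url': 'https://www2.hm.com/en_in/productpage.0000000001.html',
    'product_name': 'Ribbed Cotton Top',
    'original_price': 'Rs. 999.00',
    'discount_price': 'N/A',
    'discount_percentage': 'N/A',
    'reviews': '(123)',
    'description': 'Fitted top in a ribbed cotton blend.',
    'material_details': 'Cotton 95%, Elastane 5%',
    'care_instructions': 'Machine wash at 40°C\nDo not tumble dry',
}
save_data_to_db('product_urls.db', sample_product)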
Asynchronous Web Scraping
# Asynchronous function to scrape a single URL
async def scrape_url(url, user_agent):
"""
Asynchronously scrapes product details from a given URL using a random user agent to avoid detection. It uses Playwright to open the webpage, waits for the page content to fully load, and then extracts relevant details
using BeautifulSoup, such as the product name, prices, reviews, and more.
Args:
url (str): The URL of the product page to scrape.
user_agent (str): A randomly selected user agent to simulate browser access.
Returns:
dict: A dictionary containing the scraped product details, including:
- url (str): The URL of the product.
- product_name (str): The name of the product.
- original_price (str): The original price of the product.
- discount_price (str): The discounted price (if any).
- discount_percentage (str): The discount percentage (if any).
- reviews (str): The number of customer reviews.
- description (str): The product description and fit.
- material_details (str): The material and fabric information.
- care_instructions (str): Instructions for caring for the product.
"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context(user_agent=user_agent)
page = await context.new_page()
# Visit the product page
await page.goto(url)
await page.wait_for_selector('body')
# Get the page content and parse it with BeautifulSoup
html_content = await page.content()
soup = BeautifulSoup(html_content, 'html.parser')
# Close the browser
await browser.close()
# Return the extracted product data
return {
'url': url,
'product_name': get_product_name(soup),
'original_price': get_prices(soup)[0],
'discount_price': get_prices(soup)[1],
'discount_percentage': get_prices(soup)[2],
'reviews': get_reviews(soup),
'description': get_description_fit(soup),
'material_details': get_material_and_fabric(soup),
'care_instructions': get_care_instructions(soup)
}
The scrape_url function is an asynchronous function designed to fetch product details from a given URL. It relies mainly on Playwright for browser automation and BeautifulSoup for parsing the HTML content. It takes two parameters: the URL of the product page to scrape and a user agent string; the user agent is selected at random elsewhere in the script to reduce the chance of detection by websites that might block scraping activity. When executed, the function launches a Chromium browser instance with Playwright, creates a new browser context with the given user agent, opens a new page, navigates to the specified URL, and waits for the body of the page to load.
Once the page has loaded, the function reads the full HTML content of the page and parses it with BeautifulSoup, the Python library used here for extracting data from HTML. It then closes the browser to free up system resources. Next, it extracts the various pieces of product data by calling the helper functions defined earlier: product name, prices, reviews, description, material details, and care instructions. All the collected data is assembled into a dictionary, which the function returns, with a key-value pair for each piece of information scraped from the product page.
Because the function is asynchronous, it can be used to scrape several pages concurrently, potentially speeding up the scraping process as a whole.
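One small refinement worth noting: the return block above calls get_prices three times on the same soup, re-running the same selectors. A hypothetical helper that builds the same dictionary with a single call could be dropped in instead:
def build_product_record(url, soup):
    """Assemble the product dictionary, calling get_prices only once."""
    original_price, discount_price, discount_percentage = get_prices(soup)
    return {
        'url': url,
        'product_name': get_product_name(soup),
        'original_price': original_price,
        'discount_price': discount_price,
        'discount_percentage': discount_percentage,
        'reviews': get_reviews(soup),
        'description': get_description_fit(soup),
        'material_details': get_material_and_fabric(soup),
        'care_instructions': get_care_instructions(soup),
    }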
Main Function and Running the Script
# Main asynchronous function
async def main():
"""
The main function to scrape product URLs and save the data into the SQLite database.
It performs the following tasks:
1. Loads a list of user agents from a specified file.
2. Fetches a list of unscraped product URLs from the SQLite database.
3. For each URL, a random user agent is selected, and the product details are scraped.
4. The scraped product data is saved into the database.
5. Each URL is marked as scraped in the database to avoid re-scraping.
This function manages the entire scraping workflow and ensures that data is collected efficiently.
"""
db_path = 'product_urls.db'
# Load user agents from a text file
user_agents = load_user_agents('path/to/user agents file/user_agents.txt')
# Load unscraped product URLs from the SQLite database
urls = load_urls_from_db(db_path)
for url in urls:
user_agent = get_random_user_agent(user_agents)
product_data = await scrape_url(url, user_agent)
# Save the scraped data to the database
save_data_to_db(db_path, product_data)
# Mark the URL as scraped in the database
mark_url_as_scraped(db_path, url)
if __name__ == "__main__":
# Run the main function using asyncio to handle asynchronous tasks
asyncio.run(main())
The main function is the central orchestrator of this web scraping script. It is implemented asynchronously, so it can work through numerous product URLs and store the extracted data. At the start of its execution, the function defines the path to the SQLite database used for storage, then loads a list of user agents from a file; these user agents are used to mimic different browsers so the scraper is less likely to be noticed by the target website. The function then gathers from the SQLite database the list of product URLs that have not yet been scraped, which means scraping can be resumed at any point if it is interrupted, picking up only the URLs that have not yet been processed.
The function loops through each URL in the list, picks a random user agent from the list loaded at the start, and calls the scrape_url function defined earlier to extract the product details from the page. The call is awaited because it is an asynchronous operation. Once fetched, the product data is saved to the SQLite database with save_data_to_db, and the URL is then marked as scraped in the database so it will not be processed again on subsequent runs.
The final block sets the script up to run when executed directly: if the script is not being imported as a module, asyncio.run() starts the asynchronous main function in the event loop.
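The loop above scrapes one product page at a time. If more throughput is needed, a concurrent variant is possible; the sketch below uses asyncio.gather with a semaphore to cap the number of simultaneous browsers (the limit of 3 is an arbitrary assumption, and higher concurrency means more load on the target site):
# Sketch of a concurrent variant of the main loop; not part of the original script.
async def scrape_all(urls, user_agents, db_path, max_concurrency=3):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            user_agent = get_random_user_agent(user_agents)
            product_data = await scrape_url(url, user_agent)
            save_data_to_db(db_path, product_data)
            mark_url_as_scraped(db_path, url)

    await asyncio.gather(*(worker(url) for url in urls))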
LIBRARIES AND VERSIONS
playwright 1.45.0
beautifulsoup4 4.12.2
Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.
FAQ SECTION
Do I need coding experience to scrape H&M product data using Python?
While coding knowledge helps in building and customizing your own scraping scripts, you don’t necessarily need it to scrape H&M data. Our web scraping services handle the technical aspects for you, offering customized data solutions without requiring you to write any code.
What kind of data can I extract from H&M's website?
With Python or our web scraping services, you can extract product details like name, price, sizes, colors, and stock availability. We can also scrape more specific information, such as customer reviews, product ratings, and promotional offers, depending on your needs.
Is web scraping H&M’s data legal?
Web scraping is legal if done responsibly and ethically. It's important to respect H&M’s terms of service and privacy policy. Our company follows best practices and legal guidelines to ensure compliance, helping you avoid any risks.
What if H&M updates their website? Will the scraping scripts still work?
Websites often change their structure, which can break scraping scripts. However, our team continuously monitors website changes and updates your scraping tools to ensure uninterrupted data flow, saving you time and technical headaches.