Web scraping is a powerful technique for automatically extracting large volumes of data from a website. It involves writing code that interacts with web pages, sends requests, parses the responses, and stores the results in a structured format. It is especially useful for websites that have no official API, or whose API exposes only a limited slice of the data.
ASOS is an online fashion retailer that offers a massive range of clothing, shoes, and accessories aimed at young shoppers. Its large, frequently updated catalogue is a rich source of data for analysis, which makes it a good candidate for web scraping projects. You can browse the catalogue yourself at asos.com.
In this documentation, I explain how to scrape ASOS women's clothing data in two steps: first, gathering product URLs from an ASOS API using the Requests library; second, extracting the actual product details, such as price, description, and availability, with a combination of Playwright and Beautiful Soup. The extracted data is then cleaned and preprocessed with OpenRefine and, in some cases, with the pandas library in Python, so that the final dataset is complete, accurate, and ready for analysis or further processing.
Web Scraping Tools for ASOS: A Comprehensive Review
Requests: The HTTP Library
Requests is one of the most widely used Python HTTP libraries; it makes sending HTTP/1.1 requests simple and intuitive. It supports all the common HTTP methods and adds response decompression, connection pooling for better performance, session persistence, and cookie management. Sessions are particularly useful when several requests need to share state, such as cookies. In web scraping, Requests is typically used to fetch static web pages or to interact with an API.
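As a minimal, generic illustration (separate from the ASOS scraper itself), fetching a JSON endpoint with custom headers looks like this; the URL and header values are placeholders:
import requests
# Fetch a JSON endpoint with custom headers (placeholder URL and values).
headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
response = requests.get("https://httpbin.org/json", headers=headers, timeout=10)
response.raise_for_status()   # Raise an exception for 4xx/5xx responses
print(response.json())        # Parse and print the JSON body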
Playwright: Browser Automation
Playwright, built by Microsoft, is a browser automation library that supports Chromium, Firefox, and WebKit. It can handle dynamic content and JavaScript-rendered pages that are difficult for simpler tools, which makes it particularly convenient for scraping complicated websites. It also supports interception of network requests, screenshots, and PDF generation. It is therefore the right choice for scraping tasks where heavy JavaScript means a full browser environment is needed to render the page correctly.
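For context, a minimal Playwright sketch (independent of the ASOS script later in this post) that launches a headless Chromium browser and returns a page's rendered HTML might look like this; the URL is a placeholder:
import asyncio
from playwright.async_api import async_playwright
async def fetch_rendered_html(url):
    # Launch a headless Chromium browser, load the page, return the rendered HTML.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html
# Example usage (placeholder URL):
# html = asyncio.run(fetch_rendered_html("https://example.com"))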
BeautifulSoup: The HTML Parser
BeautifulSoup is a Python library for parsing HTML and XML documents. It supports several underlying parsers and provides a simple, expressive way to navigate and search the parse tree. It can also modify the tree and copes well with poorly formatted HTML. In this project, BeautifulSoup reads the HTML returned for each product page and extracts the specific pieces of information we need.
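A small self-contained example of the typical BeautifulSoup workflow; the HTML snippet here is made up for illustration:
from bs4 import BeautifulSoup
html = '<div class="product"><h1 class="title">Floral Dress</h1><span class="price">£25.00</span></div>'
soup = BeautifulSoup(html, 'html.parser')
# find() returns the first matching element, or None if nothing matches
title_tag = soup.find('h1', class_='title')
price_tag = soup.find('span', class_='price')
print(title_tag.get_text(strip=True) if title_tag else 'N/A')  # Floral Dress
print(price_tag.get_text(strip=True) if price_tag else 'N/A')  # £25.00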
SQLite: The Database Engine
SQLite is a C library that implements a lightweight, disk-based database. It does not require a separate server process and is accessed through SQL (with a few SQLite-specific extensions). Because it is self-contained and serverless, no configuration is needed, yet it still supports standard relational features such as SQL syntax, transactions, and prepared statements. In web scraping projects, SQLite is widely used to keep the scraped data in a structured format locally, where it can be queried and manipulated easily.
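As a quick illustration of the workflow used later in this project, the sketch below creates a table, inserts a row with a prepared statement, and queries it back; the file and table names are placeholders:
import sqlite3
conn = sqlite3.connect('example.db')      # Creates the file if it doesn't exist
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)')
# Prepared statement with a parameter placeholder
cursor.execute('INSERT INTO products (name, price) VALUES (?, ?)', ('Floral Dress', 25.0))
conn.commit()
cursor.execute('SELECT name, price FROM products')
print(cursor.fetchall())                  # [('Floral Dress', 25.0)]
conn.close()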
Scraping Product URLs
This part of the project is a dedicated Python web scraper designed to collect product URLs from ASOS. It gathers links for women's clothing, category by category, across a wide range of apparel types. It talks to ASOS's product search API while closely emulating browser behaviour to ensure reliable retrieval of product data. The category list spans everything from tops and dresses to jeans, skirts, and swimwear, giving full coverage of the retailer's women's catalogue. Pagination lets the scraper handle the huge number of products, working systematically through every available item in a category. The core functionality is extracting product URLs from the API responses, handling the different JSON structures that may appear. All collected URLs are persisted in an SQLite database for further processing. The code is engineered to scale to thousands of product URLs, and basic error handling keeps execution smooth and the data intact throughout the scraping run. This makes it a solid base for larger data projects, with possible applications in price tracking, trend analysis, or inventory monitoring in the fast-fashion e-commerce sector.
Importing Libraries
import requests
import sqlite3
import pandas as pd
This imports the packages required for the ASOS scraper. The requests library performs the HTTP requests to the ASOS API. sqlite3 lets the script interact with the SQLite database where the scraped data will be stored. pandas is used for data manipulation and to convert the scraped data into a structure suitable for insertion into the database.
Header Generation Function
def get_headers():
"""
Returns a dictionary of HTTP headers required to make requests to the ASOS API.
These headers are used to mimic a real browser request and ensure access to the API.
Returns:
dict: HTTP headers including Accept, User-Agent, Cookie, and other required fields.
"""
return {
# Headers to simulate a browser request
"Accept": "application/json, text/plain, */*",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Asos-C-Name": "@asosteam/asos-web-product-listing-page",
"Asos-C-Plat": "web",
"Asos-C-Ver": "1.2.0-f14cd3cd-64",
"Asos-Cid": "b51cae20-5e6b-450e-a8ad-34c6aeb1e773",
# Add appropriate cookies and user-agent to avoid request blocks
"Cookie": "browseCountry=IN; browseCurrency=GBP; browseLanguage=en-GB; ...",
"Dnt": "1",
"Host": "www.asos.com",
"Referer": "https://www.asos.com/",
"Sec-Ch-Ua": '"Not.A/Brand";v="8", "Chromium";v="112", "Microsoft Edge";v="112"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Windows"',
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 Edg/112.0.0.0",
"X-Requested-With": "XMLHttpRequest"
}
This function is what lets the ASOS scraper pass as a real web browser; without it, requests could easily be detected and blocked. It returns a dictionary of HTTP headers that closely match what a genuine browser would send, including accepted content types, preferred language, and even the ASOS-specific identifiers that make the request look as believable as possible.
Among the most important entries is the User-Agent string, which tells the server that the request comes from a variant of Microsoft Edge running on Windows 10. Other notable headers include Accept-Encoding, which advertises support for several compression methods, and Referer, which suggests the request originated from within the ASOS website itself.
The Cookie header is especially significant because it carries session information and user preferences, including the tokens and identifiers that ASOS would set during a real browsing session. For a scraper, it keeps the session consistent across multiple requests, simulating a legitimate user.
By crafting these headers carefully, the scraper greatly improves its chances of going unnoticed and operating for long periods without hitting blocks or rate limits. This attention to imitating real browser behaviour is a large part of what lets the scraper get past ASOS's anti-bot measures.
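An alternative worth noting: instead of hard-coding a Cookie header, a requests.Session can carry cookies set by earlier responses across later requests. This is a hedged variation on the approach used here, not part of the script itself:
import requests
# Sketch of cookie persistence with requests.Session (a variation, not the script's code).
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json, text/plain, */*",
})
# Cookies set by earlier responses are kept on the session and sent
# automatically with later requests, mimicking one continuous visit.
session.get("https://www.asos.com/")
print(session.cookies.get_dict())   # Cookies collected so far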
Product URL Fetching Function
def fetch_product_urls(url):
"""
Fetches product URLs from the ASOS API.
Makes an API call to the provided URL and extracts the product URLs from the JSON response.
The function handles both list and dictionary formats in the JSON response.
Args:
url (str): The API URL to fetch product data.
Returns:
list: A list of product URLs.
"""
headers = get_headers()
response = requests.get(url, headers=headers)
response.raise_for_status() # Ensure the request was successful
json_data = response.json() # Get the JSON response
urls = []
# Check if the response is a list of items
if isinstance(json_data, list):
for item in json_data:
if 'url' in item:
urls.append("https://www.asos.com/" + item['url'])
elif 'products' in item:
for product in item['products']:
if 'url' in product:
urls.append("https://www.asos.com/" + product['url'])
# Handle dictionary format with 'products' key
elif isinstance(json_data, dict) and 'products' in json_data:
for product in json_data['products']:
if 'url' in product:
urls.append("https://www.asos.com/" + product['url'])
return urls
The `fetch_product_urls(url)` function interacts directly with the ASOS API to fetch product URLs. It takes a single URL pointing at a specific ASOS API endpoint, retrieves the HTTP headers from `get_headers()`, and issues a GET request to that URL with those headers.
After sending the request and confirming it succeeded via `raise_for_status()`, the function parses the JSON body of the response. It is written to handle both list-based and dictionary-based JSON structures, which matters because API responses can change over time, with or without notice, and supporting multiple formats makes the scraper more robust.
As it looks for product URLs in the response, it builds full URLs by prepending the ASOS domain. Nested loops let it locate 'url' keys or 'products' lists at different levels of the structure, so product URLs are captured regardless of where they appear.
The function collects every URL it finds into a list and returns it. That list of product URLs is the foundation for the second stage of extraction: it is passed on to the functions that scrape detailed product data from individual pages. In that sense, `fetch_product_urls(url)` acts as the bridge between the high-level scraping logic and the low-level details of the ASOS API.
Category URL Generation Function
def generate_urls_to_scrape():
"""
Generates a list of ASOS category API URLs to scrape product data.
The categories consist of women's clothing like tops, dresses, jeans, etc. Each category
is paginated, and the function returns URLs for different offsets to cover all available products.
Returns:
list: A list of URLs for various product categories and pagination.
"""
categories = [
(4169, 10543), # Women, Tops
(8799,10856), # Women, Dresses
(9263,1986), # Women, shorts
(2639,3236), # Women, skirts
(19632,4903), # Women, co-ord sets
(2238,4296), # Women, swimwear-beachwear
(51642,218), # Women, waistcoats
(11896,588), # Women, blazers
(15199,612), # Women, blouses
(2641,1604), # Women, coats-jackets
(11321,937), # Women, hoodies and sweatshirts
(15210,4705), # Women, designer
(3630,2458), # Women, jeans
(2637,1135), # Women, jumpers & cardigans
(7618,949), # Women, jumpsuits & playsuits
(6046,1824), # Women, lingerie and nightwear
(21867,829), # Women, loungewear
(15200,1224), # Women, shirts
(7657,235), # Women, socks & tights
(26091,1369), # Women, sportswear
(13632,999), # Women, suits & tailoring
(27953,390), # Women, tracksuits & joggers
(2640,4113) # Women, trousers & leggings
]
# Create URLs for each category and pagination
return [
[
f"https://www.asos.com/api/product/search/v2/categories/{category}?offset={offset}&includeNonPurchasableTypes=restocking&store=ROW&lang=en-GB¤cy=GBP&rowlength=2&channel=mobile-web&country=IN&keyStoreDataversion=q1av6t0-39&limit=72"
for offset in range(0, max_offset, 72) # Paginate by 72 items per page
]
for category, max_offset in categories
]
The `generate_urls_to_scrape` function is the strategic planner of the ASOS scraper. It produces an exhaustive list of API URLs covering every category in the women's clothing catalogue. It relies on a predefined list of category IDs, each paired with a maximum offset that represents the total number of items available in that category.
Using these category IDs and offsets, the function builds a paginated list of URLs for every category so that all products can be reached. Each generated URL carries a number of query parameters, including the store, language, currency, and the number of items to return per page. These parameters were chosen to match the requests a real user would generate while browsing categories on the ASOS website, which helps keep the scraper's disguise intact.
The function uses a nested list comprehension to generate the URLs efficiently. The result is a list of inner lists, where each inner list holds the paginated URLs for one product category, giving comprehensive coverage and an organized structure that is easy to process systematically.
The output of this function is effectively the plan the scraper will follow, ensuring nothing in the catalogue is missed. It also makes it easy to adjust the scope by simply editing the list of category tuples, whether to track changes in ASOS's catalogue structure or to focus on categories of particular interest.
Main Scraping Function
def scrape_all_urls():
"""
Scrapes all product URLs for different categories from ASOS.
Iterates over the generated category URLs, fetches the product URLs for each category and pagination,
and stores them in a list.
Returns:
list: A combined list of all product URLs scraped across categories.
"""
urls_to_scrape = generate_urls_to_scrape()
all_product_urls = []
# Loop through the list of category URLs
for urls in urls_to_scrape:
for url in urls:
product_urls = fetch_product_urls(url) # Fetch product URLs for each API call
all_product_urls.extend(product_urls) # Add to the combined list
print(f"Fetched {len(product_urls)} URLs from {url}")
return all_product_urls
The `scrape_all_urls()` function controls the entire URL-collection process. It first calls `generate_urls_to_scrape()` to obtain the list of category URLs, then loops over each category and its paginated URLs, methodically working through the whole ASOS catalogue. For every URL, it calls `fetch_product_urls()` to extract the product URLs returned by that API call.
As it processes each URL, the function aggregates the product URLs from all categories into a single list, `all_product_urls`. This aggregation simplifies further processing and keeps all scraped URLs together regardless of their source. After each API call it prints a status update showing how many URLs were fetched and from which URL, so the user can follow the progress.
The function uses nested loops to work through the addresses: the outer loop runs over categories and the inner loop over the pages within each category. This structure also makes it easy to add category-specific logic later, should different types of products need different processing rules.
This sequential approach may not be the fastest for very large catalogues, but it has clear benefits: it avoids flooding the ASOS servers with requests, which reduces the risk of being blocked for excessive activity, and it makes it easy to add rate limiting or pauses between requests if necessary, as shown in the sketch below. The clear sequential flow also simplifies debugging, and an interrupted run can be resumed from the point where it stopped.
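If you do want an explicit pause between API calls, a hedged variation of `scrape_all_urls()` could look like the sketch below; the function name and the 1-3 second delay range are choices of this illustration, not part of the original script:
import time
import random
def scrape_all_urls_throttled(delay_range=(1, 3)):
    # Variation of scrape_all_urls() that adds a polite pause between API calls.
    urls_to_scrape = generate_urls_to_scrape()
    all_product_urls = []
    for urls in urls_to_scrape:
        for url in urls:
            product_urls = fetch_product_urls(url)
            all_product_urls.extend(product_urls)
            print(f"Fetched {len(product_urls)} URLs from {url}")
            time.sleep(random.uniform(*delay_range))   # Wait 1-3 seconds by default
    return all_product_urls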
Database Storage Function
def save_to_database(urls):
"""
Saves the scraped product URLs to an SQLite database.
This function creates a table called 'urls' in a local SQLite database file and inserts all the scraped URLs into the table.
Args:
urls (list): A list of product URLs to save to the database.
"""
# Convert URLs list to a DataFrame for easier handling
df = pd.DataFrame(urls, columns=['url'])
conn = sqlite3.connect('asos_data.db') # Connect to (or create) the database
cursor = conn.cursor()
# Create the URLs table if it doesn't exist
cursor.execute('''
CREATE TABLE IF NOT EXISTS urls (
url TEXT NOT NULL
)
''')
# Insert each URL into the table
for url in df['url']:
cursor.execute('INSERT INTO urls (url) VALUES (?)', (url,))
conn.commit() # Commit the transaction
conn.close() # Close the connection
The `save_to_database(urls)` function is responsible for persisting the scraped URLs in a SQLite database. It takes the list of product URLs and first converts it into a pandas DataFrame. For a simple list of URLs the conversion is not strictly necessary, but it leaves room to store additional metadata about each URL later on.
The function then connects to a SQLite database file called 'asos_data.db', creating it if it does not already exist. SQLite was chosen because it is lightweight and portable: it needs no separate server process and stores the whole database in a single file, which suits small to medium datasets and makes sharing and backing up the data straightforward.
Next it creates a table called 'urls' if one is not already present, and a loop inserts every URL from the DataFrame into that table using SQL INSERT statements. Row-by-row inserts are not ideal for very large loads, but they are more than sufficient for the volumes this scraper produces.
Once all the URLs are inserted, the function commits the transaction so the changes are preserved, and closes the database connection. Storing the data in a database rather than a flat file gives better integrity, easier querying, and a natural integration point for other data-processing tools, so retrieval and analysis are easy in later stages.
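For larger batches, a hedged alternative to the row-by-row loop is SQLite's `executemany`, which performs all the inserts in a single call. The sketch below is a variation on `save_to_database()`, not the original code:
import sqlite3
def save_to_database_bulk(urls):
    # Variation of save_to_database() that inserts all URLs in one call.
    conn = sqlite3.connect('asos_data.db')
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT NOT NULL)')
    # executemany takes an iterable of parameter tuples, one per row
    cursor.executemany('INSERT INTO urls (url) VALUES (?)', [(u,) for u in urls])
    conn.commit()
    conn.close()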
Main Execution Function
def main():
"""
Main function to scrape ASOS product URLs and save them to a local database.
This function handles the scraping and saving process. It first scrapes all the product URLs
across the categories and then saves them to an SQLite database.
"""
all_product_urls = scrape_all_urls() # Scrape the URLs
save_to_database(all_product_urls) # Save them to a database
print("Data has been successfully scraped and saved to 'asos_data.db'")
The `main()` function is the entry point and overall coordinator of the ASOS scraping script. It wraps the high-level flow of the whole process, so everything can be run with a single function call, and it keeps the execution logic separate from the implementation details of scraping and data storage.
It first calls `scrape_all_urls()`, which starts the scraping process and extracts all product URLs across the different ASOS categories. Given the number of categories and products involved, this step can take a long time to run.
Once all the links have been fetched, `main()` passes the collection of URLs to `save_to_database()`, which stores them in the SQLite database. The function then prints a success message so the user knows the data has been scraped and saved to 'asos_data.db'. Although simple, it ties the whole scraping process together and provides a clear entry point for running the scraper.
The `main()` function can easily be expanded later to include further steps such as data validation, error handling, or additional data processing. Its role as the script's entry point makes it the natural place to grow process management and reporting capabilities over time, as sketched below.
Script Execution
if __name__ == "__main__":
main() # Run the main function
This conditional statement ensures that the main() function is only executed when the script is run directly, not when it's imported as a module. It's a common Python idiom that allows the script to be both importable and executable.
Scraping Product Data
This is a more advanced web-scraping Python script that collects detailed product pages from the ASOS website. Using asynchronous programming with Playwright, it gathers detailed information about each product: name, price, customer ratings and reviews, product characteristics, material composition, and care instructions. Considerable effort goes into working around the site's defences so that the data can be collected reliably.
The scraper combines several advanced features, including user agent rotation, realistic HTTP headers, and retry logic for resilient operation, and it stores everything it collects in an SQLite database for easy access and analysis. With its modular design and comprehensive error handling, the script is a reliable way to build a product database and follow price and fashion trends, making it a valuable tool for market research and inventory analysis in fashion e-commerce.
Importing Libraries
import asyncio
import sqlite3
import random
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
import re
These imports make available the Python libraries required for the scraping operation. The script relies on asyncio for asynchronous programming, which lets it overlap waiting on network and browser operations rather than blocking on each one. Playwright, imported through its async API, provides the browser automation: navigating web pages and handling dynamically rendered content. BeautifulSoup is used to parse the HTML and pull data out of the loaded pages.
sqlite3 handles the local database operations, storing the scraped data and tracking which URLs have already been processed. random adds variability to the scraper's behaviour, making it look more human and less likely to be flagged. The re module provides regular expressions for extracting specific text patterns from the scraped content.
User Agent Loading Function
def load_user_agents(file_path):
"""
Load a list of user agents from a file.
Args:
file_path (str): Path to the file containing user agents, one per line
Returns:
list: A list of user agent strings
Raises:
FileNotFoundError: If the specified file path doesn't exist
IOError: If there's an error reading the file
"""
with open(file_path, 'r') as file:
return [line.strip() for line in file.readlines()]
The user agent loading function is one of the essentials that lets the scraper present itself as different web browsers. It reads a text file containing a variety of user agent strings, the identifiers that tell a website what kind of browser and operating system is visiting. By rotating user agents, the scraper can mimic different devices and browsers, which makes its traffic look more natural and less like automated scraping.
The function itself is simple: it opens the specified file, reads every line, strips leading and trailing whitespace, and returns the result as a list. Later in the scraping process this list is used to pick a random user agent for each request. If the file cannot be found or read, the usual exceptions are raised so the main program can handle the error gracefully.
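For reference, user_agents.txt is assumed to contain one user agent string per line; loading the file and picking a random identity then looks like this (the example strings in the comments are illustrative):
import random
# user_agents.txt is assumed to hold one user agent string per line, e.g.
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ... Chrome/125.0.0.0 Safari/537.36
# Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0
user_agents = load_user_agents('user_agents.txt')
print(random.choice(user_agents))   # A different identity can be chosen per request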
Product Data Parsing Functions
def parse_product_name(soup):
"""
Extract the product name from the HTML soup object.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: The product name if found, 'N/A' otherwise
"""
product_name_tag = soup.find('h1', class_='jcdpl')
return product_name_tag.get_text(strip=True) if product_name_tag else 'N/A'
This function extracts the product name, or title, from an ASOS product page. It uses BeautifulSoup to locate an h1 tag with the class 'jcdpl', which is where ASOS product pages normally place the product name; since that structure is not guaranteed, the function is written to cope when it is missing.
If the product name tag exists, the function extracts its text content, strips any excess whitespace, and returns it. If the tag is not found, for example because the page layout changed or the page failed to load properly, the function returns 'N/A' instead of raising an error. This keeps the scrape from failing just because a piece of information is unavailable on a particular page.
def parse_price(soup):
"""
Extract price information from the HTML soup object.
This function handles different price formats:
1. Sale price with original price and discount
2. Regular price without discount
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
tuple: (current_price, original_price, discount_percentage)
- current_price (float): The current selling price
- original_price (float): The original price (same as current_price if no discount)
- discount_percentage (int): The discount percentage, 0 if no discount
"""
# Try to find price with sale information
price_span = soup.find('span', class_='ky6t2', attrs={'data-testid': 'price-screenreader-only-text'})
if price_span:
price_text = price_span.get_text(strip=True)
# Regular expression for sale price format: "Now £X.XX. Was £Y.YY. (-Z%)"
price_pattern = re.compile(r"Now £(?P<sale_price>\d+\.\d{2})\. Was £(?P<mrp>\d+\.\d{2})\. \((?P<discount>\-?\d+)%\)")
# Regular expression for simple price format: "£X.XX"
price_pattern_simple = re.compile(r"£(?P<price>\d+\.\d{2})")
# Try to match sale price format
match = price_pattern.search(price_text)
if match:
return float(match.group('sale_price')), float(match.group('mrp')), int(match.group('discount'))
# Try to match simple price format
match = price_pattern_simple.search(price_text)
if match:
price = float(match.group('price'))
return price, price, 0
# Try alternative price format
else:
price_span = soup.find('span', class_='current-price-text', attrs={'data-testid': 'current-price'})
if price_span:
mrp = price_span.get_text(strip=True)
return mrp, mrp, 0
return 0, 0, 0
Price parsing is one of the more complex parts of the scraper, because ASOS formats prices on product pages in several different ways. The function captures not only the current price but also the original price and the discount percentage, combining BeautifulSoup to find the relevant HTML elements with regular expressions to parse their text content.
The function first looks for the specific span element that contains the price information, then tries two regular expression patterns against its text. The first pattern matches the sale price format, which includes the current price, the original price, and a discount percentage. The second is simpler and matches a regular price on its own.
If either pattern matches, the function returns a tuple of current price, original price, and discount percentage. For a standard price with no discount, the current and original prices are the same and the discount is zero. If no price information is found at all, the function returns zeros for all three values, so a missing or unexpected price format does not stop the scraper.
The named capture groups in the regular expressions, such as (?P<sale_price>...), keep the code readable and maintainable by stating clearly which part of the price text each group is expected to match. Because it handles several formats and tolerates missing data, the function extracts price information from ASOS product pages robustly and reliably.
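As a quick check of how these patterns behave, here is what they extract from the two price formats described in the docstring; the example strings are invented for illustration:
import re
price_pattern = re.compile(r"Now £(?P<sale_price>\d+\.\d{2})\. Was £(?P<mrp>\d+\.\d{2})\. \((?P<discount>\-?\d+)%\)")
price_pattern_simple = re.compile(r"£(?P<price>\d+\.\d{2})")
m = price_pattern.search("Now £18.00. Was £30.00. (-40%)")
print(m.group('sale_price'), m.group('mrp'), m.group('discount'))   # 18.00 30.00 -40
m = price_pattern_simple.search("£24.50")
print(m.group('price'))                                             # 24.50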
def parse_rating_and_reviews(soup):
"""
Extract product rating and review count.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
tuple: (rating, review_count)
- rating (str or int): Product rating if found, 0 otherwise
- review_count (str or int): Number of reviews if found, 0 otherwise
"""
rating_div = soup.find('div', class_='I21qs', attrs={'data-testid': 'overall-rating'})
rating = rating_div.get_text(strip=True) if rating_div else 0
review_div = soup.find('div', {'data-testid': 'total-reviews'})
review = review_div.get_text(strip=True) if review_div else 0
return rating, review
This function retrieves two important pieces of customer feedback from ASOS product pages: the overall product rating and the total number of customer reviews. These details help in understanding how products are received and how satisfied customers are. The function uses BeautifulSoup to locate the specific HTML elements on the page that hold this information.
The rating and review count are extracted in the same way: each lookup targets a div element that ASOS marks with particular attributes. The 'data-testid' attribute is especially handy here because it tends to be far more stable than class names or other attributes, which often change when the site is updated. If the element is found, its text content is returned; otherwise the value defaults to zero.
The two values are returned together as a tuple, so the calling code can handle them as a pair. Returning default values of zero when information is missing prevents errors and lets the scraping process continue smoothly even when some products lack ratings or reviews.
def parse_color(soup):
"""
Extract the color information from the product page.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: The color of the product if found, 'N/A' otherwise
"""
color_div = soup.find('div', {'data-testid': 'productColour'})
if color_div:
p_tag = color_div.find('p', class_='aKxaq hEVA6')
if p_tag:
return p_tag.get_text(strip=True)
return 'N/A'
The colour parsing function pulls colour data from ASOS product pages. Colour is an important attribute for fashion items, so this information needs to be captured accurately. The function extracts it in two steps, first identifying the right container and then drilling into the nested HTML elements, the kind of structure web scraping often has to deal with.
In its implementation, the function first locates the div that holds the colour information, identified by its 'data-testid' attribute. Within that div it then searches for a paragraph tag carrying specific class names. This nested search ensures it picks up the correct colour value without clashing with other, similar-looking elements on the page.
The function is deliberately fail-safe: if either the outer div or the inner paragraph tag cannot be found, it returns 'N/A'. The point is that missing colour information, which can happen for reasons such as out-of-stock products or changes in the page structure, should not break the scraping run.
def parse_product_details(soup):
"""
Extract detailed product information including category, brand, details, and product code.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
tuple: (category, brand, details, product_code)
- category (str): Product category if found, 'N/A' otherwise
- brand (str): Product brand if found, 'N/A' otherwise
- details (str): Product details if found, 'N/A' otherwise
- product_code (str): Product code if found, 'N/A' otherwise
"""
div_element1 = soup.find('div', {'data-testid': 'productDescriptionDetails'},
style=lambda value: value and 'visibility: hidden' in value)
if div_element1:
# Find the index where brand information starts
by_index = div_element1.text.find(' by ')
if by_index == -1:
return 'N/A', 'N/A', 'N/A', 'N/A'
# Split text into before and after 'by'
before_by = div_element1.text[:by_index].strip()
after_by = div_element1.text[by_index + len(' by '):].strip()
# Extract category
category_tag = div_element1.find('a', recursive=False)
category = category_tag.get_text(strip=True) if category_tag and category_tag.get_text(strip=True) in before_by else before_by
# Extract brand
brand = None
after_by_text = after_by
ul_tag = div_element1.find('ul')
if ul_tag:
after_by_text = div_element1.text[by_index + len(' by '):div_element1.text.find(ul_tag.get_text(), by_index)].strip()
brand_tag = None
for a_tag in div_element1.find_all('a'):
if a_tag.previous_sibling and ' by ' in a_tag.previous_sibling:
brand_tag = a_tag
break
brand = brand_tag.get_text(strip=True) if brand_tag else after_by_text.strip()
# Extract product details
details_tags = div_element1.select('div.F_yfF > ul > li')
details = '\n'.join(li.get_text() for li in details_tags)
# Extract product code
product_code_tag = div_element1.find('p', class_='Jk9Oz')
product_code = product_code_tag.get_text().replace('Product Code: ', '') if product_code_tag else 'N/A'
return category, brand, details, product_code
return 'N/A', 'N/A', 'N/A', 'N/A'
Product details parsing is probably the most complex function in the scraper, pulling several pieces of information out of the product description section. It extracts four important data points: the category, the brand, the detailed description, and the product code. Its complexity comes from having to handle both varying text formats and nested HTML structures.
It begins by locating a specific div element that holds the product information. Interestingly, this div is hidden: its style contains 'visibility: hidden', so although it is invisible on the page, it still carries valuable content. Once the div is found, the function combines text manipulation with HTML parsing to pull out the individual pieces of information, scanning for characteristic phrases such as ' by ', which usually separates the category from the brand name.
The brand extraction is designed with fallbacks in case the primary method fails: it first looks for the brand inside an anchor tag, then falls back to analysing the surrounding text if no suitable tag is found. This keeps the extraction reliable even when the page structure varies slightly. The function also collects the list of product detail bullet points and strips the 'Product Code: ' prefix from the product code.
The function takes a fail-safe approach throughout. Individual extraction steps have fallbacks, and if the whole section cannot be parsed it returns 'N/A' for every field. This lets scraping continue even when some products have missing information or present it in an unexpected format.
def parse_size_and_fit(soup):
"""
Extract size and fit information from the product description.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Size and fit information if found, 'N/A' otherwise
"""
div_element2 = soup.find('div', {'data-testid': 'productDescriptionSizeAndFit'},
style=lambda value: value and 'visibility: hidden' in value)
if div_element2:
div_text = div_element2.find('div', class_='F_yfF')
return div_text.get_text(strip=True) if div_text else 'N/A'
return 'N/A'
The size and fit parsing function retrieves information about how an item fits and which sizes are available. This is crucial for online shoppers, who cannot try items on before buying. Like the other parsing functions, it looks for a specific hidden div element that usually contains the size and fit details.
As with the product details function, it searches for a div with a particular 'data-testid' attribute that is normally hidden on the page, then for an inner div with a specific class that holds the text of interest. If both are found, it extracts and returns the text content; otherwise it falls back to 'N/A', so a missing element does not jeopardise the rest of the scrape.
def parse_care_instructions(soup):
"""
Extract care instructions from the product description.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Care instructions if found, 'N/A' otherwise
"""
div_element3 = soup.find('div', {'data-testid': 'productDescriptionCareInfo'},
style=lambda value: value and 'visibility: hidden' in value)
if div_element3:
div_text2 = div_element3.find('div', class_='F_yfF')
return div_text2.get_text(strip=True) if div_text2 else 'N/A'
return 'N/A'
The care instructions parsing function extracts guidance on how to look after the garment, from washing and ironing recommendations to general maintenance advice. It follows the same pattern as the size and fit function, looking for a hidden div element with particular attributes.
Capturing care information reliably matters because it helps customers understand how to look after their purchases. The function applies the same robust error handling as the others, returning 'N/A' when the expected elements are not found, which helps keep the overall scrape stable.
def parse_material(soup):
"""
Extract material information from the product description.
This function looks for material information in various formats and attempts
to separate main material from additional material details.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
tuple: (main_material, additional_material)
- main_material (str): Primary material information or 'N/A' if not found
- additional_material (str): Secondary material information or 'N/A' if not found
"""
div_element4 = soup.find('div', {'data-testid': 'productDescriptionAboutMe'},
style=lambda value: value and 'visibility: hidden' in value)
if div_element4:
div_content = div_element4.find('div', class_='F_yfF')
if div_content:
div_text = div_content.text.strip()
# Check for various material section separators
for separator in ['Main:', 'Body:', 'Fabric:', 'Lining:']:
if separator in div_text:
parts = div_text.split(separator)
return parts[0].strip(), parts[1].strip()
# If no separator found, return all text as main material
return div_text, 'N/A'
return 'N/A', 'N/A'
The material parsing function extracts information about what the garment is made of. It is more involved than the previous parsers because material information can be presented in several forms: sometimes there is a primary material plus additional details, and sometimes only a single material is listed.
The function locates the relevant div element and then looks for the separators ASOS uses to divide the different kinds of material information, checking for 'Main:', 'Body:', 'Fabric:', and 'Lining:'. If one of these is present, it splits the text at that point and returns the two parts; if no separator is found, it returns the whole text as the main material and 'N/A' for the additional material.
Scraping Function
async def scrape_product_details(browser, product_url, user_agents):
"""
Scrape product details from a given URL using an async browser instance.
Args:
browser: Playwright browser instance
product_url: URL of the product to scrape
user_agents: List of user agent strings for rotation
Returns:
dict: Product details if successful, or dict with just URL if failed
The function attempts to scrape each URL up to 3 times, with random delays
between retries. It uses rotating user agents and custom headers to mimic
real browser behaviour.
"""
max_retries = 3
# Comprehensive headers to mimic real browser behaviour
headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Connection": "keep-alive",
"Content-Type": "text/plain",
"Origin": "https://www.asos.com",
"Referer": "https://www.asos.com/",
# Chrome and Chromium version strings - update as needed
"Sec-Ch-Ua": '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"Linux"',
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "cross-site",
"User-Agent": random.choice(user_agents),
# Note: Update this session ID or implement rotation if needed
"Cookie": "JSESSIONID=51726954faffe671"
}
for attempt in range(max_retries):
try:
# Create new page for each attempt to avoid state issues
page = await browser.new_page()
await page.set_extra_http_headers(headers)
# Extended timeout for slow-loading pages
await page.goto(product_url, timeout=90000)
await page.wait_for_load_state('load')
html = await page.content()
soup = BeautifulSoup(html, 'html.parser')
# Always close the page to free up resources
await page.close()
# Extract and return all product details
return {
'product_url': product_url,
'product_name': parse_product_name(soup),
# Note: parse_price returns a tuple of (sale_price, original_price, discount)
'mrp': parse_price(soup)[1],
'discount': parse_price(soup)[2],
'sale_price': parse_price(soup)[0],
# Similar tuple unpacking for ratings
'rating': parse_rating_and_reviews(soup)[0],
'no_of_reviews': parse_rating_and_reviews(soup)[1],
'color': parse_color(soup),
# product_details returns (category, brand, details, product_code)
'category': parse_product_details(soup)[0],
'brand': parse_product_details(soup)[1],
'details': parse_product_details(soup)[2],
'product_code': parse_product_details(soup)[3],
'size_and_fit': parse_size_and_fit(soup),
'care_instructions': parse_care_instructions(soup),
# material returns (material_type, material_composition)
'material_type': parse_material(soup)[0],
'material_composition': parse_material(soup)[1]
}
except Exception as e:
error_message = str(e)
print(f"Attempt {attempt + 1} failed to scrape {product_url}: {error_message}")
if attempt < max_retries - 1:
# Random delay between retries to avoid detection
sleep_time = random.uniform(1, 3)
print(f"Retrying after {sleep_time} seconds...")
await asyncio.sleep(sleep_time)
else:
# After all retries fail, return minimal dictionary with just the URL
return {'product_url': product_url}
The scrape_product_details function is where the real work of the scraping operation happens. It is built to handle the hardest part of the job: extracting data from a dynamic, client-rendered page while behaving like a human visitor. It takes a browser instance, a product URL, and a list of user agents as input, then assembles a set of headers that closely resemble a real browser's, including a randomly selected user agent to vary the request signature.
It also contains a retry mechanism, attempting to scrape each product page up to three times before giving up. This resilience is needed because of network instability and short-lived server errors. On each attempt it creates a fresh page instance, applies the crafted headers, and navigates to the target URL, so every attempt is free of state left over from previous ones.
Once the page has loaded, it applies the previously defined parsing functions to collect a wide range of product information, from general attributes such as the name and price to specifics like material composition and care instructions. Keeping a separate function for each data point makes the scraping logic easier to maintain and update when the website structure changes.
If the failures persist after all retry attempts, the function degrades gracefully and returns a minimal dictionary containing only the URL. This lets the main scraping loop keep running even when individual product pages are problematic. It also inserts random delays between retries, which helps the scraper blend in with normal user traffic.
Database Operations
def get_product_urls(db_path):
"""
Get list of unscraped URLs from the database.
Args:
db_path: Path to SQLite database
Returns:
list: Unscraped URLs where scraped=0
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Only get URLs that haven't been scraped yet
cursor.execute("SELECT url FROM urls WHERE scraped = 0")
urls = [row[0] for row in cursor.fetchall()]
conn.close()
return urls
The get_product_urls function is the link between the scraper and its data storage: it connects to the SQLite database we created earlier and retrieves the list of URLs that are ready for processing. Simple as it is, it plays an important role in managing the scraping queue, since it only returns URLs whose 'scraped' flag is 0, so the scraper always works on fresh, unprocessed entries.
def save_product_details(db_path, data):
"""
Save scraped product details to database.
Args:
db_path: Path to SQLite database
data: Dictionary containing product details
The function handles both successful and failed scrapes:
- Successful scrapes are saved to the 'data' table
- Failed scrapes are logged in 'error_urls' table
- URL status is updated in both cases
"""
# Early return if data is not in expected format
if isinstance(data, tuple):
print(f"Expected dictionary but got tuple: {data}")
return
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
try:
# Check if scrape was successful
if 'product_name' in data and data['product_name'] != 'Error':
# Insert successful scrape data
cursor.execute('''
INSERT INTO data (
product_url, product_name, mrp, discount, sale_price,
rating, no_of_reviews, color, category, brand, details,
product_code, size_and_fit, care_instructions,
material_type, material_composition
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (
data['product_url'], data['product_name'], data['mrp'],
data['discount'], data['sale_price'], data['rating'],
data['no_of_reviews'], data['color'], data['category'],
data['brand'], data['details'], data['product_code'],
data['size_and_fit'], data['care_instructions'],
data['material_type'], data['material_composition']
))
# Mark URL as successfully scraped (1)
cursor.execute("UPDATE urls SET scraped = 1 WHERE url = ?", (data['product_url'],))
print(f"Successfully scraped and marked as done: {data['product_url']}")
else:
# Log failed scrape
cursor.execute('''
INSERT INTO error_urls (product_url, error_message)
VALUES (?, ?)
''', (data['product_url'], 'Error occurred during scraping'))
# Mark URL as failed (-1) for potential retry later
cursor.execute("UPDATE urls SET scraped = -1 WHERE url = ?", (data['product_url'],))
print(f"Failed to scrape and marked for retry: {data['product_url']}")
conn.commit()
except Exception as e:
print(f"Error processing {data['product_url']}: {e}")
conn.rollback()
finally:
conn.close()
The save_product_details function takes care of the crucial task of persisting the scraped data. It handles both successful and failed scrapes, so we keep a complete record of the operation. For successful scrapes, it stores the full set of product information in the 'data' table, from the basics of name and price down to details such as material composition and care instructions.
When a scrape fails, the function does not simply discard the URL: it logs the failure in a separate 'error_urls' table, which makes problematic URLs easy to diagnose or retry later. In either case it updates the 'scraped' status in the 'urls' table, setting it to 1 for a successful scrape and -1 for a failure.
The function uses robust error handling for any problems that arise while interacting with the database. Transactions protect data integrity: each successful operation is committed, and anything that errors out is rolled back. This combination of error management and careful data handling means the scraper is not just good at gathering data but also reliable at keeping the dataset intact and complete.
Main Scraping Loop
async def main(user_agents_file, db_path):
"""
Main function to orchestrate the scraping process.
Args:
user_agents_file: Path to file containing user agents
db_path: Path to SQLite database
The function:
1. Loads user agents
2. Gets unscraped URLs
3. Launches browser and scrapes each URL
4. Saves results and updates URL status
5. Continues until all URLs are processed
"""
user_agents = load_user_agents(user_agents_file)
while True:
# Get batch of unscraped URLs
product_urls = get_product_urls(db_path)
if not product_urls:
print("All URLs have been processed. Exiting.")
break
async with async_playwright() as p:
# Launch browser in headless mode for faster operation
browser = await p.chromium.launch(headless=True)
for product_url in product_urls:
result = await scrape_product_details(browser, product_url, user_agents)
# Verify result format before saving
if not isinstance(result, dict):
print(f"Expected dictionary but got {type(result)}: {result}")
continue
save_product_details(db_path, result)
# Random delay between scrapes
await asyncio.sleep(random.uniform(1, 3))
await browser.close()
The main() function acts as the orchestrator, coordinating all the components. It starts by loading the list of user agents from a file, giving the scraper a range of browser identities to use for its requests. This helps make the scraper's behaviour look more human and less recognisable as automated traffic.
The heart of the function is a loop that runs until all available URLs have been processed. In each iteration it fetches the batch of URLs from the database that have not yet been scraped. This batching breaks the scrape into manageable chunks and makes it easy to resume cleanly if the script gets interrupted.
For each batch, the function launches a single browser instance with Playwright and then iterates over the URLs, calling scrape_product_details for each one. Because the per-page logic lives in that modular function, the loop itself stays clean and readable. The results of each scrape are saved to the database and the URL's status is updated as it goes.
A notable touch is the random delay between scrapes. This does two things: it reduces the load on the target server, and it makes the scraping pattern less predictable and therefore more human-like. That is an important consideration for ethical web scraping, since it lets us collect data without putting undue stress on the website's infrastructure.
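If you do want to exploit asyncio's concurrency more aggressively, a hedged variation of the inner loop could scrape a few pages in parallel while a semaphore caps the load; the helper name and concurrency limit below are illustrative, not part of the original script:
import asyncio
import random
async def scrape_batch(browser, product_urls, user_agents, db_path, max_concurrency=3):
    # Variation of the sequential loop: scrape a few pages at once,
    # capped by a semaphore so the site is not hit too hard.
    semaphore = asyncio.Semaphore(max_concurrency)
    async def scrape_one(url):
        async with semaphore:
            result = await scrape_product_details(browser, url, user_agents)
            if isinstance(result, dict):
                save_product_details(db_path, result)
            await asyncio.sleep(random.uniform(1, 3))   # Keep the polite delay
    await asyncio.gather(*(scrape_one(url) for url in product_urls))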
Script Initialization
if __name__ == "__main__":
user_agents_file = 'user_agents.txt'
db_path = 'asos_data.db'
# Initialize database schema
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Add 'scraped' column to urls table if it doesn't exist
cursor.execute("PRAGMA table_info(urls)")
columns = [column[1] for column in cursor.fetchall()]
if 'scraped' not in columns:
cursor.execute('ALTER TABLE urls ADD COLUMN scraped INTEGER DEFAULT 0')
# Create tables for product data and error logging
cursor.execute('''
CREATE TABLE IF NOT EXISTS data (
product_url TEXT,
product_name TEXT,
mrp NUMERIC,
discount NUMERIC,
sale_price NUMERIC,
rating NUMERIC,
no_of_reviews NUMERIC,
color TEXT,
category TEXT,
brand TEXT,
details TEXT,
product_code TEXT,
size_and_fit TEXT,
care_instructions TEXT,
material_type TEXT,
material_composition TEXT
)
''')
cursor.execute('''
CREATE TABLE IF NOT EXISTS error_urls (
product_url TEXT,
error_message TEXT
)
''')
conn.commit()
conn.close()
# Start the scraping process
asyncio.run(main(user_agents_file, db_path))
The script initialization section is where we lay the groundwork for the scraping operation. It runs only when the script is executed directly, and it performs all the required setup before the main scraping process starts. It first defines the paths to the user agents file and the SQLite database, which determine where the script finds its browser identities and where it stores the scraped data.
The section then initializes the database schema. It connects to the SQLite database, checks whether the required tables exist, and creates them if they do not: one table for product data and one for error logging. It also checks for a column named 'scraped' in the urls table and adds it if it is absent, which lets the script adapt as the database structure evolves over time.
With this setup, the database is ready to support every part of the scraping process: product details, the URL queue, and error tracking. The scraper can therefore collect data in an organized, trackable way.
Finally, the script calls the main() function, which kicks off the entire scraping process: loading the URLs, scraping the product pages, and saving the results. Keeping all the setup and initialization in this one section means the scraper starts every run with a clean, well-prepared environment, ready to collect and store data efficiently.
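Once a run completes, the scraped products can be pulled straight from SQLite into pandas for the cleaning and analysis mentioned at the start; a minimal sketch, assuming the 'data' table created above:
import sqlite3
import pandas as pd
conn = sqlite3.connect('asos_data.db')
df = pd.read_sql_query('SELECT * FROM data', conn)   # Load the scraped products
conn.close()
print(df.shape)                                              # (rows, columns) collected so far
print(df[['product_name', 'sale_price', 'discount']].head())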
Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.
FAQ SECTION
1. What is the purpose of web scraping ASOS data?
Web scraping ASOS data helps uncover trends in pricing, discounts, product diversity, and customer preferences. By analyzing this data, businesses can gain insights to optimize their inventory, refine marketing strategies, and better understand market dynamics.
2. Is it legal to scrape data from ASOS?
Web scraping is generally legal when done ethically and in compliance with ASOS's terms of service. It’s important to scrape publicly available data responsibly, avoiding any actions that might overload their servers or violate intellectual property laws.
3. What kind of insights can I expect from analyzing ASOS trends?
By analyzing ASOS data, you can uncover patterns in pricing, discount strategies, popular categories, and customer reviews. These insights can guide decisions related to pricing strategies, product development, and targeted marketing.
4. Do I need programming knowledge to scrape ASOS data?
While having programming knowledge, especially in Python, is beneficial, there are tools and services available that allow non-programmers to collect and analyze data. Alternatively, you can outsource data scraping to professional service providers for accurate and efficient results.