If you've ever wanted to analyze fashion trends or track product prices, you know what a huge pain it is to collect this data manually from e-commerce websites. Enter web scraping: basically, writing code that automatically browses websites and pulls out the data you need. Imagine having a super-fast assistant that visits thousands of web pages, copies the important information, and pastes it into a spreadsheet - except it does all this automatically.
This project centers on web scraping data from Net-a-Porter, one of the largest luxury fashion retailers on the web. Net-a-Porter sells products from hundreds of designer brands and updates its catalogue quite often. The website is well structured, which makes it great for web scraping, though you will have to deal with its anti-scraping measures carefully.
Workflow
URL Collection Phase
Our first task is to collect the product URLs. The scraper systematically goes through Net-a-Porter's category pages, collecting every product link it encounters. We don't blindly scrape HTML here; we're smart about it. Sometimes we use their API endpoints when they're available, and sometimes we parse the HTML directly. The code handles pagination automatically, so you don't have to worry about missing products on page 2, 3, or 200.
Data Extraction Phase
Once we've got our URL collection, we move into the meat of the scraping process. The system visits each product page and extracts a treasure trove of data: prices, brand names, descriptions, size availability, colour options, material composition - basically anything you might want to analyse later. We've built in retry logic and error handling because, let's face it, web scraping rarely goes perfectly the first time.
Technologies Under the Hood
Requests Library
Making HTTP requests is all about getting data from websites, and that's where the Requests library comes in. Think of it as your browser's engine, but in code form. Scraping Net-a-Porter means dealing with quite a lot of different situations - pages take too long to load, SSL errors pop up, and sometimes requests just fail for no good reason. Requests handles all this for us. We've configured it so that failed requests are automatically retried, and it manages cookies just as a real browser would. What's so cool about Requests is that you can customize pretty much everything about how you access a website. We use custom headers to make the scraper look like a real browser, and we maintain sessions so our cookies stay consistent, just as they would during normal shopping on the site.
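Here's a minimal sketch of that kind of setup - the retry counts, header values, and example URL are illustrative placeholders rather than the exact configuration from our scraper:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session keeps cookies between requests, just like a real browser
session = requests.Session()

# Retry failed requests a few times with a growing back-off delay
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

# Custom headers so the request looks like it comes from a normal browser
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

response = session.get("https://www.net-a-porter.com/en-us/shop/clothing/dresses", timeout=10)
print(response.status_code)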
Beautiful Soup 4
Now that Requests fetches a web page for us, we need to make sense of all that HTML. That's where Beautiful Soup comes in. Trying to parse HTML with regular expressions is a nightmare. Beautiful Soup makes it extremely easy: it reads the HTML and puts it into a structure you can navigate like folders on your computer. Want to find all the product prices on a page? Beautiful Soup can do that with a single line of code. It's especially useful for Net-a-Porter's site because their product pages have a lot of nested information - colour variations and size availability sit a long way down in the HTML. Beautiful Soup lets us drill down and grab exactly what we need without getting lost in the markup.
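To make that concrete, here's a toy example - the ProductPrice class name is made up for illustration; the real selectors on Net-a-Porter are much longer, as you'll see later:
from bs4 import BeautifulSoup

html = """
<div class="ProductItem">
  <span class="ProductPrice">$1,250</span>
  <span class="ProductPrice">$890</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# One line to grab every price on the page
prices = [span.text for span in soup.select("span.ProductPrice")]
print(prices)  # ['$1,250', '$890']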
Proxy Management System
When you're scraping at scale, you can't just use your own internet connection - websites will block you pretty quickly. That's why we created a proxy system that routes our requests through different IP addresses. Think of it as having many computers in different locations all accessing the site at the same time. Our system keeps a log of working proxies and automatically switches to a different one whenever a proxy becomes problematic. We alternate between residential proxies, which look just like ordinary home internet connections, and datacenter proxies, to make sure our traffic doesn't look suspicious. The system is smart enough to know which proxies are performing best and automatically removes any that become too slow or keep getting blocked.
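A stripped-down sketch of the rotation idea is shown below - the proxy addresses are placeholders, and the real system scores proxies on speed and block rate rather than dropping one on its first failure:
import random
import requests

# Placeholder pool - swap in real residential and datacenter proxies
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8001",
    "http://user:pass@proxy2.example.com:8001",
    "http://user:pass@proxy3.example.com:8001",
]

def fetch_with_rotation(url, pool):
    """Try random proxies until one works, dropping any that fail."""
    while pool:
        proxy = random.choice(pool)
        try:
            return requests.get(url, proxies={"https": proxy}, timeout=10)
        except requests.exceptions.RequestException:
            pool.remove(proxy)  # treat the failure as a block and move on
    raise RuntimeError("No working proxies left")

# Usage: fetch_with_rotation("https://www.net-a-porter.com", list(PROXY_POOL))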
SQLite Database
All this scraped data needs somewhere to live, and SQLite is a great place to put it. Unlike large database servers that require a lot of setup, SQLite just works right out of the box - it's basically a superpowered file that acts like a database. We've grouped the data into logical tables: one for basic product information, another for prices, another for sizes, and so on. This makes it really easy to answer questions like "show me all products that got a price drop in the last week" or "find all dresses available in size M". We've also added a few indexes to make common searches really fast, and we're using foreign keys to keep everything properly connected. And the best part is you don't need to run any special database software-your Python code can work with the SQLite file directly.
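As a rough illustration of the kind of query this structure makes easy - assuming a hypothetical prices history table with product_id, price, and scraped_at columns, which is one reasonable way to lay it out - the "price drop in the last week" question could look like this:
import sqlite3

conn = sqlite3.connect("net-a-porter-products.db")

# Compare each product's latest price to its earliest price from the past week
# (assumes a hypothetical 'prices' table: one row per product per scrape)
query = """
SELECT cur.product_id, old.price AS week_ago, cur.price AS now
FROM prices AS cur
JOIN prices AS old ON old.product_id = cur.product_id
WHERE cur.scraped_at = (SELECT MAX(scraped_at) FROM prices
                        WHERE product_id = cur.product_id)
  AND old.scraped_at = (SELECT MIN(scraped_at) FROM prices
                        WHERE product_id = cur.product_id
                          AND scraped_at >= datetime('now', '-7 days'))
  AND cur.price < old.price
"""
for product_id, week_ago, now in conn.execute(query):
    print(f"{product_id}: {week_ago} -> {now}")

conn.close()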
Data Cleaning and Processing
OpenRefine Processing
Raw scraped data is usually pretty messy, and that's where OpenRefine comes in handy. Think of OpenRefine as a super-powered spreadsheet that's really good at finding and fixing patterns in data. Say you've scraped thousands of products and the brand names are sometimes slightly different ("Saint Laurent" vs "Saint-Laurent") - with OpenRefine, you can easily find these inconsistencies and correct them in bulk. It's also great for normalising prices that come in different formats, standardizing colour names (so "Navy Blue" and "Dark Blue" become the same thing), or catching typos in product descriptions. What's really nice about OpenRefine is that you can see what you're doing while cleaning the data, and you can always undo changes if something goes wrong.
Python Data Cleaning Scripts
While OpenRefine is great for manual cleaning, sometimes you need to automate things. That's where our Python cleaning scripts come in. These scripts tackle things that would be a pain to do manually: converting all prices to the same currency, pulling material percentages out of product descriptions, or organising products into proper categories. We use pandas for most of this because it makes handling large datasets really easy. For example, if you need to standardize how sizes are written across different brands, pandas lets you apply the same rules to thousands of products at once. All the cleaned data gets saved back to our SQLite database, ready for whatever analysis you want to do.
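Here's a small sketch of what one of those cleaning scripts might look like - it assumes the data table from the extraction phase, the "wool" material is just an example, and the regexes are starting points rather than production rules:
import re
import sqlite3
import pandas as pd

conn = sqlite3.connect("net-a-porter-products.db")
df = pd.read_sql_query("SELECT url, price, details_and_care FROM data", conn)

# Turn price strings like "$1,250" into numbers we can actually compare
df["price_num"] = pd.to_numeric(
    df["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"
)

# Pull a material percentage like "70% wool" out of the care details
df["wool_pct"] = pd.to_numeric(
    df["details_and_care"].str.extract(r"(\d+)%\s*wool", flags=re.IGNORECASE, expand=False),
    errors="coerce",
)

# Save the cleaned columns back to SQLite for later analysis
df.to_sql("data_clean", conn, if_exists="replace", index=False)
conn.close()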
URL Collection Phase
Import Section
import requests
from bs4 import BeautifulSoup
import sqlite3
import random
import time
import urllib3
The import section brings in all the tools we need for web scraping. Requests and BeautifulSoup are our main web-scraping libraries: the former handles fetching the web pages, the latter helps parse them. sqlite3 lets us store our data in a database, while random and time help make our scraper behave more like a human by adding random delays. urllib3 helps us suppress SSL certificate warnings when we make secure connections through the proxy.
Variable Declarations
# Disable SSL warnings (use with caution)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Proxies
PROXIES = { "https": "http://dummy_proxy_username:dummy_proxy_password@proxy.example.com:8001" }
# Base URL
BASE_URL = "https://www.net-a-porter.com"
# Category URLs
CATEGORY_URLS = {
'Coats and Jackets': 'https://www.net-a-porter.com/en-in/shop/clothing/coats-and-jackets',
'Denim': 'https://www.net-a-porter.com/en-in/shop/clothing/denim',
'Dresses': 'https://www.net-a-porter.com/en-in/shop/clothing/dresses',
'Jeans': 'https://www.net-a-porter.com/en-in/shop/clothing/jeans',
'Jumpsuits and Playsuits': 'https://www.net-a-porter.com/en-in/shop/clothing/jumpsuits-and-playsuits',
'Knitwear': 'https://www.net-a-porter.com/en-in/shop/clothing/knitwear',
'Lingerie': 'https://www.net-a-porter.com/en-in/shop/clothing/lingerie',
'Loungewear': 'https://www.net-a-porter.com/en-in/shop/clothing/loungewear',
'Matching Separates': 'https://www.net-a-porter.com/en-in/shop/clothing/matching-separates',
'Shorts': 'https://www.net-a-porter.com/en-in/shop/clothing/shorts',
'Skirts': 'https://www.net-a-porter.com/en-in/shop/clothing/skirts',
'Skiwear': 'https://www.net-a-porter.com/en-in/shop/edit/skiwear',
'Sport': 'https://www.net-a-porter.com/en-in/shop/clothing/sport',
'Suits': 'https://www.net-a-porter.com/en-in/shop/clothing/suits',
'Swimwear and Beachwear': 'https://www.net-a-porter.com/en-in/shop/clothing/swimwear-and-beachwear',
'Tops': 'https://www.net-a-porter.com/en-in/shop/clothing/tops',
'Pants': 'https://www.net-a-porter.com/en-in/shop/clothing/pants'
}
DATABASE_NAME = 'net-a-porter-products.db'
In this subsection, we initialize all the constant values we'll use throughout the scraper. We've got our proxy configuration (which helps us avoid getting blocked), the main website URL we're going to scrape from, and a dictionary holding all the categories of clothing we want to scrape. We also specify the name of the database file where we'll store all the URLs we find. Disabling the SSL warning is a bit of a hack - we're telling our scraper to ignore SSL certificate issues, which is sometimes necessary when working through proxies but should be used with caution in production.
load_user_agents Function
def load_user_agents(file_path='user_agents.txt'):
"""
Load user agents from a file.
This function reads a text file containing user agent strings and returns them as a list.
Each line in the file is expected to contain one user agent string.
Args:
file_path (str): The path to the file containing user agent strings. Defaults to 'user_agents.txt'.
Returns:
list: A list of user agent strings.
Raises:
FileNotFoundError: If the specified file is not found.
IOError: If there's an error reading the file.
"""
try:
with open(file_path) as f:
return [line.strip() for line in f.readlines()]
except FileNotFoundError:
print(f"Error: User agent file '{file_path}' not found.")
return []
except IOError as e:
print(f"Error reading user agent file: {e}")
return []
This function is all about making our scraper look more legitimate by using different browser identities. Normally, when you visit a website, your browser tells the site what kind of browser and system you are using-this can be like "Chrome on Windows" or "Safari on iPhone." We store a bunch of these browser identities in a text file, and this function reads them all in.
The function is quite straightforward: it opens the file, reads it line by line, and builds a list in which each line becomes a user agent string. We use a 'with' statement here because it's the clean way to handle files in Python - it automatically closes the file even if something goes wrong. If we can't locate the file or reading it fails, the function won't raise an exception itself; it tells us what failed and returns an empty list instead.
Error handling is important here because we're reading an external file. If someone deletes the user agents file or puts it in the wrong place, we want to know about it rather than our scraper crashing mysteriously. The function checks for two specific problems: the file not existing (FileNotFoundError) and general problems reading the file (IOError).
The 'file_path' parameter has a default value of 'user_agents.txt'. This means that if you don't specify an alternative file path when calling the function, it looks for that filename in the same folder as your script. This makes the function convenient to use but still flexible when necessary.
init_db Function
def init_db(db_name):
"""
Initialise the SQLite database and create a table for storing URLs.
This function creates a new SQLite database connection, initialises a cursor,
and creates a table named 'urls' if it doesn't already exist. The table has
columns for id (primary key), category, and url.
Args:
db_name (str): The name of the SQLite database file to be created or connected to.
Returns:
tuple: A tuple containing the database connection object and cursor object.
Raises:
sqlite3.Error: If there's an error in creating the database or table.
"""
try:
conn = sqlite3.connect(db_name)
cur = conn.cursor()
cur.execute('''
CREATE TABLE IF NOT EXISTS urls (
id INTEGER PRIMARY KEY AUTOINCREMENT,
category TEXT,
url TEXT
)
''')
conn.commit()
return conn, cur
except sqlite3.Error as e:
print(f"Database error: {e}")
return None, None
The init_db function is how we establish the SQLite database where we'll store all the URLs we scrape. It's a bit like opening a spreadsheet file, except the file behaves like a full database. The function does two things: it creates a new file if one doesn't exist and makes sure we have a table ready to store our data.
Inside the function, we first create a connection to the database. If the database file doesn't exist yet, SQLite automatically creates it for us. Then we create a cursor, which is like our pointer for telling the database what to do. The cursor is what we'll use to execute SQL commands and fetch results.
The most important part is creating the table structure. We use SQL's CREATE TABLE IF NOT EXISTS command, which is a safe way to set up our table - it skips the action if the table is already there. Our table is pretty simple: it has an ID column that automatically numbers each entry, a category column for storing what type of product it is - like "Dresses" or "Knitwear" - and a URL column for the actual product link.
The function wraps everything in a try/except block because database operations can be finicky. It prints out what happened if anything goes wrong and then returns None values. This is essential because it lets the rest of our code know that something went wrong with the database setup, so we can handle that gracefully instead of crashing.
scrape_category Function
def scrape_category(category_name, url, user_agents, cur, conn):
"""
Scrape product URLs from a category page and handle pagination.
This function sends requests to the category URL, extracts product URLs from the page,
saves them to the database, and follows pagination links to scrape subsequent pages.
Args:
category_name (str): The name of the category being scraped.
url (str): The initial URL of the category page to scrape.
user_agents (list): A list of user agent strings to use for requests.
cur (sqlite3.Cursor): The database cursor object.
conn (sqlite3.Connection): The database connection object.
Raises:
requests.exceptions.RequestException: If there's an error in making the HTTP request.
sqlite3.Error: If there's an error in database operations.
"""
headers = {'User-Agent': random.choice(user_agents)}
while url:
try:
response = requests.get(url, headers=headers, proxies=PROXIES, timeout=10, verify=False)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Extract product URLs
products = soup.select('body > main > div > div.ProductListingPage0 > section > div.ProductListingPage0__layoutGridWrapper > div > div.ProductGrid53.ProductListWithLoadMore0__listingGrid > div.ProductList0__productItemContainer')
for product in products:
a_tag = product.select_one('a')
if a_tag:
product_url = a_tag.get('href')
if product_url:
full_product_url = BASE_URL + product_url if not product_url.startswith(BASE_URL) else product_url
cur.execute('INSERT INTO urls (category, url) VALUES (?, ?)', (category_name, full_product_url))
conn.commit()
# Find the next page URL
next_page_tag = soup.select_one('body > main > div > div.ProductListingPage0 > section > div.ProductListingPage0__layoutGridWrapper > div > div.Pagination7.ProductListingPage0__PaginationMarginExperiment > div > div:nth-child(3) > a.Pagination7__next.ProductListingPage0__PaginationMarginExperiment')
if next_page_tag:
next_page_url = next_page_tag.get('href')
if next_page_url:
url = BASE_URL + next_page_url
time.sleep(random.uniform(1, 3))
else:
break
else:
break
except requests.exceptions.RequestException as e:
print(f"Error scraping {url}: {e}")
break
except sqlite3.Error as e:
print(f"Database error while scraping {category_name}: {e}")
break
This is where the actual work takes place. The scrape_category function retrieves all product URLs from a single category page as well as any subsequent pages. It is the equivalent of having a robot click through all the pages in a category and write down all of the product links it finds.
The function first picks a random user agent from our list - this helps each request look unique, kind of like a different visitor browsing the site. It then enters a loop that keeps going as long as there's a URL left to process, requesting each page with our proxy settings and that user agent. We avoid hanging on slow pages by using a 10-second timeout, and we disable SSL verification since we're routing through proxies.
Once we have the content of the page, we use BeautifulSoup to extract the links to all products. The CSS selector we use to look up each product ('body > main > div.') is like following directions to find each product on the page. For every product found, we get its link and store it in our database. We're also careful to handle relative URLs - those that begin with just '/' - and only prepend the base URL when that's appropriate.
The pagination handling is the clever part: after processing every product on a page, we look for a "next page" link. If there is one, we update our URL and go around the loop again to process the next page. Between pages we add a random delay of 1-3 seconds to avoid hammering the server. If anything goes wrong (network issues, database problems, etc.), we catch the error, print what happened, and break out of the loop gracefully.
scrape_all_categories Function
def scrape_all_categories(category_urls, user_agents, cur, conn):
"""
Loop through each category and scrape data.
This function iterates over the provided category URLs, calling the scrape_category
function for each category. It introduces a delay between category scrapes to avoid
rate limiting.
Args:
category_urls (dict): A dictionary mapping category names to their URLs.
user_agents (list): A list of user agent strings to use for requests.
cur (sqlite3.Cursor): The database cursor object.
conn (sqlite3.Connection): The database connection object.
"""
for category, url in category_urls.items():
print(f"Scraping category: {category}")
scrape_category(category, url, user_agents, cur, conn)
time.sleep(random.uniform(1, 3))
Think of this function as the manager overseeing the whole scraping operation. It takes our dictionary of category URLs and methodically works through each one, making sure we scrape every category completely. This is a simple function, but it is important in organizing the scraping process.
The function loops through the dictionary of category URLs, where each entry is a category name (such as "Dresses") and its corresponding URL. For each category, it prints out which category it's about to scrape - useful for following progress while the scraper is running - and then calls our scrape_category function to do the actual scraping work.
Between categories, the function introduces a random delay of 1 to 3 seconds. This is important because our crawling pattern will look much more natural. Real users do not instantly jump from one category to another. It also prevents us from overwhelming the server with too many requests too quickly.
There's minimal error handling at this level because most of it happens inside scrape_category. That's good design - we handle errors where they occur and keep the high-level organisation of our code clear and uncomplicated.
main Function
def main():
"""
Main function to run the scraper.
This function orchestrates the entire scraping process:
1. Loads user agents from a file.
2. Initialises the database connection.
3. Calls the function to scrape all categories.
4. Closes the database connection upon completion.
If any step fails, it prints an error message and exits gracefully.
"""
try:
# Load user agents from file
user_agents = load_user_agents()
if not user_agents:
print("Error: No user agents loaded. Exiting.")
return
# Initialise the database
conn, cur = init_db(DATABASE_NAME)
if not conn or not cur:
print("Error: Failed to initialise database. Exiting.")
return
# Scrape all categories
scrape_all_categories(CATEGORY_URLS, user_agents, cur, conn)
# Close the database connection
conn.close()
print("Scraping completed and data saved to the database.")
except Exception as e:
print(f"An unexpected error occurred: {e}")
finally:
if 'conn' in locals() and conn:
conn.close()
The main function is like a checklist for running our scraper. It organises all the steps we need to take and makes sure they happen in the right order. First, it loads our user agents by calling load_user_agents(). If that fails (like if the file is missing), it prints an error and exits early: there is no point trying to scrape if we can't make our requests look legitimate.
Next, it tries to initialize the database by calling init_db(). Again, if this isn't successful, we exit early because we need a place to save our scraped URLs. We're applying the principle of failing fast: we want to know quickly if something critical isn't working so we can fix it, rather than discovering it halfway through scraping.
The actual scraping takes place by calling scrape_all_categories() with all our setup data - our category URLs, user agents, and database connection. This is wrapped in a try/except block to catch any unexpected errors that may happen during scraping. Even if something goes wrong, our finally block makes sure we always close our database connection properly.
The pattern in main() is fairly commonplace in Python: set up resources, use them, clean them up, all with graceful error handling. If an error occurs anywhere during the run, we can rest assured that our resources get cleaned up (like closing the database connection) before we exit. That way we avoid resource leaks, and our system stays clean.
Script Entry Point
# Run the main function if this script is executed directly
if __name__ == '__main__':
main()
This is just Python's way of saying "only run the main() function if someone is running this script directly" (as opposed to importing it as a module). It's a standard Python pattern that keeps our code organised and prevents it from running when we don't want it to.
Data Extraction Phase
Imports Section
import asyncio
import random
import requests
from bs4 import BeautifulSoup
import sqlite3
import logging
import time
import urllib3
The import section brings in the libraries the application uses during scraping: asyncio for asynchronous execution, requests and BeautifulSoup for fetching and parsing HTML pages, sqlite3 for database operations, logging for tracking progress and errors, random and time for human-like delays, and urllib3 for handling HTTP-related warnings.
Variable Declarations
# Suppress only the single InsecureRequestWarning from urllib3 needed to disable the warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Configure logging to output timestamps, log levels, and messages
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Constants
USER_AGENTS_FILE = 'user_agents.txt'
DB_FILE = 'net-a-porter-products.db'
RETRY_ATTEMPTS = 1
# Proxies configuration
proxies = { "https": "http://dummy_proxy_username:dummy_proxy_password@proxy.example.com:8001" }
# Common headers template for requests
headers_template = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br, zstd",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
}
# Cookies to be used in requests
cookies = [
{"name": "_gcl_au", "value": "1.1.550599228.1720606510", "domain": ".net-a-porter.com", "path": "/"},
{"name": "_bamls_usid", "value": "0483e18a-fc07-42b6-8963-8436dbcf23fa", "domain": ".net-a-porter.com", "path": "/"},
{"name": "rmStore", "value": "dmid:9361", "domain": ".net-a-porter.com", "path": "/"},
]
The variable declarations set up the primary configuration the scraper relies on: the paths to the user-agent file and the database, the number of retry attempts, the proxy settings, and the HTTP headers and cookies that help the scraper authenticate with and interact with the target website. Everything here is configured specifically for scraping Net-a-Porter, with headers and cookies that imitate real browser behaviour.
Parsing Functions
def parse_product_name(soup):
"""
Extract the product name from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Product name or 'N/A' if not found
Note:
Uses CSS selector to locate the product name element in the specific HTML structure
"""
title_tag = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88__basicInfo.ProductDetails88__basicInfo--stickyCta.ProductDetails88__basicInfo--sustainabilityModifier > div.ProductInformation88.ProductDetails88__productInformation > p.ProductInformation88__name.ProductInformation88__name--stickyCta')
return title_tag.string.strip() if title_tag else 'N/A'
The parse_product_name function uses a specific CSS selector path - 'body > main > div > div:nth-child(2)...' - to navigate through several nested HTML elements and find the product name. Think of it as following a map to the exact location of the product name inside the page structure. Once it reaches this element, it extracts the text content, removes extra whitespace with strip(), and returns it. If the element can't be found, it returns 'N/A' so that a missing field doesn't derail the rest of the data collection.
def parse_stock_type(soup):
"""
Extract the stock availability status from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Stock status or 'N/A' if not found
"""
stock_div = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88__basicInfo.ProductDetails88__basicInfo--stickyCta.ProductDetails88__basicInfo--sustainabilityModifier > div.SingleBadge3__badge.TransitionSingleBadge3__transitionBadge.TransitionSingleBadge3__transitionBadge--stickyCta')
return stock_div.text.strip() if stock_div else 'N/A'
The parse_stock_type function looks for product availability information by targeting the div element that contains the stock status. It uses a fairly complex CSS selector to reach the SingleBadge3__badge class within the product details grid and picks out text such as "In Stock", "Out of Stock", or "Low Stock". Once it finds the element, it cleans up the text by removing extra whitespace and returns the status. If no stock information is found, it safely returns 'N/A'.
def parse_brand(soup):
"""
Extract the brand name from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Brand name or 'N/A' if not found
"""
brand_tag = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88__basicInfo.ProductDetails88__basicInfo--stickyCta.ProductDetails88__basicInfo--sustainabilityModifier > div.ProductInformation88.ProductDetails88__productInformation > a > h1 > span')
    return brand_tag.get_text(strip=True) if brand_tag else 'N/A'
The parse_brand function pulls the brand name out of the heading element in the product details section, where the brand usually sits at the top of a product page. Because the brand is wrapped in several layers of nested elements, the selector drills down through the link and heading to the span that holds the name; the function then cleans the text up and returns it, or 'N/A' if the element isn't found.
def parse_price(soup):
"""
Extract the current price from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Current price or 'N/A' if not found
"""
price_span = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88__basicInfo.ProductDetails88__basicInfo--stickyCta.ProductDetails88__basicInfo--sustainabilityModifier > div.SingleBadge3__badge.TransitionSingleBadge3__transitionBadge.TransitionSingleBadge3__transitionBadge--stickyCta')
return price_span.text.strip() if price_span else 'N/A'
The parse_price function searches for the price by inspecting elements with classes that contain pricing information. It was written with the specifics of Net-a-Porter in mind-to identify the SingleBadge3__badge class, which contains the current price the product is selling at. It grabs this value, removes any extra blank spaces and formatting, and returns this price as a clean string-or returns 'N/A' if it can't find price-related data.
def parse_discount(soup):
"""
Extract any discount information from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Discount amount or 'N/A' if not found
"""
    discount_span = soup.select_one('span.PriceWithSchema9__discount')
return discount_span.text.strip() if discount_span else 'N/A'
The parse_discount function explicitly searches for any discount information by searching for elements with the PriceWithSchema9__discount class. This class is used by Net-a-Porter to show how much the product's price has been reduced. The function pulls this percentage or amount off, cleans it up, and returns it, making it easy to track price reductions and sales information across products. If no discount is present, it returns 'N/A'.
def parse_sale_price(soup):
"""
Extract the original price before sale from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Original price or 'N/A' if not found
"""
    sale_price_span = soup.select_one('span.PriceWithSchema9__wasPrice')
return sale_price_span.text.strip() if sale_price_span else 'N/A'
The parse_sale_price function focuses on finding the original price before any discounts by looking up elements of class PriceWithSchema9__wasPrice. This is significant in the understanding of pricing history and discount computation. It extracts this original price that is usually crossed out on the website, removes any trailing formatting, and returns it as a string, which helps in keeping track of how the price has changed over time.
def parse_color(soup):
"""
Extract the product color from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Color name or 'N/A' if not found
"""
color_tag = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88 > div.ProductDetailsColours88.ProductDetails88__colours > p > span')
return color_tag.string.strip() if color_tag else 'N/A'
The parse_color function goes down through the product details section to retrieve color information found in a given paragraph. Using a very exact CSS selector, it picks the span that contains the colour name from the ProductDetailsColours88 section. Then, it retrieves the text, strips off extra spaces or formatting, and returns the cleaned color name to provide essential variant information for the product.
def parse_description(soup):
"""
Extract the product description from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Product description or 'N/A' if not found
"""
description_div = soup.select_one('#EDITORS_NOTES > div.AccordionSection3__content.AccordionSection3__content--pdpAccordion.content > div > div > div > p')
return description_div.text.strip() if description_div else 'N/A'
The parse_description function navigates to the EDITORS_NOTES section of the page to capture the detailed product description. It looks inside the particular div elements that hold the marketing copy and product details, grabs all the text content in this section, and returns it as a clean string.
def parse_size_and_fit(soup):
"""
Extract size and fit information from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Size and fit details or 'N/A' if not found
"""
fit_details = soup.select_one('#SIZE_AND_FIT > div.AccordionSection3__content.AccordionSection3__content--openAnimation.AccordionSection3__content--pdpAccordion.content > div > div > div')
return fit_details.text.strip() if fit_details else 'N/A'
The parse_size_and_fit function targets the sizing and fit section of the product page's accordion. It searches the div elements that hold sizing information and extracts details about how the product fits, the recommended size, and the model's measurements, returning them as a clean text string that makes the product's fit characteristics easy to understand.
def parse_details_and_care(soup):
"""
Extract product care instructions and details from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Care instructions and details or 'N/A' if not found
"""
care_details = soup.select_one('#DETAILS_AND_CARE > div.AccordionSection3__content.AccordionSection3__content--openAnimation.AccordionSection3__content--pdpAccordion.content > div > div > div')
return care_details.text.strip() if care_details else 'N/A'
The parse_details_and_care function focuses on the DETAILS_AND_CARE accordion and scrapes important product information such as what the material is, how it should be cared for, and other specification details. It searches the nested div elements to find the complete care and specification text, then compiles it into a single clean string for all-around product care information.
def parse_product_code(soup):
"""
Extract the product code/SKU from the HTML content.
Args:
soup (BeautifulSoup): Parsed HTML content
Returns:
str: Product code or 'N/A' if not found
"""
product_code_div = soup.select_one('body > main > div > div:nth-child(2) > div > div.ProductDetailsPage88__productDetailsGrid.ProductDetailsPage88__productDetailsGrid--stickyCta > div.ProductDetails88 > div.PartNumber88.ProductDetails88__partNumber > span')
return product_code_div.text.strip() if product_code_div else 'N/A'
The function "parse_product_code " looks into the elements of Class PartNumber88 for a unique identifier of the product, which is very essential to inventory counting and identification of specific products at the database level. The function gets the exact span element that contains this code, gets rid of any extra formatting or spaces surrounding it and returns it as a string so that each product could uniquely be identified in the system.
fetch_html Function
def fetch_html(url, headers, cookies):
"""
Fetch HTML content from a given URL using specified headers and cookies.
Args:
url (str): The URL to fetch the HTML content from.
headers (dict): HTTP headers to be used in the request.
cookies (list): List of cookies to be set for the request.
Returns:
str or None: The HTML content if the request is successful, None otherwise.
"""
session = requests.Session()
for cookie in cookies:
session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'], path=cookie['path'])
try:
response = session.get(url, headers=headers, proxies=proxies, verify=False)
if response.status_code == 200:
return response.text
else:
logging.error(f"Failed to load page content for: {url} with status code: {response.status_code}")
return None
except requests.exceptions.RequestException as e:
logging.error(f"Request failed for URL {url} with exception: {e}")
return None
The fetch_html function is like the request handler of a web browser. First, it creates a new session object, which works a bit like opening a fresh browser window. Then it loads all the cookies into this session - think of that as telling the browser to remember your login information. It uses the headers we pass in, which make the request look like it's coming from a real browser, and routes it through a proxy server, which helps us avoid getting blocked.
In case everything is fine and the website returns a response with a status code of 200 (which means "OK"), the function returns the HTML content of the page. If something goes wrong-for instance, if the website is down, the page does not exist, or the connection to it fails-the function logs the error and returns None. This allows the main program to know whether it needs to retry requesting the page or move on to the next URL.
Create_tables function
def create_tables(conn):
"""
Create necessary tables in the SQLite database if they don't exist.
Args:
conn (sqlite3.Connection): SQLite database connection object.
"""
with conn:
# Create 'urls' table
conn.execute('''
CREATE TABLE IF NOT EXISTS urls (
url TEXT PRIMARY KEY,
category TEXT,
scraped INTEGER DEFAULT 0
)
''')
# Create 'data' table
conn.execute('''
CREATE TABLE IF NOT EXISTS data (
url TEXT PRIMARY KEY,
category TEXT,
product_name TEXT,
stock_type TEXT,
brand TEXT,
price TEXT,
discount TEXT,
sale_price TEXT,
color TEXT,
description TEXT,
size_and_fit TEXT,
details_and_care TEXT,
product_code TEXT
)
''')
# Create 'error_urls' table
conn.execute('''
CREATE TABLE IF NOT EXISTS error_urls (
url TEXT PRIMARY KEY,
error TEXT
)
''')
The create_tables function is like setting up filing cabinets in a new office. It sets up three separate SQLite tables, each serving a different purpose. The 'urls' table is like a to-do list, keeping track of which pages need to be scraped and which ones are already done. The 'data' table is the primary storage for all the product information the scraper gathers; think of it as a spreadsheet with a column for each piece of product information. The 'error_urls' table is like an error log, recording any URLs that couldn't be scraped properly along with what went wrong.
The function uses SQL's CREATE TABLE IF NOT EXISTS so it doesn't accidentally erase existing tables if they're already there - like checking whether a filing cabinet already exists before setting up a new one. Every table has predefined columns defining what type of information it will hold, just as a spreadsheet has columns for different types of data.
Add_scraped_column_if_not_exists function
def add_scraped_column_if_not_exists(conn):
"""
Add a 'scraped' column to the 'urls' table if it doesn't exist.
Args:
conn (sqlite3.Connection): SQLite database connection object.
"""
with conn:
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(urls)")
columns = [column[1] for column in cursor.fetchall()]
if 'scraped' not in columns:
conn.execute('ALTER TABLE urls ADD COLUMN scraped INTEGER DEFAULT 0')
The add_scraped_column_if_not_exists function acts like a database maintenance helper: it ensures the 'urls' table has a way to track which URLs have been processed. It gets a cursor, then uses the SQLite command "PRAGMA table_info(urls)" to fetch information about all existing columns in the 'urls' table - think of this as getting a list of all column headers in a spreadsheet. It turns this into a simple list of column names with a list comprehension and checks whether the list contains a column named 'scraped'. If it doesn't, it adds one using the ALTER TABLE command. The new column is an INTEGER with a default value of 0, where 0 means "not scraped yet" and 1 means "already scraped". This is particularly useful when upgrading an existing database: it preserves backward compatibility and adds the new tracking capability without losing any existing data.
random_delay function
def random_delay():
"""
Introduce a random delay between 20 to 40 seconds.
This function is used to avoid overwhelming the target server with rapid requests.
"""
delay = random.uniform(20, 40)
print(f"Delaying for {delay:.2f} seconds")
time.sleep(delay)
The random_delay function is like a polite pause between requests. If you were browsing a website manually, you wouldn't click through hundreds of pages per second. That's basically what this function simulates: natural-looking delays between requests, randomly waiting between 20 and 40 seconds - almost like rolling dice to decide how long to wait.
It uses Python's random.uniform() to generate these random delays and then prints how long it's going to wait (so you know the program hasn't frozen) before using time.sleep() to actually pause the program. This does make the scraping behaviour look much more human-like and reduces the chance of getting blocked by the website's security systems.
Main function
async def main():
"""
Main execution function that orchestrates the web scraping process.
This function:
1. Loads user agents from a file
2. Establishes database connection
3. Retrieves unscraped URLs
4. Iterates through URLs, scraping data with retry logic
5. Stores successful scrapes in the database
6. Logs errors for failed attempts
Global variables required:
USER_AGENTS_FILE: Path to file containing user agents
DB_FILE: Path to SQLite database file
RETRY_ATTEMPTS: Number of retry attempts for failed scrapes
headers_template: Template for HTTP request headers
cookies: Cookie data for requests
"""
# Load user agents from file
with open(USER_AGENTS_FILE, 'r') as f:
user_agents = [line.strip() for line in f if line.strip()]
# Initialize database connection
conn = sqlite3.connect(DB_FILE)
create_tables(conn)
add_scraped_column_if_not_exists(conn)
# Retrieve unscraped URLs
cursor = conn.cursor()
cursor.execute("SELECT url, category FROM urls WHERE scraped = 0")
urls = cursor.fetchall()
# Process each URL
for url, category in urls:
# Rotate user agent for each request
random_user_agent = random.choice(user_agents)
headers = {**headers_template, 'User-Agent': random_user_agent}
# Implement retry logic
for attempt in range(RETRY_ATTEMPTS):
try:
# Add random delay between requests
random_delay()
html_content = fetch_html(url, headers, cookies)
if html_content:
# Parse HTML content
soup = BeautifulSoup(html_content, 'html.parser')
# Extract all product details
product_data = {
'product_name': parse_product_name(soup),
'stock_type': parse_stock_type(soup),
'brand': parse_brand(soup),
'price': parse_price(soup),
'discount': parse_discount(soup),
'sale_price': parse_sale_price(soup),
'color': parse_color(soup),
'description': parse_description(soup),
'size_and_fit': parse_size_and_fit(soup),
'details_and_care': parse_details_and_care(soup),
'product_code': parse_product_code(soup)
}
# Store data in database
with conn:
conn.execute('''
INSERT OR REPLACE INTO data (
url, category, product_name, stock_type, brand,
price, discount, sale_price, color, description,
size_and_fit, details_and_care, product_code
)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', (url, category, *product_data.values()))
conn.execute('UPDATE urls SET scraped = 1 WHERE url = ?', (url,))
logging.info(f"Successfully scraped data for URL: {url}")
break # Break retry loop on success
else:
logging.warning(f"Failed to retrieve HTML content for URL: {url}")
except Exception as e:
logging.error(f"Error scraping {url} on attempt {attempt + 1}: {e}")
# Log error to database on final attempt
if attempt == RETRY_ATTEMPTS - 1:
with conn:
conn.execute('INSERT OR REPLACE INTO error_urls (url, error) VALUES (?, ?)',
(url, str(e)))
# Clean up database connection
conn.close()
if __name__ == "__main__":
asyncio.run(main())
The main function is kind of like an orchestra conductor, making sure all the other functions work well together. First, it loads a list of different browser user agents, which help make each request look as if it's coming from a different browser, and sets up the connection to the SQLite database. Then it grabs from the database a list of all the URLs that haven't been scraped yet.
For each URL in the list, the function generates a unique request by picking a random user-agent and combining it with the standard headers. Then it attempts to scrape the page with a system for retrying in the case of failures. Between each attempt, it waits a little using the random_delay function. When it successfully scrapes a page, it extracts all the product information using the parsing functions and saves everything to the database. If it fails even after retrying, it logs the error in the error_urls table.
The 'async' keyword on the function indicates it's meant to be non-blocking, though this example doesn't take full advantage of asyncio's capabilities yet. The function also takes care of shutting things down properly: when it's done, it makes sure the database connection is closed, much like closing all the filing cabinets before leaving the office.
All of these pieces work together like a well-oiled machine: main coordinates everything, fetch_html fetches the web pages, random_delay keeps the timing natural, and create_tables makes sure there's a place to store all the collected data. Error handling throughout the process ensures that if something goes wrong with one page, the whole scraping operation doesn't crash.
Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.
FAQ SECTION
1. What data can be extracted from Net-a-Porter using web scraping?
Our solution can extract a wide range of data, including product names, prices, discounts, descriptions, brand names, categories, materials, stock availability, and customer reviews. This data enables detailed insights into luxury fashion trends, pricing strategies, and product performance.
2. How can web scraping from Net-a-Porter benefit luxury fashion brands?
By analyzing scraped data, luxury fashion brands can identify market trends, optimize pricing strategies, monitor competitor offerings, and understand customer preferences. This information is invaluable for making data-driven decisions and staying competitive in the high-end market.
3. Is it legal to scrape data from Net-a-Porter?
Web scraping is subject to legal and ethical considerations. We prioritize compliance with Net-a-Porter’s terms of service and applicable laws by using publicly available data responsibly. If you have specific compliance concerns, our team can provide tailored solutions.
4. Can your solution handle dynamic content and frequent website changes?
Yes, our advanced web scraping tools are designed to handle dynamic content, such as JavaScript-rendered pages, and adapt to changes in website structure. We ensure data extraction remains accurate and reliable, even with updates on Net-a-Porter’s website.