Founded by Guccio Gucci in Florence, Italy, in 1921, Gucci has emerged as one of the most renowned luxury fashion houses, producing top-notch clothing, accessories, and leather goods. In the ready-to-wear category, especially for women, the house's artful marriage of tradition and innovation shows in bold prints, intricate embroidery, and fine craftsmanship that continue to influence global fashion trends and cater to the tastes of discerning customers.
Web scraping is the process of extracting data from websites using automated scripts or tools. It has become an indispensable technique for data analysts, marketers, and researchers, since it enables the fast acquisition of large amounts of information.
Gucci's women's ready-to-wear category spans a wide range of products, from dresses to outerwear, with constantly changing styles, prices, and stock levels. Scraping this data allows us to monitor trends, analyze pricing strategies, and gain insights into product popularity, which is valuable for competitive analysis, forecasting, and data-driven business decisions. This Gucci web scraping project follows a two-phase approach built on the Datahut API service, which helps get around the most common error in web scraping: the 403 Forbidden.
The 403 Forbidden error occurs whenever a site detects and blocks scraping attempts using anti-scraping mechanisms such as IP blocking, user-agent detection, request rate limiting, or CAPTCHAs. Services like Datahut help work around such restrictions by rotating IP addresses, handling request headers and cookies, dealing with JavaScript rendering, and respecting the website's crawl rate limits. In the first phase, the script fetches basic product information from Gucci's product grid, targeting women's ready-to-wear items, via the site's network API endpoints. Authentication and randomized user agents help extract product names, variants, prices, and links across multiple pages. The second phase focuses on in-depth information from the individual product pages, using BeautifulSoup to parse the HTML content and extract comprehensive descriptions and other details. Throughout the process, the script applies robust error handling with three retries for failed requests and uses random delays to simulate a human browsing pattern. All scraped data is managed through an SQLite database, which tracks scraping status and stores the final, comprehensive dataset.
The collected data is then cleaned with tools like OpenRefine and further processed with Python libraries such as Pandas and NumPy to turn it into a structured dataset ready for analysis and decision-making.
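As a rough illustration of that post-processing step, the sketch below assumes the gucci_final_data table produced in Step 2 of this project and the gucci_webscraping.db file used throughout; the output filename is arbitrary.

# A rough post-processing sketch (assumes the Step 2 schema shown later)
import sqlite3
import pandas as pd
import numpy as np

conn = sqlite3.connect('gucci_webscraping.db')
df = pd.read_sql_query('SELECT * FROM gucci_final_data', conn)
conn.close()

# Basic cleaning: trim whitespace, normalise missing values, coerce price to numbers
df['product_name'] = df['product_name'].str.strip()
df = df.replace({'N/A': np.nan})
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Drop duplicate products and save a structured dataset for analysis
df = df.drop_duplicates(subset='product_url')
df.to_csv('gucci_women_rtw_clean.csv', index=False)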
An Overview of Libraries for Seamless Data Extraction
Requests
The requests library is used to make HTTP requests in Python. It simplifies sending GET and POST requests to websites and APIs. In this code, requests.get() is used to send requests to the Datahut API (which retrieves product data from Gucci’s website) with various headers (like User-Agent) to mimic a real browser. It handles network communication between the script and the web resources.
BeautifulSoup
BeautifulSoup belongs to the bs4 package and is a powerful Python library for parsing HTML and XML documents. It makes searching inside the parsed HTML structure quite simple. In this script, it was applied to extract product descriptions and details from the HTML response fetched via the Datahut API.
Sqlite3
The sqlite3 module provides a lightweight, efficient way to manage SQLite databases within Python. It allows the script to store product URLs, product information, and a status flag indicating whether each product has already been scraped. SQLite is used so data persists between sessions and progress can be tracked across runs.
Time
The time module is part of the Python standard library and provides time-related functions. In this script, time.sleep() introduces random delays between requests, which helps mimic human behavior and avoid triggering the anti-scraping mechanisms on the target website.
Uniform
The uniform() function from the random module generates random floating-point numbers within a range. In this code, it produces a random sleep duration between 2 and 5 seconds; randomizing the delay between requests reduces the chances of getting blocked or rate-limited by the website.
Why SQLite Outperforms CSV for Web Scraping Projects
SQLite is an excellent choice for storing scraped data because of its simplicity, reliability, and efficiency. It is a self-contained, serverless, zero-configuration database, so it can be integrated into a web scraping workflow without the overhead of setting up and maintaining a larger database system. It stores data directly on the file system, which is ideal for scraping, where large amounts of data or URLs need to be stored locally and quickly. Its lightweight design allows fast retrieval with simple queries, making it far more efficient than relying on CSV files, which become cumbersome and slow once the dataset grows large.
One of SQLite's most valuable features for web scraping is that it handles interruptions gracefully. By tracking scraping status in the database, we know which URLs have been scraped successfully and which have not. If a run is interrupted by a network failure or a system crash, we can resume scraping from where it left off instead of repeating the same URLs. This is especially helpful when scraping large datasets, where an interruption would otherwise wipe out all progress.
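A minimal sketch of that resume pattern is shown below; the table and column names mirror the gucci_products table and status flag used later in this project.

import sqlite3

conn = sqlite3.connect('gucci_webscraping.db')
cursor = conn.cursor()

# Only URLs that have not been scraped yet (status = 0) are fetched,
# so a restart naturally resumes where the previous run stopped.
cursor.execute('SELECT product_link FROM gucci_products WHERE status = 0')
pending_urls = [row[0] for row in cursor.fetchall()]

for url in pending_urls:
    # ... scrape the URL here ...
    # Flag the URL as done so it is skipped on the next run.
    cursor.execute(
        'UPDATE gucci_products SET status = 1 WHERE product_link = ?',
        (url,)
    )
    conn.commit()

conn.close()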
STEP 1: Extracting Gucci Product Attributes from JSON API Responses: Scraping Name, Variant, Price, and Product Link
Importing Libraries
import requests
import time
from random import uniform
import sqlite3
This section imports essential libraries for making HTTP requests, introducing random delays, and managing SQLite databases.
Setting Up the Base URL and API Key to Avoid Access Restrictions
# Define the base URL and API key for the Datahut API
BASE_URL = 'https://api.datahut.com/'
API_KEY = '1234567890abcdef1234567890abcdef'
Using the Datahut API is a good alternative when traditional web scraping attempts run into forbidden errors. Sites deploy various forms of security, such as IP blocking or bot detection, that keep automated scraping scripts from accessing their content. This often results in a 403 Forbidden error, because the server refuses the request due to access restrictions.
These restrictions can largely be avoided because the API is designed to expose structured data for direct access without facing the same constraints. APIs like Datahut's provide a controlled, authenticated environment for data extraction, which reduces forbidden errors and the overall unreliability of the scrape.
How the API and Base URL function in This Code
BASE_URL: This is the root URL for all the API calls. It serves as the entry point from which to make requests to the Datahut servers.
API_KEY: An authentication key required by the API to authenticate the identity of the user. It checks whether the request is valid and whether the user is authorized to access the resources provided by the API.
The code sends the request to the BASE_URL with the proper parameters and attaches the API_KEY. After authentication, the API processes the request and returns data in a structured format. This is much more straightforward than traditional scraping, where users have to parse HTML pages and navigate complex webpage structures.
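As a minimal sketch of that request pattern (using the placeholder key above and the product-grid URL used later in this script), a single call looks roughly like this:

import requests

BASE_URL = 'https://api.datahut.com/'
API_KEY = '1234567890abcdef1234567890abcdef'  # placeholder key

# The target page is passed as a parameter; the API fetches it on our
# behalf and returns the response.
params = {
    'api_key': API_KEY,
    'url': (
        'https://www.gucci.com/us/en/c/productgrid?'
        'categoryCode=women-readytowear&show=Page&page=0'
    ),
}
response = requests.get(BASE_URL, params=params)
print(response.status_code)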
Flexibility in Choosing Data Service
Though this code is set up to use Datahut's API, it does not lock users into that one service. Where websites enforce strict access controls or block traditional scraping methods, users can choose from many other free or paid APIs that cater to different data extraction needs. Whether you use Datahut or a different provider, the general approach is the same: the API simplifies and secures data collection without running into website restrictions.
Setting Header Templates for Efficient Scraping
HEADERS_TEMPLATE = {
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br, zstd',
'Accept-Language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
'Cache-Control': 'no-cache',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'X-Requested-With': 'XMLHttpRequest'
}
This code defines a dictionary named HEADERS_TEMPLATE, which is used to set HTTP headers when making requests to a web server. While scraping websites or interacting with APIs, it is important to send headers that simulate a real browser request. Each key-value pair in this dictionary is an HTTP header that shapes the request into the form the server expects. Sending such headers makes requests resemble those of a browser, which makes the server less likely to block or refuse the scraping attempt; some of these headers are specifically used by the server to judge the authenticity of a request, and without them the script may face restrictions or CAPTCHA challenges.
Establishing a Connection to an SQLite Database
def connect_db(db_path):
"""
Establish a connection to the SQLite database.
This function uses the provided database path to create a connection
to the SQLite database. If the specified database does not exist,
a new database file will be created.
Parameters:
db_path (str): The file path to the SQLite database. This should
be a valid file path where the database is stored
or will be created.
Returns:
sqlite3.Connection: A connection object that can be used to interact
with the SQLite database. This object allows
the execution of SQL commands and queries on
the database.
"""
return sqlite3.connect(db_path)
The connect_db(db_path) function establishes the link between the Python application and an SQLite database. It takes one argument, the path of the database file. If the file exists, the function opens it for SQL operations; if it doesn't, SQLite creates a new database at the specified path. It returns an sqlite3.Connection object, through which SQL commands can be executed via methods such as cursor(), execute(), and commit().
Fetching a Random User Agent from the Database
def get_random_user_agent(db_path):
"""
Fetch a random user agent from the SQLite database.
This function establishes a connection to the specified SQLite database
and retrieves a random user agent string from the 'user_agents' table.
The user agent is used to simulate different browsers or devices in
web scraping, helping to avoid detection by websites.
Parameters:
db_path (str): The file path to the SQLite database containing the
'user_agents' table. This should be a valid path
where the database is stored.
Returns:
str: A random user agent string retrieved from the database. If
the 'user_agents' table is empty, this function will raise an
IndexError when attempting to access the first element of
the fetched result.
"""
conn = connect_db(db_path)
cursor = conn.cursor()
cursor.execute(
'SELECT user_agent FROM user_agents '
'ORDER BY RANDOM() LIMIT 1'
)
user_agent = cursor.fetchone()[0]
conn.close()
return user_agent
The get_random_user_agent(db_path) function returns one random user-agent string from an SQLite database. User agents simulate different browsers or devices during crawling and thus help avoid detection. By randomly selecting a user agent from the available pool, the function makes web requests appear to come from different sources, which improves the success rate of the scrape.
The function takes one parameter, db_path, the path to the SQLite database containing the user_agents table. It connects to the database, runs an SQL query that selects one user agent at random, and returns the result. If the query succeeds, a user-agent string is returned; otherwise an error is raised when there are connection problems or the table is empty. This provides a solid mechanism for user-agent cycling, limiting the chance of being blocked while scraping.
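The script assumes that a user_agents table already exists in the same database. A one-time setup sketch might look like the following; the user-agent strings here are purely illustrative, and any pool of real browser strings can be used instead.

import sqlite3

conn = sqlite3.connect('gucci_webscraping.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS user_agents (user_agent TEXT)')

# Illustrative user-agent strings; substitute your own pool.
sample_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.4 Safari/605.1.15',
]
cursor.executemany(
    'INSERT INTO user_agents (user_agent) VALUES (?)',
    [(ua,) for ua in sample_agents]
)
conn.commit()
conn.close()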
Creating the Gucci Products Table in the Database
def create_table_if_not_exists(conn):
"""
Create the 'gucci_products' table in the database if it doesn't exist.
This function checks for the existence of the 'gucci_products' table
in the connected SQLite database. If the table does not exist, it
creates the table with the specified columns: product_name, variant,
price, and product_link. This function ensures that the table is
available for storing product data scraped from the Gucci website.
Parameters:
conn (sqlite3.Connection): A connection object to the SQLite database.
This should be a valid connection
established using sqlite3.connect().
Returns:
None: This function does not return a value. It performs an operation
on the database and commits any changes made.
Raises:
sqlite3.Error: If there is an error executing the SQL command to
create the table, an sqlite3.Error will be raised,
providing information about the issue.
"""
cursor = conn.cursor()
cursor.execute(
'''
CREATE TABLE IF NOT EXISTS gucci_products (
product_name TEXT,
variant TEXT,
price REAL,
product_link TEXT
)
'''
)
conn.commit()
The create_table_if_not_exists(conn) function ensures that a table for storing the Gucci product information exists in the SQLite database. It checks for the gucci_products table and creates it if it is not found. This keeps the scraped data organized and provides a structured format for collecting product information.
The function takes one argument, conn, a connection object to the SQLite database established by sqlite3.connect(). When run, it executes an SQL command that creates the table with four columns: product_name, variant, price, and product_link, each representing a relevant attribute of a Gucci product. The function has no return value; it creates the table in place and commits the changes to the database.
Inserting Scraped Product Data into the Database
def insert_data_to_db(conn, data):
"""
Insert scraped product data into the database.
This function takes a connection object and a list of product data
tuples and inserts them into the 'gucci_products' table in the
connected SQLite database. Each tuple should contain values for
the columns: product_name, variant, price, and product_link.
Parameters:
conn (sqlite3.Connection): A connection object to the SQLite database.
This should be a valid connection
established using sqlite3.connect().
data (list of tuples): A list containing tuples of product data to be
inserted into the database. Each tuple must
have the format (product_name, variant,
price, product_link).
Returns:
None: This function does not return a value. It performs an operation
on the database and commits any changes made.
"""
cursor = conn.cursor()
cursor.executemany(
'''
INSERT INTO gucci_products
(product_name, variant, price, product_link)
VALUES
(?, ?, ?, ?)
''',
data
)
conn.commit()
The insert_data_to_db(conn, data) function writes scraped product data into the gucci_products table of the SQLite database. It populates the database with the product details collected during the web scraping run and ensures that everything is stored systematically for further analysis or retrieval.
The function takes two arguments: conn, a valid connection object to the SQLite database created with sqlite3.connect(), and data, a list of tuples with the product information to insert. Each tuple must follow the structure (product_name, variant, price, product_link), so that insertion stays efficient and organized.
The function uses the executemany() method, which inserts multiple tuples into the database in a single call; this improves performance and reduces the number of round trips to the database. After running the insert command, the function commits the changes so the inserted rows are saved. If the data list is empty, executemany() simply inserts nothing, and any problems during insertion raise an sqlite3.Error, giving a clear point for error handling. This keeps the handling of scraped product data robust and efficient.
Saving Scraped Data to the SQLite Database
def save_to_database(db_path, data):
"""
Wrapper function to save the scraped data into the SQLite database.
This function manages the overall process of saving scraped product
data into an SQLite database. It establishes a connection to the
database, ensures that the necessary table exists, and inserts the
provided data into the 'gucci_products' table. Once the data is
successfully inserted, the function closes the database connection.
Parameters:
db_path (str): The file path to the SQLite database where the data
should be saved. This should be a valid path where
the database is stored or will be created.
data (list of tuples): A list containing tuples of product data to be
saved in the database. Each tuple must
contain values corresponding to the
columns: product_name, variant, price,
and product_link.
Returns:
None: This function does not return a value. It performs an operation
on the database and commits any changes made.
"""
conn = connect_db(db_path)
create_table_if_not_exists(conn)
insert_data_to_db(conn, data)
conn.close()
The save_to_database(db_path, data) function is a wrapper around all the steps required to save scraped product data into the SQLite database: it connects to the database, makes sure the table exists with the right schema, inserts the given data, and finally closes the connection to free system resources.
It takes two parameters: db_path, the file path of the SQLite database as a string, and data, a list of tuples containing the product information to be saved. Every element in the list must be a tuple matching the columns of the gucci_products table: product_name, variant, price, and product_link. Supplying the data in this format keeps the database correct and tidy.
When the function executes, it first connects to the SQLite database via connect_db(), then ensures the gucci_products table exists using create_table_if_not_exists(). Once the table is in place, it adds the product data with insert_data_to_db(), and finally conn.close() closes the connection to avoid wasting resources.
Fetching Data from the API for Gucci Products
def fetch_gucci_page_data(api_key, url, headers):
"""
Make the request to Datahut API for a given page and return the response.
This function constructs and sends a GET request to the Datahut API
using the provided API key, target URL, and headers. It retrieves
the HTML content or data for a specified page of Gucci products.
The function also includes the 'keep_headers' parameter to ensure
that response headers are maintained in the output.
Parameters:
api_key (str): The API key used to authenticate with the Datahut API.
This key should be kept secure and not shared publicly.
url (str): The target URL for which the data is being requested.
This should be the complete URL of the Gucci products page.
headers (dict): A dictionary of HTTP headers to include in the request.
This can include headers like 'User-Agent', 'Accept',
and others necessary for proper request formation.
Returns:
response (requests.Response): The response object returned by the
requests library. This object contains
the server's response to the HTTP
request, including status code,
headers, and content.
"""
payload = {
'api_key': api_key,
'url': url,
'keep_headers': 'true'
}
response = requests.get(
BASE_URL,
params=payload,
headers=headers
)
return response
The fetch_gucci_page_data(api_key, url, headers) function fetches data from Datahut's API: the HTML content or data for a given page of Gucci products. To do this, it needs an API key for authentication, the URL of the Gucci products page, and the HTTP headers that make the request look legitimate.
The function takes three parameters: api_key, a string holding the secret key used to authenticate with the Datahut API; url, the full web address of the Gucci products page being requested; and headers, a dictionary of HTTP headers such as 'User-Agent' and 'Accept'. All of these are needed to form a correct request.
When called, the function builds a parameter dictionary containing the API key, the destination URL, and the keep_headers flag, then sends a GET request to the Datahut API with requests.get(), passing the base URL, parameters, and headers. It returns the requests.Response object, which carries everything the server sent back, including the status code, headers, and content.
If something goes wrong during the request, such as a network error, a bad URL, or a timeout, requests.get() raises a requests.exceptions.RequestException, which clearly indicates what went wrong and lets developers handle the error gracefully.
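A hedged usage sketch of this function is shown below; it assumes the API_KEY, HEADERS_TEMPLATE, and fetch_gucci_page_data defined above, and wraps the call so that network failures and non-200 responses surface clearly.

import requests

# Assumes API_KEY, HEADERS_TEMPLATE, and fetch_gucci_page_data from above.
target_url = (
    'https://www.gucci.com/us/en/c/productgrid?'
    'categoryCode=women-readytowear&show=Page&page=0'
)
headers = HEADERS_TEMPLATE.copy()  # a random User-Agent would be added here

try:
    response = fetch_gucci_page_data(API_KEY, target_url, headers)
    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
    data = response.json()
except requests.exceptions.RequestException as exc:
    print(f"Request failed: {exc}")
    data = None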
Parsing Product Data from API Response
def parse_product_data(data):
"""
Parse the product data from the API response.
This function takes the JSON response data from the Datahut API and
extracts relevant product information, including product name, variant,
price, and product link. The extracted data is organized into a list of
tuples, each representing a single product. This structured format is
suitable for further processing or storage in a database.
Parameters:
data (dict): A dictionary representing the JSON response from the
Datahut API. This dictionary is expected to contain a
structure with a 'products' key, which in turn has an
'items' key that holds a list of product details.
Returns:
list of tuples: A list containing tuples, where each tuple consists
of the following elements:
(product_name, variant, price, product_link).
Each tuple represents a single product.
Raises:
KeyError: If the expected keys ('products' or 'items') are not present
in the provided data dictionary, a KeyError will be raised,
indicating that the structure of the data is not as expected.
TypeError: If the input data is not a dictionary or if 'items' is not
a list, a TypeError will be raised, informing that the
function received unexpected data types.
"""
scraped_data = []
for item in data['products']['items']:
product_name = item['productName']
variant = item['variant']
price = item['rawPrice']
product_link = (
f"https://www.gucci.com/us/en{item['productLink']}"
)
scraped_data.append(
(product_name, variant, price, product_link)
)
return scraped_data
The parse_product_data(data) function extracts product information from the JSON response Datahut returns. Specifically, it pulls out each Gucci product's name, variant, price, and product link, and organizes them into a list of tuples that can be processed further or stored in a database.
The function takes a single parameter, data, which it expects to be a dictionary representing the JSON response from the Datahut API. The expected structure has a products key containing an items key, which holds the list of product details. The function iterates through every entry in the items list and extracts the desired attributes.
On each iteration, the function reads productName, variant, and rawPrice, and builds product_link by prefixing the relative link with the main Gucci site address, so the resulting product_link is a complete URL that can be opened directly. The extracted values are appended to scraped_data as a tuple.
The function returns this list of tuples, where each tuple has the form (product_name, variant, price, product_link) and represents one product.
Overall, parse_product_data() is the critical step that transforms raw JSON data from the Datahut API into a structured format that can be easily processed or stored, making data handling efficient for web scraping applications.
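For clarity, here is a made-up response snippet with the shape parse_product_data() expects; the field values are illustrative only.

# Made-up response snippet with the structure parse_product_data() expects.
sample_response = {
    'products': {
        'items': [
            {
                'productName': 'Silk twill dress',
                'variant': 'Ivory',
                'rawPrice': 3200.0,
                'productLink': '/pr/women/ready-to-wear/example-style',
            }
        ]
    }
}

print(parse_product_data(sample_response))
# [('Silk twill dress', 'Ivory', 3200.0,
#   'https://www.gucci.com/us/en/pr/women/ready-to-wear/example-style')]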
Scraping Product Data from Gucci's Website
def scrape_page(base_url, api_key, db_path, category_code, page):
"""
Scrape a single page of product data from the Gucci website.
This function constructs the URL for a specific category page on the
Gucci website, makes a GET request to the Datahut API to retrieve
the product data, and then parses the response to extract the relevant
product information. The function handles the request process,
including setting the appropriate headers and managing user agents
to avoid detection.
Parameters:
base_url (str): The base URL of the Datahut API. This is used for
making API requests.
api_key (str): The API key for authentication with the Datahut API.
This key is used to authorize requests.
db_path (str): The file path to the SQLite database containing user
agents. This is used to fetch a random user agent for
the request headers.
category_code (str): The category code for the specific Gucci product
category being scraped. This code is used in the
URL to specify which products to retrieve.
page (int): The page number to scrape from the Gucci product grid.
This parameter is used to construct the request URL.
Returns:
list: A list of tuples containing the parsed product data for the
specified page. Each tuple consists of the product name,
variant, price, and product link. If the request fails, an
empty list is returned.
Raises:
requests.exceptions.RequestException: If there is an issue with the
network request, such as a
timeout or invalid URL, a
requests.exceptions.RequestException
will be raised.
"""
url = (
f'https://www.gucci.com/us/en/c/productgrid?'
f'categoryCode={category_code}&show=Page&page={page}'
)
headers = HEADERS_TEMPLATE.copy()
headers['User-Agent'] = get_random_user_agent(db_path)
print(f"Making GET request for page {page}...")
response = fetch_gucci_page_data(api_key, url, headers)
if response.status_code == 200:
print(f"Request successful for page {page}. Parsing data...")
return parse_product_data(response.json())
else:
print(
f"Failed to retrieve data for page {page}. "
f"Status code: {response.status_code}"
)
return []
The scrape_page function automatically retrieves product data from a specific page of the Gucci website through the Datahut API. It ties together several parts of the scraping process: URL construction, the API request, response handling, and data parsing. It accepts five parameters: base_url, the base URL of the Datahut API used for sending requests; api_key, the authentication key that authorizes API access; db_path, the path to the SQLite database holding the user agents used in the request headers; category_code, the code specifying which product category to scrape; and page, the page number to fetch.
The function builds the request URL from the category code and page number, sets up headers with a randomly chosen user agent, and sends the GET request to the Datahut API. If the request succeeds with a 200 status code, it parses the JSON response and returns a list of tuples with the product name, variant, price, and product link. On failure, it logs an error message with the status code and returns an empty list; any network-related problems surface as a requests.exceptions.RequestException. In short, scrape_page() streamlines the collection of product data from Gucci's online catalog and integrates the essential elements of the scraping workflow into one efficient function.
Comprehensive Product Scraping: The scrape_gucci_products Function
def scrape_gucci_products(
base_url,
api_key,
db_path,
category_code,
start_page=0,
end_page=13
):
"""
Scrape multiple pages of Gucci product data and save it to the SQLite database.
This function orchestrates the scraping process for a specified range of
pages of Gucci products from a given category. It utilizes the Datahut API
to fetch the product data for each page, aggregates the results, and
stores the data in an SQLite database. To avoid detection and potential
blocking by the server, a random delay is introduced between requests.
Parameters:
base_url (str): The base URL of the Datahut API. This is used for
making API requests.
api_key (str): The API key for authentication with the Datahut API.
This key is used to authorize requests.
db_path (str): The file path to the SQLite database where the scraped
product data will be stored.
category_code (str): The category code for the specific Gucci product
category being scraped. This code is used in the
URL to specify which products to retrieve.
start_page (int, optional): The starting page number for scraping.
Defaults to 0, which indicates the first page.
end_page (int, optional): The ending page number for scraping.
Defaults to 13, allowing for up to 14 pages
to be scraped.
Returns:
None: This function does not return a value. The scraped data is saved
directly to the database.
Raises:
ValueError: If the start_page is greater than the end_page, a
ValueError will be raised to indicate an invalid range.
Exception: Any other exceptions raised during the scraping process,
such as network errors or database issues, will propagate up.
"""
scraped_data = []
for page in range(start_page, end_page + 1):
data = scrape_page(
base_url,
api_key,
db_path,
category_code,
page
)
scraped_data.extend(data)
# Introduce a random delay to avoid detection
delay = uniform(2, 5)
print(
f"Delaying for {delay:.2f} seconds before next request..."
)
time.sleep(delay)
# Save all scraped data to the database
save_to_database(db_path, scraped_data)
The scrape_gucci_products(base_url, api_key, db_path, category_code, start_page=0, end_page=13) function is the end-to-end routine for downloading multiple pages of Gucci product data and saving the result to SQLite. It orchestrates the scraping workflow at a high level, using the Datahut API to fetch product data from the women's ready-to-wear category for a given range of pages. It accepts several arguments: base_url, the base URL of the Datahut API; api_key for authentication; db_path, the file path to the SQLite database where the scraped data is stored; category_code, which specifies the product category to scrape; and the optional start_page and end_page, which define the range of pages to process.
Inside the function, an empty list called scraped_data accumulates the results from each page. The function loops through the specified pages, calling scrape_page() to fetch the product data for each one and extending scraped_data with the results. To avoid detection and potential blocking by the server, a random delay of 2 to 5 seconds is added between requests, mimicking more realistic browsing behavior. Once all pages have been processed, the function calls save_to_database() to write the aggregated data into the target SQLite database. It returns nothing, since its job is simply to collect and store the data efficiently. Overall, scrape_gucci_products() wraps the entire process of scraping and storing product data into one seamless, automated function.
Script Execution Entry Point: The main Block
if __name__ == "__main__":
"""
Entry point of the script for scraping Gucci product data.
This block of code is executed when the script is run as the main module.
It sets the category code for the Gucci products to be scraped and specifies
the path to the SQLite database where the scraped data will be stored.
It then calls the `scrape_gucci_products` function to initiate the scraping
process using the defined base URL, API key, database path, and category code.
The following parameters are initialized:
- category_code (str): A string that specifies the product category to
scrape. In this case, it is set to
'women-readytowear', indicating that the script
will scrape women's ready-to-wear products from Gucci.
- db_path (str): A string representing the file path for the SQLite
database where the scraped product data will be saved.
This is set to 'gucci_webscraping.db'.
"""
category_code = 'women-readytowear'
db_path = (
'gucci_webscraping.db'
)
scrape_gucci_products(
BASE_URL,
API_KEY,
db_path,
category_code
)
The if __name__ == "__main__": block serves as the entry point of the script, executing the scraping process when the script is run directly. Within this block, the category code for the Gucci products is set to 'women-readytowear', indicating that the script will specifically target the women's ready-to-wear collection. Additionally, the path to the SQLite database is defined as 'gucci_webscraping.db', where all scraped product data will be stored. Finally, the scrape_gucci_products function is invoked with the necessary parameters, including the base URL, API key, database path, and category code, thereby initiating the scraping process for Gucci products in the specified category.
STEP 2: Extracting Detailed Descriptions and Specifications from Individual Product URLs
Importing Libraries
import requests
from bs4 import BeautifulSoup
import sqlite3
import time
from random import uniform
This section imports the essential libraries for making HTTP requests, parsing HTML with BeautifulSoup, introducing random delays, and managing SQLite databases.
Setting Up the Base URL and API Key to Avoid Access Restrictions
BASE_URL = 'https://api.datahutapi.com/'
API_KEY = '1234567890abcdef1234567890abcdef'
This defines the base URL and API key for the Datahut API, mirroring the setup from Step 1.
Database Path Configuration: Setting DB_PATH
DB_PATH = 'gucci_webscraping.db'
The DB_PATH variable points to the SQLite database created in Step 1, which already contains the gucci_products and user_agents tables and will also hold the final scraped data.
Global HTTP Headers Configuration: Defining HEADERS
# Define global headers with a placeholder for User-Agent
HEADERS = {
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br, zstd',
'Accept-Language': 'en-IN,en-GB;q=0.9,en-US;q=0.8,en;q=0.7,ml;q=0.6',
'Cache-Control': 'no-cache',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'X-Requested-With': 'XMLHttpRequest'
}
This defines the same global headers used in Step 1; the User-Agent is filled in with a random value at request time.
Function for Retrieving a Random User Agent: get_random_user_agent
def get_random_user_agent():
"""
Fetch a random user agent from the database.
This function connects to the SQLite database specified by the
global constant `DB_PATH`, retrieves a random user agent string
from the `user_agents` table, and returns it. The user agents
are used to simulate requests from different browsers or devices,
which can help avoid detection when scraping websites.
Returns:
str: A random user agent string retrieved from the database.
Raises:
sqlite3.Error: If a database error occurs during the connection,
execution, or closing of the database.
IndexError: If the query returns no results (i.e., the table is empty).
"""
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
cursor.execute(
"SELECT user_agent FROM user_agents "
"ORDER BY RANDOM() LIMIT 1"
)
user_agent = cursor.fetchone()[0]
conn.close()
return user_agent
The get_random_user_agent function retrieves a random user agent string from an SQLite database specified by the global constant DB_PATH. This function connects to the database, executes a query to select a random user agent from the user_agents table, and returns it. By simulating requests from different browsers or devices, the function helps reduce the risk of detection during web scraping.
Scraping Product Description and Details from a URL
def scrape_product_details(url):
"""
Scrape product details from the provided URL using Datahut API.
This function makes a GET request to the Datahut API to retrieve
product details from the specified URL. It handles the request
using randomized user agents to reduce the likelihood of being
blocked by the target website. If the request fails, it will
attempt to retry up to three times before giving up.
Args:
url (str): The URL of the product page to scrape.
Returns:
tuple: A tuple containing two elements:
- str: The product description. If the description
cannot be retrieved, it returns 'N/A'.
- str: A string containing additional product details.
If details cannot be retrieved, it returns 'N/A'.
Raises:
requests.exceptions.RequestException: If there is an issue
with the GET request (e.g., network problems).
Notes:
The function introduces a random delay between 2 to 5 seconds
for each request attempt to mimic human browsing behavior and
avoid triggering anti-scraping mechanisms.
"""
headers = HEADERS.copy() # Create a copy of the global headers
headers['User-Agent'] = get_random_user_agent() # Update User-Agent
payload = {'api_key': API_KEY, 'url': url}
for attempt in range(3):
delay = uniform(2, 5)
time.sleep(delay)
print(f"Making GET request to Datahut API for URL {url} "
f"(Attempt {attempt + 1}/3) "
f"with delay of {delay:.2f} seconds...")
response = requests.get(
BASE_URL,
params=payload,
headers=headers
)
if response.status_code == 200:
return parse_product_details(response.text)
else:
print(f"Failed to retrieve URL {url} "
f"with status code {response.status_code}")
print(f"Failed to scrape URL {url} after 3 attempts.")
return 'N/A', 'N/A'
The scrape_product_details function fetches product details for a given URL through the Datahut API. It first copies the global headers and replaces the User-Agent with a random one fetched from the database, then builds a parameter dictionary containing the API key and the target URL. The function attempts the GET request to the Datahut API up to three times, inserting a random delay of 2 to 5 seconds before each attempt to mimic human browsing and avoid triggering anti-scraping mechanisms.
After each request, the function checks the response code. If the request succeeds with status code 200, it parses the product details from the response text and returns them as a tuple. If the request fails, it prints an error and tries again until the retry limit is reached. If all attempts fail, it returns the default value 'N/A' for both the product description and the additional details. This function is central to collecting rich product data while keeping the scraper's footprint low.
Parsing Product Details from HTML Response
def parse_product_details(html):
"""
Parse product details from the HTML response.
This function takes an HTML response as input, parses it using
BeautifulSoup, and extracts the product description and additional
product details. It specifically looks for a div with the class
'product-detail' and retrieves information from its paragraph and
list items.
Args:
html (str): The HTML response from the product page to be parsed.
Returns:
tuple: A tuple containing two elements:
- str: The product description. If no description is found,
it returns 'N/A'.
- str: A string containing additional product details,
concatenated from list items. If no details are found,
it returns 'N/A'.
Raises:
None: This function does not raise any exceptions but prints a
message if the required div is not found.
Notes:
The function uses the `strip=True` parameter to clean up whitespace
from the extracted text, ensuring a clean output.
"""
soup = BeautifulSoup(html, 'html.parser')
product_detail_div = soup.find('div', class_='product-detail')
if product_detail_div:
product_description = (
product_detail_div.find('p').get_text(strip=True)
if product_detail_div.find('p')
else 'N/A'
)
product_details_list = [
li.get_text(strip=True)
for li in product_detail_div.find_all('li')
]
product_details = ', '.join(product_details_list)
return product_description, product_details
print("No product detail div found.")
return 'N/A', 'N/A'
The parse_product_details function extracts product details from the HTML response of a product page. It uses BeautifulSoup to parse the HTML content, which makes it easy to navigate and search the document. Specifically, it looks for a <div> element with the class product-detail, which holds the relevant information for the product.
Once the product detail <div> is found, the function tries to read the product description from the first <p> element inside it. If that paragraph exists, its text is extracted and stripped of surrounding whitespace; otherwise, the description defaults to 'N/A'. The function also gathers the remaining product details from all <li> items inside the same div and joins them into a single comma-separated string.
If the product detail div is not present at all, the function prints a message and returns 'N/A' for both the description and the details. Overall, this function turns the raw HTML response into structured product information ready for further processing or storage.
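For clarity, here is a made-up HTML fragment with the structure parse_product_details() looks for, along with the output it would produce.

# Made-up HTML fragment with the structure parse_product_details() expects.
sample_html = '''
<div class="product-detail">
  <p>Silk twill dress with contrast trim.</p>
  <ul>
    <li>100% silk</li>
    <li>Made in Italy</li>
  </ul>
</div>
'''

print(parse_product_details(sample_html))
# ('Silk twill dress with contrast trim.', '100% silk, Made in Italy')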
Initializing the SQLite Database
def initialize_database():
"""
Initialize the SQLite database and create necessary tables.
This function connects to the SQLite database specified by the
global constant `DB_PATH`. It performs the following operations:
- Alter the **gucci_products** table to add the `status` column
(if it doesn't already exist).
- Create the **gucci_final_data** table to store final scraped product data,
including URL, name, variant, price, description, and additional details.
Returns:
sqlite3.Connection: The SQLite connection object for further
operations on the database.
"""
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()
    # Add the 'status' column to gucci_products; ignore the error if it
    # already exists (SQLite's ALTER TABLE has no IF NOT EXISTS clause).
    try:
        cursor.execute(
            'ALTER TABLE gucci_products ADD COLUMN status INTEGER DEFAULT 0'
        )
    except sqlite3.OperationalError:
        pass  # Column was added on a previous run
# Create the gucci_final_data table
cursor.execute('''
CREATE TABLE IF NOT EXISTS gucci_final_data (
product_url TEXT PRIMARY KEY,
product_name TEXT,
variant TEXT,
price TEXT,
product_description TEXT,
product_details TEXT
)
''')
conn.commit() # Commit changes to the database
return conn
The initialize_database function handles the SQLite database setup for storing Gucci product data. It begins by connecting to the database at the specified path. One of its core operations is modifying the existing gucci_products table: adding a status column that defaults to 0 lets the script flag each product according to whether its data has already been fully scraped. This makes it possible to track scraping progress, which matters for large datasets or when scraping is paused and resumed later.
Beyond modifying gucci_products, the function also creates the gucci_final_data table if it doesn't already exist. This table holds the complete record for each product: URL, name, variant, price, product description, and additional details. Keeping the final data in its own table keeps things organized and makes it easy to query specific product details after scraping.
Once these operations are done, the function commits the changes so the new column and table are saved, and returns the SQLite connection so other parts of the program can perform further database operations. With this structure in place, the database provides a solid foundation for scraping and managing large volumes of product data.
Updating Product Status in the Database
def update_product_status(conn, url, status):
"""
Update the scraping status of a product in the database.
This function modifies the `status` field of a product in the
`gucci_products` table, indicating whether the product has
been successfully scraped or not.
Parameters:
conn (sqlite3.Connection): The SQLite database connection
object used to execute the update query.
url (str): The product link (URL) of the product whose
status needs to be updated. This should match an existing
entry in the `gucci_products` table.
status (int): The new status value to set for the product.
Commonly, this is 0 (not scraped) or 1 (scraped).
Raises:
sqlite3.Error: May raise exceptions related to database
operations, but these are not explicitly handled in this
function.
Notes:
- It is assumed that the `url` provided exists in the
`gucci_products` table. If the URL does not exist, no
changes will be made.
- This function should be called after scraping a product
to update its status accordingly.
"""
cursor = conn.cursor()
cursor.execute('''
UPDATE gucci_products
SET status = ?
WHERE product_link = ?
''', (status, url))
conn.commit()
The update_product_status function updates the scraping status of a particular product in the gucci_products table, which is how the script keeps track of which products have already been scraped. It takes three parameters: conn, the SQLite connection object used to execute SQL queries; url, the product link whose status needs updating; and status, an integer value, typically 0 for not scraped and 1 for scraped successfully. When run, the function executes an SQL UPDATE statement that sets the status field of the product matching the given URL and commits the change so the database reflects the new status. The URL is assumed to already exist in the gucci_products table; otherwise no row is updated. The function should be called after a product has been scraped so that its status in the database stays accurate.
Inserting and Updating Final Scraped Product Data in the Database
def insert_final_data(conn, data):
"""
Insert the final scraped data into the gucci_final_data table.
This function inserts or updates product details in the
`gucci_final_data` table, which stores the final scraped
information for products. If a product URL already exists
in the table, its corresponding record will be replaced with
the new data.
Parameters:
conn (sqlite3.Connection): The SQLite database connection
object used to execute the insert or replace query.
data (tuple): A tuple containing the product details to
insert into the `gucci_final_data` table. The tuple must
have the following elements in order:
- product_url (str): The URL of the product.
- product_name (str): The name of the product.
- variant (str): The variant of the product.
- price (str): The price of the product.
- product_description (str): A description of the product.
- product_details (str): Additional details about the product.
Raises:
sqlite3.Error: May raise exceptions related to database
operations, but these are not explicitly handled in this
function.
Notes:
- It is assumed that the `data` tuple is well-formed
and matches the expected schema of the `gucci_final_data`
table.
- This function should be called after successfully
scraping a product's details to store them in the database.
"""
cursor = conn.cursor()
cursor.execute('''
INSERT OR REPLACE INTO gucci_final_data (
product_url, product_name, variant,
price, product_description, product_details
)
VALUES (?, ?, ?, ?, ?, ?)
''', data)
conn.commit()
The insert_final_data function inserts or updates the final scraped data in the gucci_final_data table of the SQLite database, keeping the stored product details current after scraping. It takes two parameters: conn, the SQLite connection object used to execute SQL queries, and data, a tuple holding the product details to be written to the table. The tuple must contain, in order, the product URL, product name, variant, price, product description, and additional product details.
The function runs an SQL INSERT OR REPLACE statement, which inserts a new record into gucci_final_data or replaces an existing record when the product URL is already present, so the table always holds up-to-date information for each product. After executing the query with the given data, the function commits the change so it persists. The data tuple must be well-formed and match the schema of the gucci_final_data table, and the function should be called right after a product's details have been scraped successfully so the information is stored for later use.
Scraping Products from the Database
def scrape_products():
"""
Main function to scrape products from the database.
This function connects to the SQLite database, retrieves
products that have not been scraped (status = 0) from the
`gucci_products` table, and scrapes their details using
the `scrape_product_details` function. Successfully scraped
product details are then inserted into the `gucci_final_data`
table, and the scraping status of each product is updated
to indicate completion.
Steps performed by this function:
1. Initialize the database connection.
2. Fetch all products with a status of 0 from the
`gucci_products` table.
3. For each product, scrape its details.
4. If scraping is successful, insert the data into the
`gucci_final_data` table and update the product status.
5. If scraping fails, print an error message with the
product URL.
6. Close the database connection after processing all
products.
Raises:
sqlite3.Error: May raise exceptions related to database
operations, but these are not explicitly handled in this
function.
Notes:
- This function should be called to initiate the
scraping process for products in the database.
- The `scrape_product_details` function is expected
to return 'N/A' for any fields that cannot be scraped,
which indicates a failure to retrieve the product details.
"""
conn = initialize_database()
cursor = conn.cursor()
cursor.execute('''
SELECT product_link, product_name,
variant, price FROM gucci_products
WHERE status = 0
''')
rows = cursor.fetchall()
for row in rows:
url, product_name, variant, price = row
product_description, product_details = (
scrape_product_details(url)
)
if product_description == 'N/A' or product_details == 'N/A':
# Optionally, you can save the failed URLs
print(f"Failed to scrape {url}.")
else:
data = (
url, product_name, variant,
price, product_description,
product_details
)
# Insert final data into gucci_final_data table
insert_final_data(conn, data)
update_product_status(conn, url, 1) # Mark as scraped
conn.close()
print("Scraping completed and data saved to the SQLite database.")
The scrape_products function is the main driver that scrapes the details of every pending product stored in the SQLite database, following a structured workflow for efficient data retrieval and management. It connects to the database via initialize_database, which ensures all the relevant tables are ready.
The function then runs an SQL query to fetch all products with a status of 0, meaning they have not yet been scraped, from the gucci_products table. For every product retrieved, it calls scrape_product_details with the product URL to fetch the detailed information.
If scraping succeeds, meaning neither field comes back as 'N/A', the function builds a data tuple with the product's URL, name, variant, price, description, and additional details, and inserts it into the gucci_final_data table through insert_final_data. The product's status is then set to 1 by calling update_product_status. If scraping fails for a product, the function prints an error message indicating the URL that could not be processed.
After all items have been processed, the database connection is closed and a completion message is printed, confirming that scraping is finished and the data has been saved to the SQLite database. This function encapsulates the entire detail-scraping process, from database interaction to data handling, covering every step involved in collecting product information and saving it.
Entry Point for the Scraping Process
if __name__ == '__main__':
scrape_products()
The if __name__ == '__main__': block is the entry point for running the script as a standalone program. Inside this conditional block, the scrape_products() function is invoked to start the scraping process.
This construct checks whether the script is being executed directly rather than imported as a module by another script. When run directly, scrape_products() carries out the whole workflow of connecting to the database, fetching the pending products, scraping their details, and storing the results.
Including this check ensures that the scraping operations run only when intended, keeping the module reusable in other contexts if needed. It is also a common Python convention that makes it obvious where execution begins and keeps the code easier to manage.
Libraries and Versions
This project relies on a few key libraries for web scraping and data processing: BeautifulSoup4 (v4.12.3) for parsing HTML content and Requests (v2.32.3) for making HTTP requests. Using these versions ensures smooth integration and functionality throughout the scraping workflow.
Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.
FAQs
1. Can Datahut scrape fashion websites like Gucci for product data?
Answer: Yes, Datahut specializes in web scraping services and can extract product data from fashion websites like Gucci. Our services are tailored to provide detailed insights into product descriptions, pricing, discounts, and more, while adhering to ethical and legal guidelines.
2. What kind of data can I extract from Gucci’s website using web scraping?
Answer: With web scraping, you can extract valuable data from Gucci’s website, including product names, descriptions, prices, availability, material details, and even customer reviews. This data can help you analyze market trends, pricing strategies, and product diversity in the luxury fashion industry.
3. Do I need technical expertise to scrape Gucci’s data?
Answer: Not at all! Datahut provides end-to-end web scraping solutions, so you don’t need any technical expertise. We handle the entire process, from setting up the scraper to delivering clean, structured data ready for analysis.
4. How does Datahut ensure compliance with legal and ethical standards when scraping websites like Gucci?
Answer: At Datahut, we prioritize ethical web scraping practices. We strictly follow legal regulations, including terms of service, and use responsible scraping techniques to avoid overloading target websites. Our services include consultations to help you navigate data usage policies and compliance standards.