How to Scrape Amazon's Smartwatch Data for Time-Series Analysis

Amazon is the world's top-selling e-commerce site offering a wide and dynamic selection of smart watches, which is actually a comprehensive marketplace for wearable technology. There consumers can find a good selection of smart watches in terms of major brands including Apple, Samsung, and Fitbit, among others. This diversity that actually makes the smart watches segment at Amazon a very valuable hub in terms of understanding market dynamics, pricing strategies, and what works for consumers in terms of wearable technology.

The web scraping system used here is two phased and sequential with the intention of gathering and processing data efficiently while being respectful to Amazon's platform. The first phase collects links of products from the Amazon smart watch category pages, while the second phase dives deeper into each product page to gather detailed information. The two-step approach ensures comprehensive data collection while maintaining the reliability and efficiency of the system.

System to first phase: methodically follows on every Amazon smart watches category page and collects product URLs within a structured process. The first thing the system will be doing is developing a connection with the SQLite database to which the collected URLs will be stored, and then start methodically following on all the pages of the category section. It uses BeautifulSoup to parse the HTML content and fetch links from products and introduces smart rate limiting with some random delay to make a gentle scraping pattern. It keeps all product URLs with the date of collection in a wide database of smart watches available.

The actual collector is the second phase of this process. It utilizes Playwright for dynamic content rendering and fetching product details. This stage processes each of the gathered URLs to provide all the details of the products, such as titles, features, customer ratings, price information, descriptions, technical and extra specifications. The system uses advanced error handling mechanisms along with separate tracking of all failed attempts to ensure no product information is missed. All the extracted data are carefully organized and stored within a structured SQLite database; each entry is meticulously date-stamped to enable time-series analysis.

The benefits of this web scraping system go far beyond simple data collection. This system, from a market intelligence perspective, allows the tracking of price variations in time, product popularity as reviewed by metrics, and the identification of emerging trends in features and specifications. The technical implementation is robust, with good data quality, error handling, and tracking of failed URLs. Date-stamped entries enable rich temporal analysis. The advanced approach of the system toward rate limiting and user agent rotation ensures that data is collected reliably without overloading the servers of Amazon.

Step1 :Product Link Scraping

Setting Up the Scraping Workflow

import requests
from bs4 import BeautifulSoup
import time
import random
import sqlite3

In this section, we start with the setup of the base process of scraping. We import all the necessary Python libraries: requests for sending HTTP requests to websites, BeautifulSoup from the bs4 package for parsing HTML content, and some additional utilities such as time, random, and sqlite3. The time and random libraries make use of random pauses between requests to avoid overwhelming the website, and sqlite3 lets us connect to a local database to store the links of the products to be scraped.

Setting Up the Date for Time-Series Data Scraping

# Step 1: Define the date variable at the top
DATE_VALUE = 'October 28, 2024'  # Change this value as needed

At the top of the script, a date variable named DATE_VALUE is defined. This variable will allow us to manually enter the current date on which data is being scraped. For any kind of time-series analysis, the specific date of when the data is gathered needs to be tracked. With each product link that is stored, we can track the change in data, like when products might not be available or when details of products get updated.

Connecting to the Database

# Step 2: Define database connection
def connect_db():
    """
    Establishes a connection to the SQLite database.

    This function connects to the 'amazon.db' SQLite database file. 
    If the database does not exist, SQLite will create a new database with the specified 
    name.

    Returns:
        sqlite3.Connection: A connection object that represents the database.
    """
    conn = sqlite3.connect('amazon.db')
    return conn

I define the function connect_db() to connect to a SQLite database named amazon-timeseries-webscraping.db. SQLite is basically a lightweight, file-based database system; if the database file does not exist, SQLite automatically creates it. The function makes sure that we can structure our product data and retrieve it for analysis. This function will return a connection object to us every time we want to interact with the database and allow us to read and write data to the database.

Creating a Table to Store Product Links

# Step 3: Create table if it doesn't exist (without unique constraint for the link)
def create_table(conn):
    """
    Creates the 'smart_watches_links' table in the database if it 
    does not already exist.

    This function defines a table schema for storing smartwatch links. 
    The table includes the following columns:
        - `id`: Primary key, auto-incremented integer identifier.
        - `link`: URL of the product, stored as text. This column 
          does not enforce a unique constraint, allowing duplicate 
          entries.
        - `status`: Integer indicating the status of the link 
          (default is 0).
        - `date`: Text storing the date the link was added.

    Args:
        conn (sqlite3.Connection): A connection object to the 
        SQLite database.

    Returns:
        None
    """
    query = '''CREATE TABLE IF NOT EXISTS 
    "smart_watches_links" (
    id INTEGER PRIMARY KEY AUTOINCREMENT, 
    link TEXT NOT NULL, 
    status INTEGER DEFAULT 0,
    date TEXT NOT NULL)''' 
    
    conn.execute(query)
    conn.commit()

The create_table() function creates a table in the SQLite database if it doesn't exist; in this case, the table is called smart_watches_links, and it is meant for storing links of smartwatches product. Every row in this table is a product link, and there are several fields in it: an id - it is a unique identifier for every entry and is automatically incremented for every new row; a link - it stores the URL of the product and does not require uniqueness, allowing duplicate entries; a status - it is an integer used to track the link's status, set to 0 by default; a date - it records the date when the link was added to the database.

Setting HTTP Headers for Web Requests

# Step 4: Define the base URL and headers
def get_headers():
    """
    Returns the HTTP headers required for making
    requests to the website.

    These headers include the 'User-Agent' and
    'Accept-Language' fields to mimic a legitimate
    browser request, helping to avoid detection as
    a bot during web scraping.

    Returns:
        dict: A dictionary containing HTTP headers
                used in the request, including:
            - 'User-Agent': Identifies the client
                            browser (in this case, Chrome).
            - 'Accept-Language': Specifies the preferred
                                language for the response .
   """
    return {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/85.0.4183.121 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }

The get_headers() function returns a set of HTTP headers that mimic a real user's browser when making requests to Amazon. Most websites block requests that appear to be coming from bots, so to avoid this we use a User-Agent string that looks like it's coming from a legitimate web browser, such as Google Chrome on Windows. The Accept-Language header is set to English, which tells the server that we want the website content in English. These headers are critical for making successful requests without getting blocked by the website.

Extracting Product Links from a Page

# Step 5: Extract product links from a page
def extract_product_links(soup, base_url):
    """
    Extracts product links from a BeautifulSoup object 
    representing a web page.

    This function searches for anchor tags that match 
    specific classes indicating product links on the 
    page. It constructs the full URLs by appending 
    the relative link found in each anchor tag to 
    the provided base URL. The resulting list of 
    full product URLs is returned.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
    containing the parsed HTML of the page.
    base_url (str): The base URL of the website to 
    append to the extracted relative links.

    Returns:
    list: A list of full product URLs extracted 
    from the page.
    """
    product_links = soup.find_all(
        'a',
        class_="a-link-normal s-underline-text s-underline-link-text "
               "s-link-style a-text-normal")
    
    links = []
    for link in product_links:
        href = link.get('href')
        if href:
            full_url = f"{base_url}{href}"
            links.append(full_url)
    return links

This is named extract_product_links(). What needs to be done here is to get a list of product links; the web page that contains all these links is given and represented as a BeautifulSoup object-called soup. It checks every such anchor tag, clearly pointed out specifically by class names in those tags which exists somewhere in the parsed HTML that contains href attribute; so it is going to return a string containing the relative URL of this product. If the href is available, then the function builds the full URL by appending this relative link to the base_url provided. The list of complete product URLs is then collected and returned. This function is important in streamlining the URLs that would be saved in the database while performing web scraping, hence making it easy to acquire individual product pages for further extraction of data.

Saving Product Links to the Database

# Step 7: Save links to database 
def save_links_to_db(conn, links, date):
    """
    Saves product links to the database, allowing duplicates.

    This function takes a list of product links and saves each
    link to the specified database, with the current date
    and a default status of 0. It opens a cursor to execute
    an SQL INSERT statement for each link, storing it in
    the "smart_watches_links" table with the provided date
    value. The function allows duplicate entries, meaning
    multiple records with the same link can be added to
    the database. Once all links are saved, the database
    connection commits the changes.
        
    Parameters:
    conn (sqlite3.Connection): The connection object to the
    SQLite database.
    links (list): A list of product URLs to save in the
    database.
    date (str): The date to associate with each link.
    """
    cursor = conn.cursor()
    for link in links:
        cursor.execute(
            '''INSERT INTO "smart_watches_links" 
            (link, status, date) VALUES (?, 0, ?)''', 
            (link, date)
        )  # Save the date
    conn.commit()

The function save_links_to_db stores each product link in a list to a SQLite database table called "smart_watches_links". It requires three parameters: conn- the database connection object; links- list of product URLs; and date- a string with date for data save. It first creates a cursor for database operations and then follows in an iteration over every link in the list. For each link, it inserts a new row into the table with a link, status 0, and the provided date. Finally, after processing all links, the function commits the transaction so that all the changes are permanent in the database.

Finding the Next Page of Results

# Step 8: Get the next page URL
def get_next_page(soup, base_url):
    """
    Retrieves the URL of the next page in a paginated
    list of search results.

    This function searches for the link to the next page
    of results on the current web page using BeautifulSoup.
    If the "next page" link is found, it appends the
    relative link to the provided base URL to construct
    a full URL for the next page. If no "next page" link
    is found, it returns None.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object that
    contains the parsed HTML of the current page.
    base_url (str): The base URL of the website to prepend
    to the next page link.

    Returns:
    str or None: The full URL of the next page if found;
    otherwise, None.
    """
    next_page = soup.find('a', class_="s-pagination-next")
    if next_page and 'href' in next_page.attrs:
        return f"{base_url}{next_page['href']}"
    return None

The get_next_page function allows navigation through pages of a paginated web search result. It uses BeautifulSoup to parse the HTML content of the current page and searches for the link with the class that indicates the "next page" button, which is commonly labeled as "s-pagination-next". When found, it extracts the relative URL from the href attribute. Then, by appending that relative URL to the base_url, a full URL is created so that using this URL one can actually navigate the next page. If this "next page" link is not found then the function returns as None showing there are no more pages left for scraping. This comes in useful while scraping web data while having to structure the loops for multiple pages.

Scraping All Pages

# Step 9: Scrape all pages and save links to the database
def scrape_all_pages(conn, start_url, base_url):
    """
    Scrapes multiple pages and saves product links
    to a database.

    This function starts at a given URL and iterates
    through paginated search result pages, scraping
    product links from each page. It retrieves and
    parses each page’s content, extracts product links,
    and saves them to the database with the provided
    date. The function continues to the next page if
    available, until no more pages are found.

    Parameters:
    conn (sqlite3.Connection): Database connection
    object to store product links.
    start_url (str): The URL of the first page to scrape.
    base_url (str): The base URL of the website used
    to construct full product URLs.

    Returns:
    None
    """
    current_url = start_url
    headers = get_headers()
    
    while current_url:
        print(f"Scraping page: {current_url}")
        response = requests.get(current_url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract links and save to DB
        links = extract_product_links(soup, base_url)
        save_links_to_db(conn, links, DATE_VALUE)  # Pass the date value
        
        # Find the next page URL
        next_page_url = get_next_page(soup, base_url)
        if next_page_url:
            current_url = next_page_url
            time.sleep(random.uniform(2, 5))  # Avoid blocking
        else:
            print("No more pages to scrape.")
            break

The scrape_all_pages function is designed to walk through and scrape multiple pages of a search results listing. At the start_url, it fetches the HTML content by sending a request and then parse the content using BeautifulSoup. After parsing, it applies the extract_product_links function to gather all product links on that page, which then saves those links into a database with a given date using the save_links_to_db function. To continue scraping, the function checks for the existence of a "next page" link on the current page by calling the get_next_page function. If there is a next page, it updates the current URL and waits a random amount of time between 2 to 5 seconds to avoid being blocked by the server. The function keeps on repeating the loop until it finds the last page where no "next page" link exists and then stops, thus ending the process of scraping. This will make for an organized way to get links from various pages.

Main Function: Tying Everything Together

# Step 10: Main function to run the scraper
def main():
    """
    Main function to initiate and run the web scraper.

    This function establishes a connection to the
    database, creates a table for storing product
    links, and then initiates the process to scrape
    multiple pages for product URLs. Once scraping
    is complete, it closes the database connection
    to ensure all data is properly saved.

    Parameters:
    None

    Returns:
    None
    """
    base_url = "https://www.amazon.in"
    start_url = "https://www.amazon.in/s?i=fashion&rh=n%3A27413352031&s=popularity-rank&fs=true&ref=lp_27413352031_sar"
    
    # Connect to the database and create table
    conn = connect_db()
    create_table(conn)
    
    # Scrape all pages and save links
    scrape_all_pages(conn, start_url, base_url)
    
    conn.close()

The main function is the entry point of the scraper program. There, it initializes two very important URLs: base_url - the main domain of the website, and start_url-the first page with products that are to be scraped. Then, it connects to the database with the connect_db function and creates the table structure required by calling create_table. After initializing the database the function invokes scrape_all_pages, handling navigation through all the pages and saving the link to the product. On finish with scraping, the function closes the database connection as a guarantee that all saved data has been successfully retrieved.

Running the Scraper

# Run the main function
if __name__ == "__main__":
    """
    Run the main function if this script is executed
    as the primary module.

    This condition checks whether the script is being
    run directly or imported as a module in another
    script. If it is the main script being run, it
    calls the main function to start the web scraping
    process.
    """
    main()

This function allows to give the flow structure in its run to perform an entirely scraping process. In Python, each script has a special built-in variable called name. When the script is run directly, then name is set to " main". This check is what prevents the main function from running if this script is imported as a module in another script. For instance, it calls the main() inside it, which begins the whole process of scraping-starting from connecting to the database up to saving product links. This approach is helpful for modularity as functions can be reused elsewhere if needed without running the scraping logic automatically.

STEP 2:Product Data Scraping From Product Links

Library Imports for Web Scraping and Database Management

import sqlite3
import random
import time
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

This module imports necessary libraries for web scraping and SQLite database for data managing. Every library added performs some specific function within the functionality of the program:

sqlite3:

This is a Python's built-in library that offers the SQLite database interface, meaning you can create, read, update, and delete records in a lightweight database. SQLite is very good for web scraping because data is efficiently stored without full database server overhead.

random:

This library is also used for generating random numbers and doing random operations. In web scraping, it is very commonly used to introduce time delays between requests, like time.sleep() to emulate the effect of human browsing, not being blocked by the server.

time:

There are several time-related functions that the time module has. When it comes to web scraping, it is typically utilized to stop the execution of the program for a specific amount of time. That is important not to load the server with requests and to control the timing of requests properly.

playwright.sync_api:

This is a library in Playwright that is an automation library for browser-based applications. Sync_playwright allows the controlling of browsers like Chrome, Firefox, and Safari in a synchronous manner. This makes the automation of web scraping much easier, especially when you need to navigate through certain web pages, interact with some elements, or perhaps need to extract data which JavaScript renders dynamically.

bs4 (BeautifulSoup):

Beautiful Soup is a library used to parse HTML and XML documents. It generates a parse tree for parsing the given HTML document and allows search queries to find and navigate through elements easily. This is highly beneficial in extracting specific data points from web pages once they are loaded into the browser, especially if one uses Playwright in scraping content rendered by JavaScript.

Function to Load User Agent Strings for Web Scraping

def load_user_agents(filepath="data/user_agents.txt"):
    """
    Load user agent strings from a specified file.

    This function reads user agent strings from a text file, where each 
    user agent is expected to be on a new line. It strips any leading or 
    trailing whitespace from each line and filters out empty lines. If 
    no user agents are found, it raises a ValueError to indicate that 
    the list of user agents is empty.

    Parameters:
    filepath (str): The path to the text file containing user agents. 
                    Defaults to "data/user_agents.txt".

    Returns:
    list: A list of user agent strings.
    """
    with open(filepath, 'r') as file:
        agents = [
            line.strip() for line in file if line.strip()
        ]
        if not agents:
            raise ValueError("No user agents found in file")
        return agents

The load_user_agents function should read a file containing user agents, used to identify web browsers and devices when making requests towards a website. Each line of the file should contain exactly one user agent. At first, the function opens a file, reads all its lines removing unnecessary spaces. It then removes all empty lines so that only the valid user agents will remain in the final list. If no user agent exists in the file, it will throw an error so that the user will be informed that the file is empty. Finally, it returns a list of user agents that can be used to make web scraping requests appear as if they come from different browsers or devices, which can help prevent the scraping from being blocked.

Function to Generate Random Delay for Web Scraping

def get_random_delay():
    """
    Generate a Random Delay for Web Scraping

    This function returns a random floating-point number that represents 
    a delay in seconds. The delay is generated within a specified range 
    of 3.0 to 6.0 seconds, which can be used to space out web scraping 
    requests.

    Returns:
    float: A random delay value between 3.0 and 6.0 seconds.
    """
    return random.uniform(3.0, 6.0)

The get_random_delay function is designed to help avoid detection of web scraping activity as automated behavior. It returns a random floating-point number that represents a delay in seconds from 3.0 to 6.0. This delay can be used between consecutive web scraping requests to make the requests appear more human-like. It reduces the risk of being blocked by the website being scraped through introduction of variability in the timing of requests, which makes for a more sustainable and respectful approach to scraping.

Function to Fetch Web Page Content Using Random User Agent

def fetch_page_content(url, user_agents):
    """
    Fetch Web Page Content Using a Random User Agent

    This function retrieves the HTML content of a specified web page 
    by navigating to the provided URL. It uses Playwright to open 
    a browser instance with a randomly selected user agent from the 
    provided list. The function introduces a random delay before 
    making the request to avoid detection by the website.

    Parameters:
    url (str): The URL of the web page to be scraped.
    user_agents (list): A list of user agent strings to choose from.

    Returns:
    str: The HTML content of the web page.
    """
    user_agent = random.choice(user_agents)  
    
    with sync_playwright() as p:  
        browser = p.chromium.launch(headless=True)  
        context = browser.new_context(
            user_agent=user_agent
            )  
        page = context.new_page()  
        
        try:
            delay = get_random_delay()  
            time.sleep(delay)  
            page.goto(url)  
            content = page.content()  
            return content  
        finally:
            browser.close()

fetch_page_content is the function aimed at fetching the HTML content on a web page by navigating to a specified URL. It loads a user agent from a list provided to the function to behave like different browsers, which may help to evade possible blocking by the website. Before making the request, the function introduces some random delays for requests made, so that requests seem more human-like than algorithmic. By using the Playwright, the function launches the headless browser, navigates to the URL, and captures the resulting HTML content, returning it to you as a string. This function is very useful in web scraping: while you will scrape information from website pages, you don't run much risk of getting flagged as a bot.

Function to Extract Product Title from HTML Soup

def extract_title(soup):
    """
    Extract Product Title from HTML Soup

    This function retrieves the product title from an 
    HTML document represented by the BeautifulSoup object. 
    It looks for a specific HTML element identified by 
    the span tag with the ID 'productTitle'. If the title 
    is found, it returns the cleaned text; otherwise, it 
    returns None.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
                          representing the HTML document.

    Returns:
    str or None: The product title as a string, or None 
                 if the title is not found.
    """
    title_tag = soup.find('span', id='productTitle')
    
    if title_tag:
        title = title_tag.get_text(strip=True)
        return title  
    return None

The extract_title function will extract the product title from an HTML document using BeautifulSoup. It looks for a certain element, which is usually a <span> tag with the ID productTitle, containing the name of the product. If such an element is found, the function retrieves and cleans the text removing any extra whitespace. The clean title is then returned as a string. In case the title is not available, it returns None. This is a critical function for any web scraping activity where retrieving a title for a product is required in analysis or presentation.

Function to Extract Product Ratings from HTML Soup

def extract_ratings(soup):
    """
    Extract Product Ratings from HTML Soup

    This function retrieves product ratings and review 
    counts from an HTML document represented by the 
    BeautifulSoup object. It looks for a specific 
    div element with the ID 'averageCustomerReviews'. 
    If the ratings and review counts are found, they are 
    returned as a dictionary; otherwise, the function 
    returns a dictionary with None values.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
                          representing the HTML document.

    Returns:
    dict: A dictionary containing the product rating and 
          review count, with keys 'Rating' and 
          'Review Count'.
    """
    ratings = {}
    
    reviews_div = soup.find(
        'div', id='averageCustomerReviews'
    )
    
    if reviews_div:
        rating_tag = reviews_div.find(
            'span', class_='a-size-base a-color-base'
        )
        
        if rating_tag:
            rating = rating_tag.get_text(strip=True)
        else:
            rating = None
        
        review_count_tag = reviews_div.find(
            'span', id='acrCustomerReviewText'
        )
        
        if review_count_tag:
            review_count = review_count_tag.get_text(strip=True)
        else:
            review_count = None
        
        ratings['Rating'] = rating
        ratings['Review Count'] = review_count
    
    return ratings

The extract_ratings function is used to pull ratings for a product along with the number of reviews from an HTML document. It looks for a <div> element with an ID of averageCustomerReviews, which contains the relevant data. It will look inside this for <span> tags carrying rating and review counts. Function takes the inner text out from the tag, processes the extracted string by taking away unnecessary whitespace, then returns those as values of a dictionary. Both ratings and the number of reviews are being checked first and if cannot find a respective value puts None. The final output is a dictionary containing the product's rating and review count, making it useful for data analysis in web scraping applications.

Function to Extract Sale and Retail Prices from HTML Soup

def extract_prices(soup):
    """
    Extract Prices from HTML Soup

    This function retrieves both sale and retail prices 
    from an HTML document represented by the BeautifulSoup 
    object. It looks for a specific div element with the 
    ID 'corePriceDisplay_desktop_feature_div'. If the 
    prices are found, they are returned as a dictionary; 
    otherwise, the function returns a dictionary with None 
    values.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
                          representing the HTML document.

    Returns:
    dict: A dictionary containing the sale price and 
          retail price, with keys 'Sale Price' and 
          'Retail Price'.
    """
    prices = {}
    
    price_div = soup.find(
        'div', id='corePriceDisplay_desktop_feature_div'
    )
    
    if price_div:
        sale_price_tag = price_div.find(
            'span', class_='a-price-whole'
        )
        
        if sale_price_tag:
            sale_price = sale_price_tag.get_text(strip=True)
        else:
            sale_price = None
        
        retail_price_tag = price_div.find(
            'span', class_='a-text-price'
        )
        
        if retail_price_tag:
            retail_price = retail_price_tag.get_text(strip=True)
        else:
            retail_price = None
            
        if retail_price:
            if '₹' in retail_price:
                retail_price = retail_price.split('₹')[-1].strip()
        
        prices['Sale Price'] = sale_price
        
        if retail_price:
            prices['Retail Price'] = f'{retail_price}'
        else:
            prices['Retail Price'] = None
    
    return prices

This is an extract_prices function that pulls out the sale and retail prices from an HTML document that has been represented with a BeautifulSoup object. It's finding a <div> that contains its id corePriceDisplay_desktop_feature_div. If the former exists, then it'll check within the div for certain span tags holding the sale and retail prices. It fetches and cleans the text from these tags, removing any unwanted whitespace. The function also checks for the currency symbol and processes the retail price accordingly. Finally, it returns a dictionary containing both the sale price and retail price, with values set to None if they cannot be found. This function is helpful for web scraping applications focused on collecting product pricing data.

Function to Extract Product Description from HTML Soup

def extract_description(soup):
    """
    Extract Product Description from HTML Soup

    This function retrieves the product description 
    from an HTML document represented by the BeautifulSoup 
    object. It first looks for an unordered list (<ul>) 
    with the class 'a-unordered-list a-vertical a-spacing-small'. 
    If found, it collects all list items (<li>) within that 
    list. If no such list is found, it checks for another 
    unordered list with a slightly different class 
    'a-unordered-list a-vertical a-spacing-mini'. 
    The collected texts are returned as a list.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
                          representing the HTML document.

    Returns:
    list: A list of strings containing the product description 
          items extracted from the HTML.
    """
    description = []
    
    # Condition 1: Check for <ul class="a-unordered-list a-vertical a-spacing-small">
    ul_lists = soup.find_all(
        'ul', 
        class_='a-unordered-list a-vertical a-spacing-small'
    )
    
    if ul_lists:  
        # If any <ul> elements found for Condition 1
        for ul in ul_lists:
            list_items = ul.find_all('li')
            
            for item in list_items:
                text = item.get_text(strip=True)
                description.append(text)
    else:
        # Condition 2: If no <ul> elements found in Condition 1,
        # check for <ul class="a-unordered-list a-vertical a-spacing-mini">
        ul_mini = soup.find_all(
            'ul', 
            class_='a-unordered-list a-vertical a-spacing-mini'
        )
        
        for ul in ul_mini:
            list_items = ul.find_all('li')
            
            for item in list_items:
                text = item.get_text(strip=True)
                description.append(text)

    return description

This extract_description function is designed to extract the product description from an HTML document represented by a BeautifulSoup object. The search looks for unordered lists with the class a-unordered-list a-vertical a-spacing-small. If such lists exist, it retrieves the text of each list item and appends it to a description list. If no lists of the first class are found, it looks for another unordered list with the class a-unordered-list a-vertical a-spacing-mini. The function is very useful for web scraping applications, which require understanding the product's features or details from an e-commerce page. The result is a list of descriptive strings that give insight to the characteristics of the product.

Function to Extract Technical Details from HTML Soup

def extract_technical_details(soup):
    """
    Extract Technical Details from HTML Soup

    This function retrieves technical specifications 
    of a product from an HTML document represented 
    by a BeautifulSoup object. It first checks for 
    a table with the ID 'productDetails_techSpec_section_1'. 
    If found, it collects key-value pairs from the rows 
    of that table. If this table is not present, 
    it attempts to find an alternative table 
    with the ID 'technicalSpecifications_section_1' 
    and extracts the details from there. 
    The collected technical details are returned as a dictionary.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object 
                          representing the HTML document.

    Returns:
    dict: A dictionary containing technical details 
          with keys as specifications and values as 
          their corresponding information.
    """
    technical_details = {}
    
    # First check the original table structure
    tech_table = soup.find(
        'table', 
        id='productDetails_techSpec_section_1'
    )
    
    if tech_table:
        tech_rows = tech_table.find_all('tr')
        
        for row in tech_rows:
            key_tag = row.find(
                'th', 
                class_='a-color-secondary a-size-base prodDetSectionEntry'
            )
            value_tag = row.find(
                'td', 
                class_='a-size-base prodDetAttrValue'
            )
            
            if key_tag and value_tag:
                key = key_tag.get_text(strip=True)
                value = value_tag.get_text(strip=True)
                technical_details[key] = value
    else:
        # Else condition to handle the alternative HTML structure
        tech_table_alt = soup.find(
            'table', 
            id='technicalSpecifications_section_1'
        )
        
        if tech_table_alt:
            tech_rows_alt = tech_table_alt.find_all('tr')
            
            for row in tech_rows_alt:
                key_tag_alt = row.find(
                    'th', 
                    class_='a-span5 a-size-base'
                )
                value_tag_alt = row.find(
                    'td', 
                    class_='a-span7 a-size-base'
                )
                
                if key_tag_alt and value_tag_alt:
                    key_alt = key_tag_alt.get_text(strip=True)
                    value_alt = value_tag_alt.get_text(strip=True)
                    technical_details[key_alt] = value_alt
    
    return technical_details

This extract_technical_details function is supposed to scrap and return the technical details of a product from an HTML representation in the form of a BeautifulSoup object. The function will look for a particular table within the HTML that has a productDetails_techSpec_section_1 ID when called since most online retailers such as Amazon use it in storing their product details. If this table is available, it scans through every row of the table searching for a specification name in the header cell (<th>) and its corresponding detail in the data cell (<td>). Then, it pulls out and cleans text from these elements and puts them in a dictionary where keys are the names of specifications, and values are their respective details. If the table with the original ID is not found, then the function identifies an alternative table with the ID technicalSpecifications_section_1 and continues to repeat the extraction if this table exists. The function returns a dictionary with all details that are relevant at the end so access and sorting of product specifications becomes easy. This is very useful in web scraping applications, hence allowing a user to extract, compare, and analyze product information easily.

Extracting Product Information with HTML Parsing

def extract_additional_details(soup):
    """
    Extracts product details from a given HTML structure.
    
    This function attempts to gather product information from 
    two possible structures commonly found on product pages. 
    It first searches for a bullet-point list format located 
    within a div with the ID 'detailBulletsWrapper_feature_div', 
    extracting key-value pairs from list items ('li') that contain 
    bold labels (in a 'span' with the class 'a-text-bold') and 
    corresponding values. If this structure is not found, the 
    function will look for a table structure within a table with 
    the ID 'productDetails_detailBullets_sections1'. Here, each 
    row ('tr') of the table provides a label ('th') and a value 
    ('td') that are extracted and added to a dictionary.
    
    Parameters:
        soup (BeautifulSoup): Parsed HTML content of the webpage.
        
    Returns:
        dict: A dictionary containing product details, where keys 
              are attribute labels and values are attribute details.
    """
    details = {}
    
    # Attempt to extract details from the first structure 
    # (ul list with detail bullets)
    detail_wrapper = soup.find(
        'div', 
        id='detailBulletsWrapper_feature_div'
    )
    
    if detail_wrapper:
        print(
            "Extracting details from detailBulletsWrapper_feature_div..."
        )
        
        ul_list = detail_wrapper.find_all('li')
        
        for li in ul_list:
            label = li.find(
                'span', 
                class_='a-text-bold'
            )
            value = li.find_all('span')[-1]  # Value is usually the last span
            
            if label and value:
                label_text = label.get_text(strip=True)
                value_text = value.get_text(strip=True)
                
                details[
                    label_text.replace(':', '')
                ] = value_text

    # Attempt to extract details from the second structure 
    # (table format)
    detail_table = soup.find(
        'table', 
        id='productDetails_detailBullets_sections1'
    )
    
    if detail_table:
        print(
            "Extracting details from productDetails_detailBullets_sections1..."
        )
        
        rows = detail_table.find_all('tr')
        
        for row in rows:
            th = row.find('th')
            td = row.find('td')
            
            if th and td:
                th_text = th.get_text(strip=True)
                td_text = td.get_text(strip=True)
                
                details[th_text] = td_text

    return details

This function, extract_additional_details, is actually used to pull out specifications for products and any other information off HTML-based content. It tries to find the two common formats on pages describing products. In the first format, a bulleted list usually makes up product details, where label is paired with its value. Function detects this list by finding an ID detailBulletsWrapper_feature_div, then the dictionary is filled with text, which was taken from the bolded labels and the values of those labels.

If the first structure is not available, then the function will look for a table structure with the ID productDetails_detailBullets_sections1. Here, every row would have a label (th tag) and a value (td tag). All of them are extracted and added to the dictionary. Finally, the output will be a dictionary with attribute labels as keys and their corresponding details as values, providing a structured way of accessing product data for further use.

Extracting Product Features with HTML Parsing

def extract_features(soup):
    """
    Extracts product features from a webpage.

    This function attempts to gather product features from 
    specific HTML structures found on product pages. First, 
    it checks for an expandable "About this item" section, 
    locating product details within a div container with the 
    class 'a-fixed-left-grid product-facts-detail'. It extracts 
    the label and value pairs found in left and right columns.

    If this section is unavailable, the function looks for a 
    table format with class 'a-normal a-spacing-micro'. It 
    iterates through the table rows, locating labels in 'td' 
    elements with class 'a-span3' and values in 'td' elements 
    with class 'a-span9'. The extracted features are returned 
    as key-value pairs in a dictionary.

    Parameters:
        soup (BeautifulSoup): Parsed HTML content of the webpage.
        
    Returns:
        dict: A dictionary containing feature labels as keys 
              and their corresponding values as entries.
    """
    features = {}

    # Condition 1: Click to expand and scrape the "About this item" section
    expand_button = soup.find(
        'span', 
        class_='a-expander-prompt'
    )
    
    if expand_button:
        # Simulate the click (if using Selenium or a similar tool, you'd actually click it)
        div_container = soup.find_all(
            'div', 
            class_='a-fixed-left-grid product-facts-detail'
        )
        
        for div in div_container:
            key_tag = div.find(
                'div', 
                class_='a-col-left'
            )
            value_tag = div.find(
                'div', 
                class_='a-col-right'
            )

            if key_tag and value_tag:
                key = key_tag.find(
                    'span', 
                    class_='a-color-base'
                ).get_text(strip=True)
                
                value = value_tag.find(
                    'span', 
                    class_='a-color-base'
                ).get_text(strip=True)
                
                if key and value:
                    features[key] = value

    # Condition 2: Scraping the features from the table structure
    table_container = soup.find(
        'table', 
        class_='a-normal a-spacing-micro'
    )
    
    if table_container:
        tbody = table_container.find('tbody')
        feature_rows = (
            tbody.find_all('tr') 
            if tbody 
            else []
        )

        for row in feature_rows:
            key_tag = row.find(
                'td', 
                class_='a-span3'
            )
            value_tag = row.find(
                'td', 
                class_='a-span9'
            )

            if key_tag and value_tag:
                key_span = key_tag.find(
                    'span', 
                    class_='a-size-base a-text-bold'
                )
                value_span = value_tag.find(
                    'span', 
                    class_='a-size-base'
                )

                key = (
                    key_span.get_text(strip=True) 
                    if key_span 
                    else None
                )
                
                value = (
                    value_span.get_text(strip=True) 
                    if value_span 
                    else None
                )

                if key and value:
                    features[key] = value

    return features

This function extract_features collects product feature information from structured HTML elements found on a product page. It starts trying to find details in an "About this item" expandable section. When this section is there, it will simulate a click action, handy when working with libraries such as Selenium and extract details from a grid layout in which the left column contains the feature name and the right column contains its value.

If "About this item" section isn't available, then it functions like a table format search gathering the feature names and their values from the rows. It stores every feature in a dictionary with the help of a key-value pair for access of some organized feature of the product.

Scraping Amazon Product Details with Ease

def scrape_amazon_product(url, user_agents):
    """
    Scrapes product information from an Amazon product page.
    
    This function retrieves various details about a product by 
    parsing the HTML content of its Amazon webpage. It begins by 
    fetching the page content using the provided URL and rotating 
    user agents to avoid detection. Key details are then extracted, 
    including the title, features, ratings, price, description, 
    technical specifications, and additional details. These are 
    returned in a structured format for easier access.

    Parameters:
        url (str): The URL of the Amazon product page to scrape.
        user_agents (list): A list of user agent strings for 
                            rotating during requests.

    Returns:
        tuple: A tuple containing product information, with 
               each element corresponding to a specific detail.
    """
    
    content = fetch_page_content(url, user_agents)
    
    soup = BeautifulSoup(content, "html.parser")

    title = extract_title(soup)

    features = extract_features(soup)

    ratings = extract_ratings(soup)

    prices = extract_prices(soup)

    description = extract_description(soup)

    technical_details = extract_technical_details(soup)

    additional_details=extract_additional_details(soup)

    return (
        title, 
        features, 
        ratings, 
        prices, 
        description, 
        technical_details,
        additional_details
    )

The scrape_amazon_product function is designed to gather comprehensive details from an Amazon product page. It starts by requesting the webpage's HTML content, with rotating user agents to prevent detection. The function then parses the HTML to collect specific details, such as the product title, features, ratings, price, description, and technical and additional details. Each piece of information is extracted by dedicated functions to ensure that each aspect is accurately retrieved.

All collected data is returned in a tuple, providing an organized structure that allows easy access to each piece of product information for further analysis or storage.

Setting Up a Database for Product Data Storage

def initialize_database(db_path):
    """
    Initializes a SQLite database for storing product data 
    and failed URLs.

    This function creates a connection to the specified SQLite 
    database file and initializes two tables if they do not 
    already exist. The first table, 'smart_watch_product_data', 
    is designed to store product information such as URL, date, 
    title, features, ratings, prices, description, technical 
    details, and additional details, with a unique primary key 
    based on the URL and date to prevent duplicate entries. 

    The second table, 'failed_urls', stores URLs that fail to 
    scrape successfully, along with the error message, timestamp, 
    and status. After creating the tables, the connection to 
    the database is closed to ensure data integrity.

    Parameters:
        db_path (str): The file path of the SQLite database file 
                       where tables will be created.

    Returns:
        None
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    # Create smart_watch_product_data table if it doesn't exist
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS smart_watch_product_data (
            url TEXT,
            date TEXT,
            title TEXT,
            features TEXT,
            ratings TEXT,
            prices TEXT,
            description TEXT,
            technical_details TEXT,
            additional_details TEXT,       
            PRIMARY KEY (url, date)
        )
    """)
    
    # Create failed_urls table with date and status columns
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS failed_urls (
            url TEXT,
            date TEXT,
            error_message TEXT,
            timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
            status INTEGER DEFAULT 0,
            PRIMARY KEY (url, date)
        )
    """)
    
    conn.commit()
    conn.close()

The function initialize_database would be used to get a structured database to put both successfully scraped product information and any failed URLs during scraping. It initializes a connection with the SQLite file specified as the database to the function then creates two tables if the tables do not exist. The smart_watch_product_data table is created for storing information of the all-inclusive products and every record can be uniquely identified using a combination of the url and date fields as the primary key so that duplicate entries cannot be created. Another table, Failed_urls is designed to keep track of those URLs which are throwing error messages while scraping, with error messages, timestamps and status field to track the failure. It ends up closing the connection once the tables are initialized to make sure data integrity and the database is ready for further operations.

Retrieving Unscraped URLs from the Database

def fetch_unscraped_urls_from_main_table(db_path):
    """
    Fetches URLs marked as unscraped from the main database table.

    This function connects to a specified SQLite database to retrieve 
    URLs that have not yet been scraped. It queries the 'smart_watches_links' 
    table for entries where the 'status' field is set to 0, indicating 
    these URLs are pending for scraping. Each result includes the URL 
    link and the associated date, which are then returned as a list 
    of tuples for easy access. After retrieving the data, the database 
    connection is closed to maintain data integrity.

    Parameters:
        db_path (str): The file path of the SQLite database to query.

    Returns:
        list: A list of tuples, each containing a URL and its associated 
              date, for URLs marked as unscraped.
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT link, date FROM smart_watches_links
        WHERE status = 0
    """)
    results = cursor.fetchall()
    conn.close()
    return [(url, date) for url, date in results]

The fetch_unscraped_urls_from_main_table function scans and retrieves URLs from the database that have not been scrapped yet. It connects to the SQLite database file passed as an argument and runs a query on the smart_watches_links table by selecting entries with a status of 0, which marks them as unscraped. The function retrieves both the URL and the date in each entry and returns it in a list of tuples. After gathering this information, the function securely closes the database connection to ensure data integrity.

Retrieving Pending URLs from the Failed Table

def fetch_unscraped_urls_from_failed_table(db_path):
    """
    Fetches URLs marked as unscraped from the failed URLs table.

    This function connects to the specified SQLite database to retrieve 
    URLs that failed to be scraped previously. It queries the 
    'failed_urls' table, specifically selecting entries with a 
    'status' field set to 0, indicating they are still pending for 
    re-scraping. Each result includes the URL and the date it was 
    added to the failed list, and these are returned as a list of 
    tuples. The database connection is closed after retrieval to 
    ensure data security and maintain database integrity.

    Parameters:
        db_path (str): The file path of the SQLite database to query.

    Returns:
        list: A list of tuples, each containing a URL and its 
              associated date, for URLs marked as unscraped 
              in the failed URLs table.
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        SELECT url, date FROM failed_urls
        WHERE status = 0
    """)
    results = cursor.fetchall()
    conn.close()
    return [(url, date) for url, date in results]

The fetch_unscraped_urls_from_failed_table function is used to collect URLs from the database that, on previous attempts, failed to get scraped. It connects to the SQLite database specified and then queries the failed_urls table, looking for entries where status is set to 0-that is, pending for re-scraping. The function fetches the URL and date of each entry and returns them as a list of tuples. Finally, the database connection is closed to maintain security and integrity of data as well as for keeping the database ready for further operations after fetching the data.

Updating URL Status in the Database

def update_url_status(db_path, url, date,
                       table_name, status=1):
    """
    Updates the scraping status of a specified URL in the database.

    This function connects to the specified SQLite database to update 
    the status of a given URL based on its table location. Depending on 
    the table provided ('smart_watches_links' or 'failed_urls'), it 
    updates the 'status' field for the URL and date combination, 
    marking it as scraped or pending for further action. By default, 
    the status is set to 1 (indicating successful scraping), but 
    this can be modified as needed. After executing the update, the 
    database connection is closed to ensure data integrity.

    Parameters:
        db_path (str): The file path of the SQLite database to update.
        url (str): The URL whose status needs to be updated.
        date (str): The date associated with the URL entry.
        table_name (str): The table to update ('smart_watches_links' 
                          or 'failed_urls').
        status (int, optional): The new status value (default is 1, 
                                indicating success).

    Returns:
        None
    """
    
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    
    if table_name == "smart_watches_links":
        query = """
            UPDATE smart_watches_links
            SET status = ?
            WHERE link = ? AND date = ?
        """
    else:  # failed_urls table
        query = """
            UPDATE failed_urls
            SET status = ?
            WHERE url = ? AND date = ?
        """
    
    cursor.execute(query, (status, url, date))
    conn.commit()
    conn.close()

The update_url_status function updates the scraping status of particular URLs in the database. It connects to the SQLite database specified and updates the status field for the given URL and date in either the smart_watches_links or failed_urls table. By default, it sets the status to 1, marking the URL as successfully scraped, but this can be customized. Upon closure, the function ensures that data received is securely stored after updating it by closing the database connection, thus allowing any further operations on the database without having inconsistencies.

Logging Failed URLs in the Database

def save_failed_url(db_path, url, 
                    date, error_message):
    """
    Saves a failed URL along with error details in the database.

    This function connects to the specified SQLite database to log 
    URLs that failed during the scraping process. It inserts a new 
    record into the `failed_urls` table, capturing the URL, date, 
    and a descriptive error message associated with the failure. The 
    status for this entry is set to `0`, indicating it is pending 
    for re-scraping. After the insertion, the function commits the 
    changes and closes the database connection to maintain data 
    integrity.

    Parameters:
        db_path (str): The file path of the SQLite database to update.
        url (str): The URL that failed to be scraped.
        date (str): The date associated with the failed scraping attempt.
        error_message (str): A message detailing the reason for the failure.

    Returns:
        None
    """
    
    conn = sqlite3.connect(db_path)
    
    cursor = conn.cursor()
    
    cursor.execute("""
        INSERT INTO failed_urls 
        (url, date, error_message, status)
        VALUES (?, ?, ?, 0)
    """, 
    (url, date, str(error_message)))
    
    conn.commit()
    
    conn.close()

The save_failed_url function is created to capture the error URLs in a database when they fail at the time of scraping. It first establishes a connection with the SQLite database specified and then inserts a new row into the failed_urls table with the URL itself, date of failure, and an error message in detail. This entry is indicated by the status 0, meaning it's still pending at the next attempted time to be re-scraped. The function commits changes and closes the database so that the data will safely lock up, and then the database will be ready for future operations after successfully saving information.

Storing Product Data in the Database


def save_data_to_db(db_path, product_data):
    """
    Saves product data into the database.

    This function connects to the specified SQLite database and 
    inserts a new record into the `smart_watch_product_data` table. 
    It takes a tuple containing various details about a product, 
    such as URL, date, title, features, ratings, prices, 
    description, technical details, and additional details. After 
    executing the insertion, the function commits the changes to 
    the database and closes the connection to ensure that the 
    data is securely stored and the database is ready for future 
    operations.

    Parameters:
        db_path (str): The file path of the SQLite database to update.
        product_data (tuple): A tuple containing product details 
                              in the following order:
                              (url, date, title, features, 
                               ratings, prices, description, 
                               technical_details, additional_details).

    Returns:
        None
    """
    
    conn = sqlite3.connect(db_path)
    
    cursor = conn.cursor()
    
    cursor.execute("""
        INSERT INTO smart_watch_product_data 
        (url, date, title, features, ratings, 
        prices, description, technical_details,additional_details)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, 
    product_data)
    
    conn.commit()
    
    conn.close()

The save_data_to_db function will save the product data to the database. After establishing a connection to the SQLite database specified, it saves a new record in the smart_watch_product_data table. The function accepts a tuple of necessary information about the product, including URL, date of scraping, title, features, ratings, prices, description, technical details, and others. It commits the changes after the execution of the insert operation so that data might be safely saved and would close the database connection. In this way, this step ensures that data would remain sound during any such operation of data insertion or retrieval in the database.

Comprehensive URL Scraping Functionality

def scrape_urls(
    db_path, 
    urls_to_scrape, 
    table_name, 
    user_agents
):
    """
    Scrapes product information from a list of URLs.

    This function iterates through a list of URLs and attempts to scrape 
    product data for each one. For each URL, it introduces a random delay 
    before initiating the scrape to avoid overwhelming the server. The 
    scraped data, which includes the title, features, ratings, prices, 
    description, technical details, and additional details, is then saved 
    to the database. If the scrape is successful, the URL status is updated 
    in the specified table. In case of an error during scraping, the URL 
    and error message are recorded, and the status is updated accordingly.

    Parameters:
        db_path (str): The file path of the SQLite database.
        urls_to_scrape (list of tuples): A list containing tuples of URLs 
                                           and corresponding dates to scrape.
        table_name (str): The name of the table to update with the URL 
                          status (either 'smart_watches_links' or 'failed_urls').
        user_agents (list): A list of user-agent strings to use for scraping.

    Returns:
        None
    """
    for url, date in urls_to_scrape:
        try:
            delay = get_random_delay()
            
            print(
                f"Waiting {delay:.2f} seconds before "
                f"scraping next URL..."
            )
            time.sleep(delay)
            
            print(
                f"Scraping: {url} for date: {date}"
            )
            
            title, features, ratings, prices, \
            description, technical_details, \
            additional_details = scrape_amazon_product(
                url, 
                user_agents
            )
            
            # Convert data to strings for storage
            product_data_str = (
                url,
                date,
                title,
                str(features),
                str(ratings),
                str(prices),
                str(description),
                str(technical_details),
                str(additional_details)
            )
            
            # Save the scraped data
            save_data_to_db(
                db_path, 
                product_data_str
            )
            
            # Update URL status to scraped
            update_url_status(
                db_path, 
                url, 
                date, 
                table_name
            )
            
            print(
                f"Successfully scraped: {url} "
                f"for date: {date}"
            )
        
        except Exception as e:
            print(
                f"Failed to scrape {url} for date {date}: {e}"
            )
            
            if table_name == "smart_watches_links":
                # Only save to failed_urls if from the main table
                save_failed_url(
                    db_path, 
                    url, 
                    date, 
                    str(e)
                )
            
            else:
                # If already from failed_urls, update 
                # status to mark as permanent failure
                update_url_status(
                    db_path, 
                    url, 
                    date, 
                    "failed_urls", 
                    status=1
                )

Function scrape_urls is developed to scrape product information on given URLs. Its input parameters are a database path, the list of the URLs together with respective dates, the name of the table for update and the collection of user agents. For each URL, this function waits for some random delay to avoid server blocking and then it proceeds to scrape the product details using the function scrape_amazon_product. Then format all these data into strings, namely title, features, ratings, prices, description, technical details etc and store in the database. Later, it marked the status of the URL as "completed" after successful scraping. If an error occurs at any stage of this process, the function will catch its message and record the URL in a failed_urls table for further review. This structured approach enables the handling of both successful attempts to scrape and failed ones with optimal reliability to the web scraping workflow process.

Comprehensive Web Scraping Orchestration

def main():
    """
    Main function to execute the web scraping workflow.

    This function orchestrates the entire web scraping process by 
    initializing the database, fetching URLs from both the main 
    and failed URLs tables, and performing the scraping 
    operations. It first sets up the database by calling the 
    `initialize_database` function, handling any potential 
    operational errors related to database schema. Then, it 
    retrieves URLs that have not yet been scraped from the main 
    table and attempts to scrape product data from those URLs. 
    If there are any URLs recorded in the `failed_urls` table, 
    the function will also attempt to scrape those URLs as well. 
    Finally, it notifies the user when the scraping process is 
    complete.

    Returns:
        None
    """
    db_path = (
        "amazon-timeseries-webscraping.db"
    )
    user_agents = load_user_agents()
    
    try:
        # Initialize database tables
        initialize_database(db_path)
    except sqlite3.OperationalError as e:
        if "duplicate column" not in str(e).lower():
            raise e
    
    # Step 1: Scrape from main table
    print(
        "Starting to scrape URLs from main table..."
    )
    urls_from_main = fetch_unscraped_urls_from_main_table(
        db_path
    )
    print(
        f"Found {len(urls_from_main)} unscraped URLs "
        f"in main table"
    )
    scrape_urls(
        db_path, 
        urls_from_main, 
        "smart_watches_links", 
        user_agents
    )
    
    # Step 2: Scrape from failed_urls table
    print(
        "\nStarting to scrape URLs from failed_urls table..."
    )
    failed_urls = fetch_unscraped_urls_from_failed_table(
        db_path
    )
    print(
        f"Found {len(failed_urls)} unscraped URLs in "
        f"failed_urls table"
    )
    scrape_urls(
        db_path, 
        failed_urls, 
        "failed_urls", 
        user_agents
    )
    
    print("\nScraping process completed!")

This basic center function basically just performs the web scraping workflow, which at first starts off by taking a path to a database and loading up a list of user agents that it will make use of for actually conducting the scrape. After it's finished initializing the database via the calling initialize_database it goes ahead and pulls down all the URLs for scrapes from the main table, along with pulling from failed_urls. This then calls the scrape_urls function on each batch of URLs with the attempt to fetch their product data. The work flow is designed to catch database errors so that whatever operational issues occur, this can be logged without stopping the rest of the processes. Finally, the function returns notifying the user that all its operations concerning scrapings are done and concludes an organized structure for effective scraping task management. This makes it appropriate for users who may not have an extremely extensive background in programming since the flow is logical and easy to follow.

Main Program Execution Control

if __name__ == "__main__":
    """
    Entry point for the web scraping application.

    This block of code checks if the current script is being run 
    as the main program. If so, it calls the `main` function to 
    initiate the web scraping process. This allows the script to 
    be executed directly, while also ensuring that the web 
    scraping operations are only performed when intended, such 
    as when running the script from the command line. It serves 
    as a standard practice in Python programming to organize 
    the execution flow of the program.

    Returns:
        None
    """
    main()

The if __name__ == "__main__": block, actually, constitutes the entry point to the web scraping application, because it is what checks whether or not the script is called directly, rather than imported elsewhere as a module. If the current script happens to be the main one, then it calls the main function, thereby initiating the whole web-scraping process. This design allows clear isolation of the execution logic as well as other possible imports clearly organized, hence easier code maintenance and readability. This convention ensures the scrap operation of the program only runs when it will actually run the script so will be friendly and user friendly, even for people not trained in programming. This structure is commonly found in Python and also helps build robust applications, which are also modular in their nature.

Conclusion

This guide gives an in-depth overview of developing a well-structured and efficient Amazon smartwatch web scraping system for time-series analysis. By following a two-stage process, the system is able to gather data accurately—first by web scraping product links and then web scraping detailed product information using BeautifulSoup and Playwright.

The use of SQLite for storage allows for efficient tracking of product information over time, and powerful error-handling processes such as failed URL logging and retry procedures make the scraping process more reliable. Moreover, using user-agent rotation, random delays, and rate limiting makes the web scraping process remain ethical and sustainable.

Connect with Datahut for top-notch web scraping services that bring you the valuable insights you need hassle-free.

FAQ SECTION

1. Is it legal to scrape Amazon smartwatch data for time-series analysis?

Web scraping legality depends on Amazon's terms of service and local laws. We recommend scraping only publicly available data and complying with ethical guidelines to avoid legal issues.

2. What kind of smartwatch data can be scraped from Amazon?

You can scrape product details like price history, reviews, ratings, stock availability, and bestseller rankings to perform time-series analysis on pricing trends and consumer demand.

3. How often should I scrape smartwatch data for an accurate time-series analysis?

The frequency depends on your analysis goals. For tracking price changes or stock availability, scraping daily or hourly may be useful. For long-term trends, weekly or monthly scraping might be sufficient.

4. Can you provide an automated solution for scraping Amazon smartwatch data?

Yes, as a web scraping service provider, we offer customized solutions for automated data extraction, ensuring efficient and structured data collection for your time-series analysis needs.

How to Scrape Amazon's Smart watch Data for Time-Series Analysis