Introduction
Web scraping is a well-established technique that enables programmers to fetch large volumes of data from websites automatically. It acts like an assistant robot that quickly gathers information from web pages, saving thousands of hours of manual data collection. In the context of e-commerce, web scraping can provide valuable insights into product offerings, pricing strategies, and market trends.
Web scraping is of particular importance in today's data-driven world. It helps businesses monitor competitors' prices and products, gather market intelligence, acquire leads, and conduct research and analysis. Done responsibly, web scraping automates data collection without overwhelming the server with requests, while respecting the website's terms of service and robots.txt file.
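As a quick, hedged illustration of that last point, Python's standard library can check a site's robots.txt before any scraping begins; the path passed to can_fetch() below is just an example, and the answer depends on Lenskart's live robots.txt:
from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt rules
robots = RobotFileParser()
robots.set_url("https://www.lenskart.com/robots.txt")
robots.read()

# Ask whether a generic crawler may fetch a given path (example path only)
print(robots.can_fetch("*", "https://www.lenskart.com/eyeglasses"))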
Lenskart, established in 2010, is an online eyewear retailer and one of the largest eyewear companies in India. Its wide inventory includes glasses, sunglasses, and contact lenses. Because the website lists detailed information about its products, such as frames, lenses, and accessories, we can extract that information with web scraping.
The learnings from this web scraping project can be useful to market researchers digging into the eyewear industry, competitors looking to get a sense of what is happening at Lenskart around product strategy, data scientists building recommendation systems for eyewear, and business analysts tracking pricing trends in online retail.
The Scraping Process
Our web scraping project obtains detailed information about Lenskart's product lineup through a two-step approach. First, we scrape products and their basic data from Lenskart's API endpoints using the `requests` library, which helps us send HTTP requests and receive responses from the server. Scraping an API is often faster and more efficient than scraping the rendered web page because it reads from the data source directly.
Second, we scrape the website itself for extra product information that the API does not provide. For this we use browser automation with Playwright and HTML parsing with BeautifulSoup, which lets us interact with dynamic web pages and pull data from the rendered HTML. This step captures data that is loaded dynamically or only appears after interaction.
Tools of the Trade
Our scraping toolkit consists of several powerful Python libraries, each serving a specific purpose in the data collection process. Requests is a simple yet powerful HTTP library for Python. It abstracts the complexities of making HTTP requests, allowing us to easily send GET, POST, and other types of HTTP requests, handle cookies and sessions, deal with query parameters and request headers, and parse JSON responses. In our project, Requests is used primarily for calls to Lenskart's API endpoints, fetching the initial product data and URLs.
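As a minimal sketch of that workflow (the endpoint, headers, and parameters below are placeholders, not real Lenskart values), a Requests call typically looks like this:
import requests

url = "https://example.com/api/products"    # placeholder endpoint
headers = {"User-Agent": "Mozilla/5.0"}     # placeholder header
params = {"page": 0, "page-size": 15}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()   # raise an exception for 4xx/5xx status codes
data = response.json()        # decode the JSON body into Python objects
print(type(data))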
Playwright is a modern browser automation library. It allows us to control Chrome, Firefox, and WebKit using a single API. It supports cross-browser automation, can interact with dynamic, JavaScript-heavy websites, automatically waits for elements to be ready, and supports network interception and modification. In our project, we use Playwright to navigate Lenskart's website, interact with elements where needed, and render JavaScript-loaded content for scraping.
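A minimal Playwright sketch, using the synchronous API and a placeholder URL, shows the basic pattern of launching a headless browser and reading the rendered HTML:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # run Chromium with no visible window
    page = browser.new_page()
    page.goto("https://example.com", wait_until="domcontentloaded")
    html = page.content()                        # HTML after JavaScript has run
    browser.close()

print(len(html))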
Beautiful Soup is a Python library for parsing HTML and XML documents. It provides a Pythonic interface for navigating, searching, and modifying the parse tree. In our project, BeautifulSoup is used to parse the HTML content rendered by Playwright, navigate the DOM tree, and extract specific data points from HTML elements.
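For example, given a small hand-written HTML snippet (not Lenskart's real markup), BeautifulSoup can pull out specific elements with CSS selectors or tag searches:
from bs4 import BeautifulSoup

html = '<div class="product"><h1>Sample Frame</h1><span class="price">999</span></div>'
soup = BeautifulSoup(html, "html.parser")

name = soup.select_one("div.product > h1").text                  # CSS selector lookup
price = soup.find("span", class_="price").get_text(strip=True)   # tag/attribute lookup
print(name, price)   # Sample Frame 999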
SQLite is a C library that provides a small, fast, self-contained, full-featured, and highly reliable database engine. The database can be accessed using a nonstandard variant of the SQL query language. SQLite requires zero configuration: no server setup or administration is needed. The SQLite file format is cross-platform, and it is very efficient for read-heavy operations. We use SQLite to store the scraped data in a structured format that allows easy querying and analysis.
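A tiny sqlite3 sketch shows the zero-configuration workflow: connect to a file (created on first use), run SQL, and close. The file and table names here are illustrative only:
import sqlite3

conn = sqlite3.connect("example.db")   # creates the file if it does not exist
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO products VALUES (?, ?)", (1, "Sample Frame"))
conn.commit()

print(cursor.execute("SELECT * FROM products").fetchall())
conn.close()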
Data Refinement
Following the scraping process, the raw data usually needs cleaning and organizing. We may use tools such as OpenRefine and Python for this. OpenRefine (formerly Google Refine) is a powerful tool designed for working with messy data, making the cleaning, transformation, and enrichment of large datasets fast and easy. It includes clustering algorithms for cleaning up inconsistent data, GREL (General Refine Expression Language) for complex data transformations, and the ability to link and extend data with web services and external data sources.
Python, especially with pandas, offers rich possibilities for further data manipulation and analysis. Using pandas, we can handle missing data, reshape or pivot datasets, merge and join datasets, and perform complex transformations and calculations on the data.
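As a hedged illustration of that pandas side of the refinement, the snippet below drops duplicate rows and fills missing values on a toy DataFrame; the column names are invented for the example:
import pandas as pd

df = pd.DataFrame({
    "product_name": ["Frame A", "Frame A", "Frame B"],
    "lenskart_price": [1200, 1200, None],
})

df = df.drop_duplicates()                              # remove repeated rows
df["lenskart_price"] = df["lenskart_price"].fillna(0)  # handle missing prices
print(df)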
By combining these scraping techniques with data refinement tools, we can create a comprehensive dataset of the products Lenskart has to offer, ready for analysis and insights. This project showcases how web scraping can extract useful business intelligence from data found on the web.
API Scraping
Import Section
import requests
import pandas as pd
import sqlite3
import logging
import json
This section imports all the libraries used by the Lenskart scraper. The `requests` library is a powerful tool for making HTTP requests in Python; it simplifies sending HTTP/1.1 requests and handles many of the complexities behind the scenes. The `pandas` library is imported under the alias `pd`. Pandas is an open-source data analysis and manipulation tool built on top of Python; it is fast, powerful, flexible, and particularly useful for working with structured data like the product information we are scraping. The `sqlite3` module provides a lightweight disk-based database that doesn't require a separate server process; it can create a new database and store persistent data in a single file. The `logging` module, also part of Python's standard library, is a versatile framework for emitting log messages from Python programs. Finally, the `json` module is used to encode and decode JSON data, the most widely used data-exchange format on the web.
Logging Configuration
# Configure logging to display information messages
logging.basicConfig(level=logging.INFO)
This line sets up basic configuration for the logging system. The `logging.basicConfig()` function creates a `StreamHandler` with a default `Formatter` and attaches it to the root logger. The argument `level=logging.INFO` sets the logger's threshold to INFO, so all events at or above INFO level are tracked and anything below is ignored. In the context of this scraper, this provides informational messages about the scraping process, which is especially useful for monitoring and debugging if problems crop up.
send_request Function
def send_request(url, headers):
"""
Send an HTTP GET request to the specified URL and handle the response.
This function manages the HTTP request process, including error handling and
logging. It prints the URL being accessed for progress tracking and handles
any network-related exceptions that might occur during the request.
Args:
url (str): The complete URL to send the request to. This should be a valid
Lenskart API endpoint.
headers (dict): A dictionary of HTTP headers to include in the request.
Should contain necessary authentication cookies.
Returns:
str: The text content of the response if successful, None otherwise.
Note:
- The function logs errors if the request fails
- It uses the requests library's raise_for_status() to catch HTTP errors
- Progress tracking is implemented via print statements
"""
try:
response = requests.get(url, headers=headers)
response.raise_for_status()
print(f"URL: {url}")
return response.text
except requests.RequestException as e:
logging.error(f"Request failed for URL {url}: {e}")
return None
The `send_request` function is the heart of the scraper, as it handles communication with the Lenskart API. It takes two arguments: `url`, the endpoint we want to send the request to, and `headers`, a dictionary containing any necessary HTTP headers, such as authentication tokens or cookies. The function uses a try-except block to manage potential network-related exceptions. It attempts a GET request to the given URL by calling `requests.get()`. On success, it prints the URL, which is helpful for following progress, and returns the response text. If a `RequestException` occurs (the base class for most request-related exceptions), the function logs an error message and returns None. This error handling makes the scraper more robust, allowing it to continue operating even if some requests fail.
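One hardening worth considering, though it is not part of the original function, is a request timeout so a stalled connection cannot hang the scraper indefinitely. A hedged variant might look like this:
def send_request_with_timeout(url, headers, timeout=30):
    """Same idea as send_request, but gives up after `timeout` seconds."""
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        logging.error(f"Request failed for URL {url}: {e}")
        return None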
parse_json_data Function
def parse_json_data(json_data, url):
"""
Parse JSON response data and extract the product list.
This function handles the JSON parsing process, navigating through the response
structure to find and extract the list of products. It includes error handling
for JSON decoding issues and logging of any parsing failures.
Args:
json_data (str): The raw JSON string received from the API response.
url (str): The URL from which the data was fetched, used for error reporting
and logging purposes.
Returns:
list: A list of dictionaries, each containing data for a single product.
Returns an empty list if parsing fails.
Note:
- The function expects a specific JSON structure with 'result' and 'product_list'
- It handles JSON decoding errors gracefully
- Failed parsing attempts are logged for debugging
"""
try:
data = json.loads(json_data)
return data.get("result", {}).get("product_list", [])
except json.JSONDecodeError as e:
print(f"Failed to decode JSON for URL: {url}")
logging.error(f"JSON decode error for URL {url}: {e}")
return []
The `parse_json_data` function extracts useful information from the API's JSON response. It requires two parameters: `json_data`, the raw JSON string received from the API, and `url`, which is used for error reporting. The function employs a try-except block in case an error occurs while decoding the JSON string. It parses the JSON string into a Python dictionary with `json.loads()`, then navigates the resulting structure to extract the product list, which it expects to find under `result` and then `product_list`. If the API response were structured differently, the function would need to be adapted. If a `JSONDecodeError` occurs (which is possible if the API returns malformed JSON), the function prints an error message, logs the error, and returns an empty list. This ensures that if one page of results is corrupted, the scraper can continue working through the remaining pages.
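To make the expected structure concrete, here is a toy response in the shape the function assumes; a real Lenskart response contains far more fields:
sample_json = '{"result": {"product_list": [{"id": 1, "brand_name": "Sample"}]}}'

print(parse_json_data(sample_json, "https://example.com/api"))
# [{'id': 1, 'brand_name': 'Sample'}]

# Malformed JSON is handled gracefully and yields an empty list
print(parse_json_data("not valid json", "https://example.com/api"))
# []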
extract_product_details Function
def extract_product_details(product):
"""
Extract and format relevant details from a product dictionary.
This function processes a raw product dictionary from the API response,
extracting specific fields and formatting them into a consistent structure.
It handles missing data gracefully and performs necessary transformations
on the raw data.
Args:
product (dict): A dictionary containing raw product information from the API.
Expected to contain various nested fields with product details.
Returns:
dict: A flattened and standardised dictionary containing extracted product details.
Keys include:
- Basic info: id, url, product_name, classification
- Physical attributes: frame_color, frame_size, frame_width
- Commercial info: brand, model_no, prices
- Metrics: wishlist_count, purchase_count, ratings
Note:
- Handles missing data by using .get() method with None as default
- Specifically processes nested price information for market and Lenskart prices
- Maintains consistent key names for database compatibility
"""
prices = product.get("prices", [])
market_price = next((price["price"] for price in prices if price["name"] == "Market Price"), None)
lenskart_price = next((price["price"] for price in prices if price["name"] == "Lenskart Price"), None)
return {
"id": product.get("id"),
"url": product.get("product_url"),
"product_name": product.get("searchProductName"),
"classification": product.get("classification"),
"frame_color": product.get("color"),
"frame_size": product.get("size"),
"frame_width": product.get("width"),
"suited_for": product.get("suited_for"),
"brand": product.get("brand_name"),
"brand_collection": product.get("tags"),
"model_no": product.get("model_name"),
"market_price": market_price,
"lenskart_price": lenskart_price,
"wishlist_count": product.get("wishlistCount"),
"purchase_count": product.get("purchaseCount"),
"average_rating": product.get("avgRating"),
"total_ratings": product.get("totalNoOfRatings"),
"quantity": product.get("qty")
}
The actual data processing occurs in the `extract_product_details` function. It takes a single product dictionary and returns a new dictionary with specific fields extracted and formatted. Its role is to translate raw API data into a consistent structure for every product, handling various data types and nested structures within the product dictionary. For instance, it picks out the market price and the Lenskart price from a list of prices, using `next()` with a generator expression to locate the correct price entry. The function relies heavily on the `.get()` method, which accesses a value safely from a dictionary and returns None when the key does not exist instead of raising an error, making the function far less sensitive to changes in the API's data structure. The resulting dictionary combines basic product information (such as ID and name), physical attributes (such as color and size), commercial information (such as brand and price), and metrics (such as ratings and purchase count).
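The price-extraction trick is easiest to see in isolation. The numbers below are made up, but the structure matches what the function expects from the `prices` field:
prices = [
    {"name": "Market Price", "price": 1500},
    {"name": "Lenskart Price", "price": 1200},
]

# next() returns the first matching entry, or None when nothing matches
market_price = next((p["price"] for p in prices if p["name"] == "Market Price"), None)
special_price = next((p["price"] for p in prices if p["name"] == "Special Price"), None)
print(market_price, special_price)   # 1500 None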
scrape_urls Function
def scrape_urls(urls_to_scrape, headers):
"""
Orchestrate the scraping process for multiple URLs across different categories.
This function coordinates the entire data collection process, iterating through
multiple sets of URLs (typically different product categories), sending requests,
parsing responses, and aggregating the results.
Args:
urls_to_scrape (list): A list of lists, where each sublist contains URLs for
a specific product category. This structure allows for
organised scraping of different product types.
headers (dict): HTTP headers to include in the requests, typically containing
authentication cookies and other necessary header fields.
Returns:
list: A comprehensive list of dictionaries, where each dictionary contains
the detailed information for a single product.
Note:
- Processes URLs in nested loops: outer for categories, inner for pagination
- Prints separators between URL sets for better progress tracking
- Uses map() for efficient processing of product lists
"""
all_product_data = []
for url_set in urls_to_scrape:
for url in url_set:
json_data = send_request(url, headers)
if json_data:
product_list = parse_json_data(json_data, url)
all_product_data.extend(map(extract_product_details, product_list))
print("=" * 80)
return all_product_data
The `scrape_urls` function acts as the main driver of the scraping process, orchestrating the other functions to gather data from multiple URLs. It has two parameters: `urls_to_scrape`, a list of lists containing URLs for different product categories, and `headers`, the HTTP headers used when making requests. The function employs nested loops, iterating over every URL set (one per product category) and then over every URL in that set. For each URL, it calls `send_request` to fetch the data, `parse_json_data` to retrieve the list of products, and then uses `map()` with `extract_product_details` to process every product in the list; `map()` applies `extract_product_details` to each product without an explicit inner loop. The processed products are accumulated into a single list with `extend()`. The function also prints a separator line between URL sets, which helps track progress visually when running the scraper.
generate_urls Function
def generate_urls(base_url, start_page, end_page):
"""
Generate a list of paginated URLs for a given product category.
This function creates a series of URLs for paginated API endpoints, allowing
for systematic collection of data across multiple pages of results.
Args:
base_url (str): The base API URL for a specific product category.
start_page (int): The page number to start from (0-based indexing).
end_page (int): The final page number to generate (inclusive).
Returns:
list: A list of formatted URLs covering the specified page range.
Note:
- Uses a fixed page size of 15 items per page
- Generates URLs inclusive of both start_page and end_page
- Page numbers in URLs are 0-based to match API expectations
"""
return [
f"{base_url}?page-size=15&page={page}"
for page in range(start_page, end_page + 1)
]
The `generate_urls` utility function creates a list of paginated URLs for a given product category. It accepts three parameters: `base_url`, the base API URL for a specific product category; `start_page`, the page number to start from; and `end_page`, the final page number to generate. It uses a list comprehension to build the URLs efficiently, appending query parameters to the base URL: `page-size` is fixed at 15, so each page returns 15 items, and `page` is set to the current page number in the range. The call to Python's `range()` produces the sequence of numbers from `start_page` up to and including `end_page`. This function is handy because it lets the scraper walk through all the pages of a product category without each URL having to be specified manually.
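A quick usage example shows the output shape, using only three pages to keep it short:
urls = generate_urls("https://api-gateway.juno.lenskart.com/v2/products/category/8416", 0, 2)
for u in urls:
    print(u)
# https://api-gateway.juno.lenskart.com/v2/products/category/8416?page-size=15&page=0
# https://api-gateway.juno.lenskart.com/v2/products/category/8416?page-size=15&page=1
# https://api-gateway.juno.lenskart.com/v2/products/category/8416?page-size=15&page=2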
save_to_database Function
def save_to_database(df, db_name, table_name):
"""
Save the scraped data to a SQLite database.
This function handles the database operations for storing the collected data,
creating or replacing a table with the scraped product information.
Args:
df (pandas.DataFrame): The DataFrame containing the scraped product data,
pre-processed and ready for storage.
db_name (str): The filename for the SQLite database.
table_name (str): The name of the table to create or replace in the database.
Note:
- Uses 'replace' mode, which will overwrite any existing table
- Implements proper connection handling with try/finally
- Logs any database errors that occur during the process
- Doesn't create an index, which might be needed for larger datasets
"""
try:
conn = sqlite3.connect(db_name)
df.to_sql(table_name, conn, if_exists='replace', index=False)
conn.commit()
except sqlite3.Error as e:
logging.error(f"Database error: {e}")
finally:
conn.close()
The `save_to_database` function takes care of the final step in the pipeline: persisting the gathered data. It accepts three parameters: `df`, a pandas DataFrame holding the scraped product data; `db_name`, the filename for the SQLite database; and `table_name`, the name of the table to create or replace. The function uses a try-except-finally block to make sure database errors are caught and the connection is closed no matter what happens. Inside the try block, it opens a connection to the SQLite database with `sqlite3.connect()`, then uses pandas' `to_sql()` method to save the DataFrame as a table; the `if_exists='replace'` parameter tells pandas to overwrite the table if it already exists. After saving the data, it commits the transaction to ensure all changes are persisted. If a `sqlite3.Error` occurs at any point, it logs the error, and in the `finally` block it closes the database connection. This provides a clean way to persist the scraped data, making it available later for reuse or analysis.
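Once the table has been written, it is easy to verify the stored data by reading it back; a small hedged check using this project's database and table names might look like:
conn = sqlite3.connect('lenskart_data.db')
check_df = pd.read_sql_query("SELECT COUNT(*) AS total_rows FROM eyeglasses_data", conn)
print(check_df)
conn.close()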
main Function
def main():
"""
Orchestrate the entire data collection and storage process.
This function serves as the entry point for the scraping operation:
1. Defines URL sets for different product categories
2. Sets up necessary HTTP headers
3. Initiates the scraping process
4. Processes the collected data (removes duplicates)
5. Saves the results to a SQLite database
The function handles four main product categories:
- Men's eyeglasses (180 pages)
- Women's eyeglasses (75 pages)
- Unisex eyeglasses (19 pages)
- Kids' eyeglasses (10 pages)
Note:
- URLs and page numbers are hardcoded and may need updates
- Uses a specific cookie for authentication
- Duplicates are removed based on product URL
"""
urls_to_scrape = [
generate_urls("https://api-gateway.juno.lenskart.com/v2/products/category/8416", 0, 179),
generate_urls("https://api-gateway.juno.lenskart.com/v2/products/category/8080", 0, 74),
generate_urls("https://api-gateway.juno.lenskart.com/v2/products/category/8427", 0, 18),
generate_urls("https://api-gateway.juno.lenskart.com/v2/products/category/8415", 0, 9)
]
headers = {
'Cookie': '__cf_bm=ZHdfO7vjMIeobbZuVoSXRJQQSzbm_sijggAsBbrZ96Y-1721976601-1.0.1.1-FXYm7d3KOGqXTPV75UMxGDcx16zKzICnsDem5DGsh74yzBDpFtGUP8l0gemYQecigWy0V7c8UIaj.VjRINeRWg; __cfruid=05eba9eca6fbad9a08b6fddc8abb29ff60d36eef-1721975314'
}
all_product_data = scrape_urls(urls_to_scrape, headers)
df = pd.DataFrame(all_product_data)
df.drop_duplicates(subset=['url'], inplace=True)
save_to_database(df, 'lenskart_data.db', 'eyeglasses_data')
logging.info("Data has been successfully scraped and saved to 'lenskart_data.db'")
The `main` function is the entry point and coordinator, bringing all the other functions together to run the whole scraping process. It begins by declaring `urls_to_scrape`, a list of URL sets for each product category. Each set is generated by the `generate_urls` function with a specific page range per category: 180 pages for men's eyeglasses, 75 for women's, 19 for unisex, and 10 for kids'. It also declares `headers` with a cookie for authentication. The function then calls `scrape_urls` to perform the actual scraping with those URLs and headers. The collected data is converted into a pandas DataFrame, and any duplicate entries based on product URL are removed with the `drop_duplicates` method. The cleaned data is then saved to the SQLite database 'lenskart_data.db' in a table called 'eyeglasses_data' using `save_to_database`. Finally, it logs a success message once all operations complete. This function ties the entire web scraping process, from URL generation to data storage, into a single call.
Script Execution
if __name__ == "__main__":
main()
This conditional statement is one of the most common idioms in Python for deciding whether a script is being run directly or imported as a module. When you run a Python script directly, Python sets the special variable `__name__` to `"__main__"`. If the script is being imported as a module, `__name__` is set to the name of the module. By checking whether `__name__` equals `"__main__"`, we make sure the `main()` function is called only when the script is run directly and not when it is imported into another program. This way, the script can be both imported and executed: if the functions from this script are required elsewhere, they can be imported without automatically running the scraping process, while running the script directly performs the full scraping operation. This encourages code reuse and is, in general, a best practice in Python programming.
Website Scraping
Imports and Configuration
import asyncio
import random
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
import sqlite3
This section imports the libraries needed for the website scraper: asyncio for asynchronous programming, random for random selection, BeautifulSoup for HTML parsing, Playwright for browser automation, and sqlite3 for database operations. Two constants are also used throughout the script: WAIT_TIMEOUT, the maximum wait time while loading pages, and TECHNICAL_DETAILS_WAIT, the delay after clicking to reveal technical details.
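The two constants are not shown in the snippet above; a minimal sketch with assumed values (the real values may differ) would be:
# Assumed values for illustration; tune them to the site's actual load times
WAIT_TIMEOUT = 60000            # maximum page-load wait for Playwright, in milliseconds
TECHNICAL_DETAILS_WAIT = 2000   # pause after clicking the technical details link, in milliseconds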
load_user_agents Function
def load_user_agents(file_path):
"""
Load a list of user agents from a specified file for web scraping.
This function attempts to read user agents from a text file where each line
represents a different user agent string. These user agents are used to
rotate request headers, helping to avoid detection as a bot.
Args:
file_path (str): The path to the text file containing user agents.
Each user agent should be on a separate line.
Returns:
list: A list of strings, each string being a user agent.
Returns an empty list if the file is not found.
Note:
- The function strips whitespace from each line
- If the file is not found, it prints an error message and returns an empty list
"""
try:
with open(file_path, 'r') as file:
user_agents = file.readlines()
return [agent.strip() for agent in user_agents]
except FileNotFoundError:
print(f"Error: The file '{file_path}' was not found.")
return []
The `load_user_agents` function is a pillar of our web scraping strategy, providing a large array of user agent strings. In the context of web scraping, a user agent is similar to a digital ID card, letting sites know what kind of browser or device is making the request. Websites use that information commonly to customize content or, occasionally, to look for and stop automated scrapers. By using different user agents, we can make our scraper look more like various real users instead of a single bot. This function opens a text file specified by `file_path` and reads it line by line. Each line is expected to contain one user agent string. As it stands, the function removes all whitespace from the start and end of every line so that we get clean, usable user agent strings. If the specified file isn't found, the function does not crash the program. It prints an error message and returns an empty list instead. This elegant error handling lets our scraper keep running in case of an error with the user agent file, but might be slightly less effective at evading detection. It also prepares us nicely to rotate through different "identities" for each request we make by returning a list of user agents.
get_random_user_agent Function
def get_random_user_agent(user_agents):
"""
Select a random user agent from a provided list of user agents.
This function is used to rotate user agents for web scraping, which helps
to avoid detection and blocking by websites. It ensures that each request
can potentially use a different user agent.
Args:
user_agents (list): A list of strings, each string being a valid user agent.
This list should not be empty.
Returns:
str: A randomly selected user agent string from the provided list.
Note:
- This function assumes the input list is not empty
- If you need to handle empty lists, additional error checking should be added
"""
return random.choice(user_agents)
The `get_random_user_agent` function works hand in hand with `load_user_agents`, enhancing the scraper's ability to mimic the way humans browse. Once we have our list of user agents from `load_user_agents`, we need a way to choose one randomly for every request, and that is exactly what this function does. It takes the list of user agents as its argument and uses Python's `random.choice` function to pick one at random. This randomization is a major part of our strategy for avoiding detection: if we used the same user agent for every request, the website could easily notice that all the requests came from the same "browser." By changing the user agent for each request, we make the scraper's behavior much less predictable and more human-like, as if we were constantly changing disguises while collecting data. Each time we make a request to the website, we call this function to get a fresh, random user agent. This simple yet effective technique significantly improves our chances of avoiding detection and potential blocking by the website's anti-bot measures.
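Putting the two helpers together, each request can pick a fresh identity. The fallback default below is not in the original code; it is added here only to illustrate handling an empty list:
user_agents = load_user_agents('user_agent.txt')

# Hypothetical fallback so the scraper still has an identity if the file is missing
if not user_agents:
    user_agents = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]

for _ in range(3):
    print(get_random_user_agent(user_agents))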
parse_product_name Function
def parse_product_name(soup):
"""
Extract the product name from the parsed HTML content of a Lenskart product page.
This function looks for a specific HTML element that contains the product name
using a CSS selector. It handles cases where the product name might not be found.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML content
of a Lenskart product page.
Returns:
str: The extracted product name with whitespace stripped.
Returns 'Unknown' if no product name is found.
Note:
- The function uses a specific CSS selector that matches Lenskart's HTML structure
- The selector might need to be updated if Lenskart changes their webpage layout
"""
name_tag = soup.select_one('div.ProductDetailsHeader--l9eycm.qeAcO > div > h1')
return name_tag.text.strip() if name_tag else 'Unknown'
The `parse_product_name` function extracts one of the most important pieces of information on a product page: the product name. The `soup` argument is a BeautifulSoup object, a parsed representation of the HTML content of a Lenskart product page that lets us search and navigate the page's structure. The function locates the piece of HTML holding the product name using a CSS selector. This selector, `'div.ProductDetailsHeader--l9eycm.qeAcO > div > h1'`, acts like a set of instructions telling BeautifulSoup exactly where to look within the HTML structure: "Find a div with classes 'ProductDetailsHeader--l9eycm' and 'qeAcO', then look for another div inside it, then find an h1 tag inside that." It is a very specific selector, tuned to Lenskart's particular HTML structure. If the element exists, the function fetches its text content via the `text` attribute and removes extra whitespace with the `strip()` method. This cleaning step matters because HTML often contains extra spaces or newlines that we do not want in our final data. If the function cannot find the element, perhaps because Lenskart changed the page structure, it does not crash; it gracefully handles the situation and returns 'Unknown'. This ensures the scraper keeps running even when it encounters pages with unexpected structures, while also alerting us to selectors that may need updating. The product name the function extracts is important for identifying what we have scraped and will be a key field in our database.
parse_product_details Function
def parse_product_details(soup):
"""
Extract detailed technical specifications from a Lenskart product page.
This function parses the technical details section of a product page,
extracting key-value pairs of product specifications. It handles various
HTML structures for both regular text and linked values.
Args:
soup (BeautifulSoup): A BeautifulSoup object containing the parsed HTML
of the technical details section.
Returns:
dict: A dictionary where keys are specification names and values are
the corresponding specification details.
Note:
- The function looks for specific CSS classes that match Lenskart's HTML structure
- It handles both <span> and <a> elements for specification values
- Empty or invalid specifications are skipped
"""
product_details = {}
items = soup.find_all('div', class_='TechnicalInfostyles__ItemContainer-sc-j03lii-4 gQJSQF')
for item in items:
key_div = item.find('div', class_='ItemKey--1syx3ni')
key = key_div.get_text(strip=True) if key_div else None
value_div = item.find('span', class_='ItemValueSpan--1dnd1l8') or item.find('a', class_='ItemValue--xmlemn')
value = value_div.get_text(strip=True) if value_div else None
if key and value:
product_details[key] = value
return product_details
The `parse_product_details` function is where we properly dig into the product page and extract all of its technical specifications. Like `parse_product_name`, it takes a BeautifulSoup object as its argument, but it does a lot more with it, handling product detail sections that mix different kinds of information. It starts with an empty dictionary called `product_details`, our container for all the specifications we find. The function then searches the soup for all div elements with the specific classes Lenskart uses for specification items. For each item, it looks for two things: a key, the name of the specification, and a value, the details of that specification. The key lives in a div with class 'ItemKey--1syx3ni', while the value is either in a span with class 'ItemValueSpan--1dnd1l8' or in an anchor tag with class 'ItemValue--xmlemn'; this flexibility lets the function handle both text-based and link-based specifications. It strips extra whitespace from both ends of each string and adds the pair to the `product_details` dictionary; if either the key or the value is missing, the specification is skipped. This makes the function robust to slight variations or inconsistencies in the page structure. The result is a dictionary filled with all of the product's specifications, ready to be stored in our database or analyzed. This detailed information is what gives the scraper a comprehensive view of every product's characteristics.
scrape_with_playwright Function
async def scrape_with_playwright(url, other_columns, retry_attempts=3):
"""
Asynchronously scrape product information from a Lenskart URL using Playwright.
This function handles the complete scraping process for a single URL:
1. Loads and selects a random user agent
2. Launches a headless browser using Playwright
3. Navigates to the URL and waits for content to load
4. Extracts basic product information
5. Clicks to reveal technical details and extracts them
6. Retries on failure up to the specified number of attempts
Args:
url (str): The Lenskart product URL to scrape
other_columns (dict): Additional data to include in the result,
typically metadata from the database
retry_attempts (int): Number of times to retry scraping if it fails.
Defaults to 3
Returns:
dict: A dictionary containing the scraped data, or None if all attempts fail.
The dictionary includes:
- All key-value pairs from other_columns
- 'product_url': The original URL
- 'name': The extracted product name
- 'data': A dictionary of technical specifications
Note:
- Requires a 'user_agent.txt' file to be present
- Uses a hardcoded cookie for authentication
- Implements waits and timeouts to handle dynamic content loading
- Closes the browser after scraping to prevent resource leaks
"""
attempt = 0
while attempt < retry_attempts:
try:
user_agents = load_user_agents('user_agent.txt')
if not user_agents:
print("No user agents available. Please check the 'user_agent.txt' file.")
return {}
random_user_agent = get_random_user_agent(user_agents)
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
user_agent=random_user_agent,
extra_http_headers={
'Cookie': '__cf_bm=S_vK9dldsy.r.QBSG26mIpxPigZHS2KHto_5HJ2H0Ts-1722228647-1.0.1.1-_AE_gAnZUVHlP4IyuB8wrKGYnJ5M79grVG25p4UJ.k00QwmZl1G1vnJVDP.73hWKOZNuoIBaQSZwcKs2RzTBtw; __cfruid=1d3b4ec42751ee0d5cda4ddb22461dad7dc30704-1722228647'
}
)
page = await context.new_page()
await page.goto(url, wait_until='domcontentloaded', timeout=WAIT_TIMEOUT)
content = await page.content()
soup = BeautifulSoup(content, 'html.parser')
await page.wait_for_selector('#technicalID > div.TechInfoLink--1b082xk.bSGLKE')
await page.click('#technicalID > div.TechInfoLink--1b082xk.bSGLKE')
await page.wait_for_timeout(TECHNICAL_DETAILS_WAIT)
content2 = await page.content()
soup2 = BeautifulSoup(content2, 'html.parser')
await browser.close()
return {
**other_columns,
'product_url': url,
'name': parse_product_name(soup),
'data': parse_product_details(soup2)
}
except Exception as e:
print(f"Error occurred while scraping {url}. Attempt {attempt + 1} of {retry_attempts}. Error: {e}")
attempt += 1
return None
The `scrape_with_playwright` function is where all the pieces come together to actually fetch and extract information from a Lenskart product page. The function is asynchronous, which allows it to yield control during time-consuming operations (such as loading a web page) without blocking the rest of the program. It takes three parameters: the URL to scrape, any additional data we want to include with the scraped information, and how many times to retry if scraping fails. The function first loads our list of user agents and chooses one at random, then uses Playwright, a powerful browser automation tool, to launch a headless browser. "Headless" means the browser runs in the background without opening a visible window. The browser context is configured with our random user agent and a specific cookie, which helps with accessing the site. The function navigates to the URL and waits for the page to load, extracts the initial HTML content, and applies `parse_product_name` to retrieve the product name. Next, it simulates a click on an element to reveal the technical details, waits briefly for this new content to load, and extracts the updated HTML, from which `parse_product_details` pulls all the technical specifications. If any step of this process fails, the function retries up to the specified number of attempts; this retry mechanism makes the scraper more robust against temporary network issues or other intermittent problems. On success, the function returns a dictionary containing the URL, the product name, all the technical details, and any additional data passed in. If all attempts fail, it returns None. This function encapsulates the whole process of visiting a page, interacting with it, and extracting data, providing a high-level interface for the rest of the scraper to use.
get_db_connection Function
def get_db_connection():
"""
Create and return a connection to the SQLite database.
This function establishes a connection to the SQLite database used for
storing scraped data and tracking scraping progress.
Returns:
sqlite3.Connection: A connection object to the SQLite database.
Note:
- The database file is assumed to be named 'lenskart_data.db'
- The connection should be closed after use to prevent resource leaks
"""
return sqlite3.connect('lenskart_data.db')
The `get_db_connection` function looks simple, but it is a workhorse of the scraper: its only purpose is to open a connection to the SQLite database we use. SQLite is a lightweight, file-based database system, which is quite sufficient when a full database server is not needed. The function calls `sqlite3.connect()` to connect to the file 'lenskart_data.db'; if the file does not exist, SQLite creates it for us. By consolidating connection creation in this one function, every part of the scraper works with the same database in exactly the same way, which helps ensure data integrity. Moreover, if we ever need to change the database, for example switching to a different system or file name, we only need to update this one function rather than rewriting code throughout the scraper. It is a small function, but it embodies an important programming principle: Don't Repeat Yourself (DRY). Tying database access to a single connection point makes the code more maintainable and avoids the errors that inconsistent connection handling could cause.
setup_database Function
def setup_database():
"""
Set up the database schema, adding necessary tables and columns.
This function ensures that all required database structures exist:
1. Adds a 'scraped' column to the existing data table if it doesn't exist
2. Creates a scraped_data table for storing successfully scraped information
3. Creates an error_urls table for logging failed scraping attempts
Note:
- This function should be called before starting the scraping process
- It uses SQL transactions to ensure database consistency
- Existing tables and columns are not modified
"""
conn = get_db_connection()
cursor = conn.cursor()
# Check and add 'scraped' column to data table
cursor.execute('PRAGMA table_info(data)')
columns = cursor.fetchall()
column_names = [column[1] for column in columns]
if 'scraped' not in column_names:
cursor.execute('ALTER TABLE data ADD COLUMN scraped INTEGER DEFAULT 0')
print("Added 'scraped' column to data table")
# Create tables for scraped data and error tracking
cursor.execute('''
CREATE TABLE IF NOT EXISTS scraped_data
(url TEXT, name TEXT, data TEXT, original_id INTEGER,
FOREIGN KEY(original_id) REFERENCES data(id))
''')
cursor.execute('''
CREATE TABLE IF NOT EXISTS error_urls
(url TEXT, original_id INTEGER,
FOREIGN KEY(original_id) REFERENCES data(id))
''')
conn.commit()
conn.close()
The `setup_database` function is our database architect: its job is to make sure the SQLite database has every structure needed to store the scraped data. Because it is idempotent, we can call it as many times as we want without changing the result beyond the first application. It first opens a connection to the database using our `get_db_connection` function, then checks and, if necessary, changes the database structure. First, it checks whether the main 'data' table contains a column named 'scraped', which tracks which URLs have already been processed; if the column does not exist, the function adds it. Next, it creates two tables if they do not already exist: 'scraped_data' and 'error_urls'. The 'scraped_data' table holds all the information we successfully scrape, including the URL, product name, and detailed data. The 'error_urls' table logs any URLs we were unable to scrape, which is valuable for debugging and retries. Both include foreign key references to the original 'data' table, keeping everything connected. The function commits its changes as a single transaction, so if any part of the setup fails, none of the changes are applied and the database stays in a consistent state. We run this function right before we start scraping to make sure the database is ready to receive and store our scraped data.
save_to_db Function
def save_to_db(data, success=True):
"""
Save scraped data or error information to the database.
This function handles both successful and failed scraping attempts:
- For successful scrapes, it saves the product details
- For failed scrapes, it logs the URL in the error_urls table
- In both cases, it updates the 'scraped' status in the original data table
Args:
data (dict): A dictionary containing the data to be saved. Must include:
- 'product_url': The URL that was scraped
- 'id': The ID from the original data table
If success is True, must also include:
- 'name': The product name
- 'data': A dictionary of product details
success (bool): Whether the scraping attempt was successful
Note:
- This function uses transactions to ensure database consistency
- It assumes all required tables exist in the database
"""
conn = get_db_connection()
cursor = conn.cursor()
if success:
# Save successful scrape
cursor.execute('''
INSERT INTO scraped_data (url, name, data, original_id)
VALUES (?, ?, ?, ?)
''', (data['product_url'], data['name'], str(data['data']), data['id']))
else:
# Log failed scrape
cursor.execute('''
INSERT INTO error_urls (url, original_id)
VALUES (?, ?)
''', (data['product_url'], data['id']))
# Update scraping status
cursor.execute('''
UPDATE data
SET scraped = 1
WHERE id = ?
''', (data['id'],))
conn.commit()
conn.close()
The `save_to_db` function is our data librarian, responsible for saving the results of our scraping efforts to the database. It handles successful scrapes as well as failures, guaranteeing a complete record of the scraping process. The function takes two parameters: `data`, a dictionary with all the information we want to save, and `success`, a boolean flag indicating whether the scrape succeeded. It opens a connection to the database and then, depending on the `success` flag, performs one of two actions. For a successful scrape, it inserts a new row into the 'scraped_data' table containing the scraped URL, the product name, a string representation of the detailed data, and a reference to the original data entry. If the scrape failed, it inserts a row into the 'error_urls' table instead, logging the failed URL along with the original data reference. In either case, the function updates the 'scraped' status in the original 'data' table to 1, marking the URL as processed; this update is what lets the main scraping loop know which URLs have already been attempted. The function uses parameterized queries for all of these database operations, a good security practice that protects against SQL injection and keeps the database interaction both functional and secure. Once the operations have executed, the function commits the transaction, ensuring the changes are preserved. By handling both successes and failures, this function gives us a complete picture of the scraping process, which is invaluable for monitoring progress, identifying problems, and ensuring no URL is missed or repeated.
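Because the technical details are stored with `str()`, as a Python dict literal rather than strict JSON, reading them back later is easiest with `ast.literal_eval`. A hedged sketch (the 'Frame Material' key is an assumed example):
import ast

conn = get_db_connection()
row = conn.execute("SELECT name, data FROM scraped_data LIMIT 1").fetchone()
conn.close()

if row:
    name, raw_details = row
    details = ast.literal_eval(raw_details)      # parse the stored dict literal back into a dict
    print(name, details.get("Frame Material"))   # assumed key, purely for illustration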
main Function
async def main():
"""
Main function to orchestrate the scraping process.
This function:
1. Sets up the database
2. Retrieves all unscraped URLs from the database
3. Attempts to scrape each URL
4. Saves the results (both successful and failed) back to the database
The function processes URLs sequentially to avoid overwhelming the target website.
It handles both successful scrapes and failures, ensuring all attempts are logged.
Note:
- This function should be run with asyncio.run()
- It assumes the database has been properly initialised with URLs to scrape
- Progress is automatically saved after each URL is processed
"""
setup_database()
conn = get_db_connection()
cursor = conn.cursor()
# Get unscraped URLs
cursor.execute('SELECT * FROM data WHERE scraped = 0')
rows = cursor.fetchall()
column_names = [description[0] for description in cursor.description]
for row in rows:
# Process each URL
row_dict = dict(zip(column_names, row))
url = row_dict.pop('url')
row_dict.pop('scraped')
print(f"Scraping URL for ID {row_dict['id']}: {url}")
extracted_data = await scrape_with_playwright(url, row_dict)
if extracted_data:
save_to_db(extracted_data, success=True)
else:
save_to_db({'product_url': url, **row_dict}, success=False)
conn.close()
The `main` function is the conductor of our scraping orchestra, gathering all the individual pieces we've created and coordinating them to execute the full scrape. It is declared `async`, so it can await other asynchronous functions in our code, especially `scrape_with_playwright`. When the program is executed, `main` first calls `setup_database()` to prepare the database, then opens a connection and creates a cursor ready to execute SQL commands. Next it retrieves all unscraped URLs by selecting every row from the 'data' table where the 'scraped' column equals 0. Inside the main loop, each row is zipped with the column names into a dictionary, the URL is popped out, and a status message is printed before `scrape_with_playwright` is called to attempt the scrape. If `scrape_with_playwright` returns data, as it does on a successful scrape, `main` calls `save_to_db` with that data and `success=True`. On a failed scrape, `scrape_with_playwright` returns None, and `main` still calls `save_to_db`, this time with `success=False` and only the original row data, so every attempt is recorded in our database. By processing URLs one at a time, `main` avoids overwhelming the target website with too many simultaneous requests; sequential processing also makes it easier to track progress and handle any errors that occur. Once all URLs have been processed, the function closes the database connection, cleaning up resources. `main` encapsulates the whole scraping process, from setup to execution to cleanup, providing a single entry point for running the scraper.
Script Execution
if __name__ == "__main__":
asyncio.run(main())
This section checks if the script is being run directly (not imported as a module) and if so, it runs the main function using asyncio.run(). This is the entry point of the script that kicks off the entire scraping process.
Connect with Datahut for top-notch web scraping services that bring you the information you need, hassle-free.
FAQ SECTION
1. What is automated data collection, and how is it relevant to analyzing Lenskart?
Automated data collection involves using scripts or tools to extract structured information from websites like Lenskart. It simplifies gathering insights on product pricing, customer preferences, and trends. Companies like Datahut specialize in providing customized data scraping solutions to meet such analytical needs.
2. Can data scraping help me effectively track Lenskart’s pricing trends and discounts?
Yes! Data scraping is a powerful tool to monitor and analyze real-time pricing, seasonal discounts, and promotional strategies on platforms like Lenskart. With services like those offered by Datahut, businesses can gain actionable insights to stay competitive.
3. Is it legal to scrape data from e-commerce websites like Lenskart?
Web scraping is legal when done responsibly and ethically, following the website’s terms of service. Partnering with experienced providers like Datahut ensures compliance with legal and ethical standards while collecting valuable data.
4. How can I benefit from outsourcing data scraping for Lenskart analytics?
Outsourcing data scraping to experts such as Datahut saves time, ensures accuracy, and allows you to focus on analyzing the data rather than collecting it. For Lenskart analytics, this means a streamlined process to gather insights on pricing, product trends, and market positioning.