Gap.com, the online storefront for Gap Inc., is a major player in the fashion industry, offering a wide array of apparel and accessories for diverse consumers. Given its prominence, Gap's offerings often serve as industry benchmarks in areas ranging from pricing to marketing strategies. In fact, many of Gap's competitors utilize data scraping tools to keep an eye on its inventory, seeking insights into market trends.
As a web scraping company, we frequently receive requests to extract data from Gap. In this tutorial, we will walk you through the process of scraping the 'Men's New Arrivals' category from Gap.com. By doing so, we aim to illuminate the potential insights one can derive, such as tracking fashion item popularity, analyzing pricing dynamics, and gaining a deeper understanding of consumer preferences in the retail landscape.
Why Choose Gap for Web Scraping?
Fashion is an industry where Datahut frequently collaborates with clients. Among the various e-commerce platforms, Gap stands out as a recurring request for our web scraping services. Scraping data from Gap.com can provide many valuable insights, such as:
Market Intelligence: Understand which products are trending, their pricing, promotions, and inventory levels.
Consumer Preferences: Gauge which items are customer favorites and tap into the voice of the consumer.
Trend Spotting: Identify and anticipate current and upcoming market trends.
Pricing Analysis: Examine product pricing strategies, including the impact of discounts and other promotional tactics.
Attributes we’re going to Scrape
Before we start scraping product data, we must decide which data to extract from each product page. The following product information is extracted from each product detail page in the Men's New Arrivals category:
Product URL: The URL of the product
Product Type: The Category/Type of the product
Product Name: The name of the product
Selling Price: The current selling price of the product
Max Retail Price: The maximum retail price of the product
Rating: The average rating of the product
Number of Ratings: The number of ratings available for the product
Color: The color of the product
Available Sizes: The list of all sizes the product is available in
Fit and Sizing: The fitting and size details of the product
Product Details: Remaining product details, excluding material information
Fabric and Care: Details on product material
Let's dive into our guide to scrape Men New Arrivals from Gap.com.
Importing Required Libraries
The code starts by importing the required Python libraries. These libraries help us to perform various tasks, from controlling web drivers for browser automation to saving data in a tabular form. In our program, we are using a total of seven different libraries, each for a specific purpose.
What is a webdriver?
A WebDriver is software that allows you to programmatically control a web browser. Web drivers are essential tools for automating tasks such as web scraping, automated testing of web applications, and other interactions with websites.
a) Browser Interaction: Web drivers allow you to interact with web browsers like Google Chrome, Mozilla Firefox, Microsoft Edge, and others through code. You can open, close, and navigate web pages, click on links and buttons, fill out forms, and extract data from web pages.
b) Cross-Browser Compatibility: Many web drivers are designed to work with multiple web browsers, which is crucial for testing web applications across different browsers to ensure compatibility.
c) Programming Languages: Web drivers are usually available in various programming languages, such as Python, Java, JavaScript, C#, and more, making it accessible to developers with different language preferences.
d) Automation Frameworks: Web drivers are often integrated into test automation frameworks like Selenium, Puppeteer, WebDriverIO, and Playwright, which provide higher-level abstractions and tools for web automation.
e) Headless Browsing: Some web drivers support headless browsing, which means running a browser without a graphical user interface. Headless browsers are useful for automated tasks where you don't need to see the browser window but still need to interact with web pages.
Here are a few examples of popular web drivers used for web scraping and their associated libraries/frameworks:
1. Selenium WebDriver: Selenium is a popular automation framework that supports multiple programming languages and browsers. Selenium WebDriver allows you to automate interactions with web pages using various programming languages.
2. Puppeteer: Puppeteer is a Node.js library developed by Google for controlling Chrome or Chromium browsers. It's often used for web scraping and automated testing.
3. Playwright: Playwright is a relatively newer automation library developed by Microsoft that supports multiple browsers (Chromium, Firefox, and WebKit). It's designed for browser automation and is similar to Puppeteer but more versatile in terms of browser support.
Web drivers have revolutionized web scraping. Professional web scraping services use them to build Amazon product scrapers and scrapers for other e-commerce websites.
# importing required libraries
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
import re
import random
import unicodedata
import pandas as pd
Selenium Web Driver
We first import the Selenium Web driver. WebDriver from Selenium is a tool for automating web browser interactions. It allows us to interact with web browsers in a way that mimics a human user. This can automate tasks such as navigating websites, clicking buttons, and filling out forms.
In our program, we use WebDriver to scrape data from a dynamic website. This website requires us to scroll past elements for the products to load properly. We use the Chrome WebDriver, but there are other WebDrivers available for different browsers.
Here are some additional details about how WebDriver can be used for web scraping:
WebDriver can be used to scrape data from websites that render their content with JavaScript.
WebDriver can be used to scrape data from websites that require user authentication.
WebDriver is a powerful tool that can automate various tasks on the web. It is a valuable tool for web scraping and other applications.
Have a time delay between requests
The sleep() function is a Python function that pauses the execution of the program for a specified number of seconds. This can be used to introduce delays in the program, such as between successive requests in a web scraping script.
In web scraping, the sleep() function can be used to avoid overloading the website's server. If you make too many requests to a website in a short period of time, the server may become overloaded and may block your requests. Using the sleep() function, you can space out your requests and give the server time to recover.
The sleep() function can also be used to mimic human behavior. When humans browse websites, they don't click on links or submit forms instantly. They typically take a few seconds to read the page and decide what to do next. Using the sleep() function, you can make your web scraping script more human-like and less likely to be blocked by the website's server.
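The idea of spacing out requests can be sketched as follows. This is a minimal illustration, not the tutorial's own code: the helper name run_with_pauses and the stand-in "pages" are hypothetical, and the delays are kept tiny so the sketch runs quickly. A real scraper would use delays of several seconds.

```python
import random
import time


def run_with_pauses(tasks, min_delay=0.01, max_delay=0.02, sleep=time.sleep):
    """Call each task in turn, pausing a random interval between calls."""
    results = []
    for position, task in enumerate(tasks):
        if position > 0:
            # Pause only between calls, not before the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(task())
    return results


# Stand-ins for real page requests (hypothetical; no network access here).
pages = [lambda n=n: f"page {n}" for n in range(3)]
print(run_with_pauses(pages))
```

Passing the sleep function as a parameter is a design choice that makes the pause easy to replace in tests.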
What is BeautifulSoup or bs4
Beautiful Soup is a Python library used for parsing HTML and XML documents. It creates a parse tree from the document, from which data can then be extracted.
Beautiful Soup is a popular choice for web scraping because it is easy to use and has many features. A typical workflow is to parse the HTML of a page first and then use Beautiful Soup to find the desired data.
Here are some of the benefits of using Beautiful Soup for web scraping:
It is easy to use.
It has a wide range of features.
It is fast and efficient.
It is well-documented.
There is a large community of users and developers.
In this tutorial, we use Beautiful Soup to parse the HTML content of a website. We specify the lxml parser, which is a fast and powerful parser that is well-suited for web scraping.
The lxml parser creates a parse tree of the HTML document, a hierarchical representation of the document's structure. This parse tree can then be used to extract data from the document.
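A small sketch of parsing and navigating a parse tree is shown here. The HTML snippet is made up for illustration and is not Gap's real markup, and the sketch uses Python's built-in "html.parser" so that it runs without lxml installed; the tutorial's own code passes "lxml" instead.

```python
from bs4 import BeautifulSoup

# A made-up product card, used only for illustration.
html = """
<div class="product">
  <a href="/browse/product.do?pid=123">Example Tee</a>
  <span class="price">$19.95</span>
</div>
"""

# "html.parser" ships with Python; the tutorial itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Navigate the tree: find the first link, then read its text and href.
link = soup.find("a")
name = link.get_text(strip=True)
href = link.get("href")

# Find an element by tag name and class.
price = soup.find("span", class_="price").get_text(strip=True)

print(name, href, price)
```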
What are Regular expressions (or regex)
Regular expressions (or regex) are a powerful tool for pattern matching and text manipulation. They can be used to find specific patterns in text, extract data, and perform other tasks. Web scraping companies use it all the time.
The re module in Python is a built-in library that supports regular expressions. It can be used to perform a variety of tasks, such as:
Finding specific patterns in text
Extracting data from text
Replacing text
Splitting text into substrings
Matching multiple patterns
In the context of web scraping, regular expressions can be used to:
Find the desired data in the HTML source code of a web page
Extract the data from the HTML source code
Validate the data
Clean the data
Regular expressions can be a bit tricky to learn, but they are a powerful tool that can be used to automate various tasks in web scraping.
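The operations listed above can be sketched with the re module as follows. The sample text is invented for illustration and does not come from Gap.com.

```python
import re

# Invented sample text, for illustration only.
text = "Slim Jeans - Now $39.97 (was $59.95), sizes: 30,32 ;34"

# Finding a pattern: the first dollar amount in the text.
first_price = re.search(r"\$(\d+\.\d{2})", text).group(1)

# Extracting data: every dollar amount in the text.
all_prices = re.findall(r"\$(\d+\.\d{2})", text)

# Replacing text: drop anything inside parentheses.
without_parens = re.sub(r"\s*\([^()]*\)", "", text)

# Splitting text: separators may be commas or semicolons with stray spaces.
sizes = re.split(r"\s*[,;]\s*", "30,32 ;34")

print(first_price, all_prices, without_parens, sizes)
```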
The random Module
The random module in Python is a built-in library that provides functions for generating random numbers. It can be used to generate random numbers in a variety of ways. In our case, we use it to create a random delay between successive requests.
unicodedata
Unicode is a character standard that assigns a code point to characters from many writing systems, including Latin, Greek, Chinese, and many others. It is the standard used by most modern computers to represent text.
The unicodedata module in Python provides access to the Unicode Character Database. This database contains information about all the characters in the Unicode standard, such as their names, code points, and properties.
In the context of web scraping, the unicodedata module can be used to:
Handle text encoding issues, such as when the text is encoded in a different encoding than your code is expecting.
Remove non-breaking spaces and other invisible characters from the text.
Clean up the text by converting it to a standard form.
The unicodedata module is a powerful tool that can be used to handle text encoding issues in web scraping.
In this tutorial, we use the unicodedata module to normalize the text scraped from product pages. That text contains zero-width space characters (U+200B), which are invisible characters that mark possible line-break points. They can cause problems in web scraping because they silently end up inside the extracted strings. NFKD normalization converts compatibility characters such as non-breaking spaces to their standard equivalents, but zero-width spaces may still need to be removed explicitly to keep the text clean and easy to parse.
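A short sketch of what NFKD normalization does to invisible characters is shown here. The sample string is invented. As far as we can tell, NFKD maps a non-breaking space to a regular space but leaves the zero-width space untouched, so the sketch strips it with an extra replace step; that extra step is our assumption and is not part of the tutorial's code.

```python
import unicodedata

# Invented sample: a non-breaking space (U+00A0) and a zero-width space (U+200B).
raw = "100%\u00a0cotton\u200b. Machine wash"

normalized = unicodedata.normalize("NFKD", raw)

# NFKD maps the non-breaking space to a regular space, but U+200B has no
# decomposition, so it is removed explicitly here.
cleaned = normalized.replace("\u200b", "")

print(repr(normalized))
print(repr(cleaned))
```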
Pandas: For working with tabular data
Pandas is a Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is commonly used for data cleaning, manipulation, and analysis.
The DataFrame data structure in Pandas is a tabular data structure similar to a spreadsheet. It is used to store and organize data in rows and columns.
In the context of web scraping, the DataFrame data structure can be used to:
Store the extracted data from a web page
Clean and manipulate the data
Analyze the data
Export the data to a file
We use the .to_csv() method in Pandas to export a DataFrame to a CSV file. A CSV file is a text file that stores tabular data in rows and columns, separated by commas.
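A minimal sketch of building a DataFrame and exporting it as CSV follows. The two rows are invented, and the sketch writes to an in-memory buffer instead of a file so that the output can be inspected directly; passing a filename to to_csv() works the same way.

```python
import io

import pandas as pd

# Invented rows, for illustration only.
rows = [
    {"Product_Name": "Example Tee", "Selling_Price": "19.95"},
    {"Product_Name": "Example Jeans", "Selling_Price": "59.95"},
]
df = pd.DataFrame(rows, columns=["Product_Name", "Selling_Price"])

# index=False leaves the row index out of the CSV, as in the tutorial's script.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Collecting rows as a list of dictionaries and creating the DataFrame once is a common alternative to assigning rows one at a time.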
Global Constants
After importing the needed libraries, we initialize some global variables used in our script.
The following constants are defined at the beginning:
# Global Constants
# output csv file name
CSV_FILENAME = 'gap_men_new_arrivals.csv'
# PLP stands for Product Listing Page
# It is the page in which the list of products is available
PLP_URL = "https://www.gap.com/browse/category.do?cid=11900&department=75"
CSV_FILENAME: The name of the output CSV file where data will be stored.
PLP_URL: The URL of the web page from where scraping will start.
Initialize the WebDriver
Gap.com is a dynamic website, meaning most product elements are loaded only after we scroll past them. Thus, we use Selenium to create an instance of the Chrome WebDriver to navigate the page and trigger the loading of every element.
# Initialize the WebDriver
def initialize_driver():
    """
    return: WebDriver instance (Chrome in this case)
    """
    driver = webdriver.Chrome()
    return driver
The initialize_driver() function creates an instance of the Chrome WebDriver, which is used to control the web browser.
In main(), the instance returned by initialize_driver() is stored in a variable named driver.
The driver then navigates to the target page, Men's New Arrivals at Gap.
Scroll Down the Page
After reaching the target page, we need to reach the true bottom of the page. Because the page uses infinite scrolling, this is what loads every product element.
# Scroll down the page to load more products
def scroll(driver, times=1):
    """
    our target site is a responsive site with infinite scrolling
    infinite scrolling is a technique where the page keeps on loading
    more content as the user scrolls down
    this is done to avoid pagination and to provide a seamless user
    experience
    thus we need to scroll down the page to load more products
    our function executes in such a way that after going to the bottom
    of a page we then go to the middle
    this is done in case some elements were returned
    here we are using javascript to scroll down to various parts of the
    page
    :param driver: WebDriver instance
    :param times: number of times to scroll down the page, default
    value is 1
    """
    execution_count = 0
    while execution_count < times:
        random_sleep()
        # bottom
        driver.execute_script(
            "window.scrollTo(0,document.body.scrollHeight)"
        )
        random_sleep()
        # middle
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight / 2)"
        )
        execution_count += 1
    # top
    driver.execute_script(
        "window.scrollTo(0, 0)"
    )
    random_sleep()
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight / 2)"
    )
    random_sleep()
    driver.execute_script(
        "window.scrollTo(0,document.body.scrollHeight)"
    )
The scroll() function simulates scrolling down the web page to load more products. Our target site loads content lazily, so not all products are present at the beginning. Products appear, and their elements are loaded, only after we scroll through or past them.
The function repeats the scroll as many times as the times parameter specifies.
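A common alternative to scrolling a fixed number of times is to keep scrolling until the page height stops growing. The sketch below shows that idea; it is not the tutorial's method. The function name scroll_until_stable and the FakeDriver class are hypothetical, and the fake driver exists only so the sketch runs without a browser. With a real Selenium driver, only execute_script() is needed.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        sleep(pause)  # give lazily loaded content time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so this is the true bottom
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # A scroll "loads" the next batch until no more heights remain.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


rounds = scroll_until_stable(
    FakeDriver([1000, 2000, 3000]), pause=0, sleep=lambda seconds: None
)
print(rounds)
```

The max_rounds cap is a safeguard against pages that never stop loading new content.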
Mimicking human behavior
Random delays are set in the script so human behavior can be somewhat mimicked.
# Random sleep to mimic human behavior
def random_sleep(min_time=4, max_time=7):
    """
    here uniform function is used which takes decimal values as well and not just whole numbers
    thus showing somewhat more natural human behavior
    :param min_time: minimum time to sleep, default value is 4
    :param max_time: maximum time to sleep, default value is 7
    delays program execution for a random amount of time between min_time and max_time
    we use a range of 4-7 seconds as selenium sometimes requires quite a bit of time to load the page
    """
    sleep_time = random.uniform(min_time, max_time)
    sleep(sleep_time)
The random_sleep() function uses the random library to pick a number between four and seven, and then pauses the script for that many seconds.
Selenium WebDrivers can at times take a while to load a web page completely. To account for this, we use a range of four to seven seconds for the random sleep.
To mimic more natural human behavior, we use the uniform() function of the random module, which returns decimal numbers and not just whole numbers.
Get BeautifulSoup Object
After activating the javascript components of the page, the BeautifulSoup library is used to parse the HTML source code of the web page using the “lxml” parser.
# Get BeautifulSoup object from the current page
def get_soup(driver):
    """
    :param driver: WebDriver instance
    :return: BeautifulSoup object
    """
    return BeautifulSoup(driver.page_source, 'lxml')
The get_soup() function builds a BeautifulSoup object from the HTML source of the page currently loaded in the driver. BeautifulSoup is used for parsing and navigating the HTML content.
Main Program Flow
# Main function
def main():
    """
    begins with initializing the WebDriver
    then goes to the PLP_URL
    - PLP means product listing page and it is the page in which the
    list of products is available
    then scrolls down the page to load more products
    then gets the BeautifulSoup object from the current page
    then gets each product element from the main page
    then extracts the product url from each product element and stores
    it in a list
    then initializes a pandas dataframe with the required columns
    then iterates through each product and extracts information
    then stores the information in the initialized pandas dataframe
    then prints the progress, which is the count of the current product
    after going through every url writes the dataframe to the CSV file
    then quits the WebDriver
    in the above description each line corresponds to each section of
    the main function which is separated by a blank line
    """
    driver = initialize_driver()

    driver.get(PLP_URL)

    scroll(driver, times=4)

    soup = get_soup(driver)

    product_info = soup.find_all('div', class_='category-page-1wcebst')

    pdp_url_list = []
    for product in product_info:
        pdp_url = extract_pdp_url(product)
        pdp_url_list.append(pdp_url)

    df = pd.DataFrame(columns=
        ['Product_URL', 'Product_Type', 'Product_Name',
         'Selling_Price', 'Max_Retail_Price', 'Rating',
         'Rating_Count', 'Color', 'Available_Sizes',
         'Fit_Sizing', 'Product_Details', 'Fabric_Care']
    )

    for index, pdp_url in enumerate(pdp_url_list, start=1):
        if pdp_url != 'Not available':
            driver.get(pdp_url)
            random_sleep()
            soup = get_soup(driver)
            product_type = extract_product_type(soup)
            product_name = extract_product_name(soup)
            selling_price, max_retail_price = extract_prices(soup)
            star_value = extract_star_value(soup)
            ratings_count = extract_ratings_count(soup)
            color = extract_color(soup)
            available_sizes = extract_available_sizes(soup)
            details = extract_details(soup)

            df.loc[index] = [
                pdp_url, product_type, product_name,
                selling_price, max_retail_price, star_value,
                ratings_count, color, ', '.join(available_sizes),
                ', '.join(details[0]), ', '.join(details[1]),
                ', '.join(details[2])
            ]

            print(index)

    df.to_csv(CSV_FILENAME, index=False)
    print(f"Data written to {CSV_FILENAME}")

    driver.quit()


# Run the main function if the script is executed directly
if __name__ == "__main__":
    main()
The main() function is where the main scraping process happens by calling all the defined functions.
Storing the extracted data also takes place within the main() function.
The script checks if it is executed directly and not imported as a module. If true, it runs the main() function.
The main() function can be explained in three parts.
Main Program Flow - pdp url extraction
driver = initialize_driver()
driver.get(PLP_URL)
scroll(driver, times=4)
soup = get_soup(driver)
product_info = soup.find_all('div', class_='category-page-1wcebst')
pdp_url_list = []
for product in product_info:
    pdp_url = extract_pdp_url(product)
    pdp_url_list.append(pdp_url)
We use the PLP_URL to go to our target page, Men's New Arrivals at Gap. At the time of writing, the page listed 288 products (the number changes with each batch of new arrivals). We need to obtain the details of all of them.
The BeautifulSoup object is then used to navigate and extract data to obtain the “div” element for all 288 products from the parsed HTML document. We store the div element containing the product inside a product_info variable.
A "div" element is a fundamental HTML element used for structuring and formatting web content. "Div" stands for "division," and it is primarily used to divide or group together content on a web page.
The product_info variable is looped through to obtain the PDP URL of each product.
PDP stands for product description page and contains all the details about the product.
The URLs of all 288 products are obtained and stored in a list. They are collected together at the beginning so that we can refer back to them in case of any issues down the line.
Let's look at the function used to extract the PDP url.
extract_pdp_url(product)
# Extract product url from a product element
def extract_pdp_url(product):
    """
    pdp stands for product description page
    it is the page in which the whole information about the product is
    present
    product url is present in the form -
    'https://www.gap.com/browse/product.do?pid=774933022&cid=11900&pcid=11900&vid=1&nav=meganav%3AMen%3AJust%20Arrived%3ANew%20Arrivals&cpos=116&cexp=2859&kcid=CategoryIDs%3D11900&ctype=Listing&cpid=res23090805504869997736471#pdp-page-content'
    we need to extract the part till the value of pid (inclusive)
    the rest of the url is not needed and can even break the url at a
    later date
    :param product: product element
    :return: pdp url which is a string
    """
    try:
        url = product.find('a').get('href')
        url = url.split('&')[0]
    except:
        url = 'Not available'
    return url
This function takes a product element, which is a <div> element of class “category-page-1wcebst”, and extracts the URL of the product description page by finding the <a> element within it and retrieving its “href” attribute. The obtained link is then cleaned by removing the unnecessary parts.
<a> is an anchor element, commonly known as a hyperlink or link. It is used to create clickable links within web pages. When a user clicks on an <a> element, it typically redirects them to another web page or resource specified in the href attribute.
The URL thus obtained can contain extra values at the end, which can make the URL broken at a later date. To prevent this, we only take the portion of the URL till the value of pid.
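Splitting on “&” works when pid is the first query parameter. A slightly more defensive variant, sketched below with the standard urllib.parse module, picks out pid wherever it appears and also resolves relative links. The function name clean_pdp_url and the BASE_URL constant are our own assumptions for illustration and are not part of the tutorial's script.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit, urlunsplit

# Assumed base used to resolve relative links (hypothetical constant).
BASE_URL = "https://www.gap.com"


def clean_pdp_url(href, base_url=BASE_URL):
    """Return an absolute product URL that keeps only the pid parameter."""
    absolute = urljoin(base_url, href)  # also handles relative hrefs
    parts = urlsplit(absolute)
    pid = parse_qs(parts.query).get("pid")
    if not pid:
        return "Not available"
    query = urlencode({"pid": pid[0]})
    # Drop every other query parameter and the #fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


example = (
    "https://www.gap.com/browse/product.do"
    "?pid=774933022&cid=11900&pcid=11900&vid=1#pdp-page-content"
)
print(clean_pdp_url(example))
```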
Main Program Flow - product data extraction
for index, pdp_url in enumerate(pdp_url_list, start=1):
    if pdp_url != 'Not available':
        driver.get(pdp_url)
        random_sleep()
        soup = get_soup(driver)
        product_type = extract_product_type(soup)
        product_name = extract_product_name(soup)
        selling_price, max_retail_price = extract_prices(soup)
        star_value = extract_star_value(soup)
        ratings_count = extract_ratings_count(soup)
        color = extract_color(soup)
        available_sizes = extract_available_sizes(soup)
        details = extract_details(soup)
        df.loc[index] = [
            pdp_url, product_type, product_name,
            selling_price, max_retail_price, star_value,
            ratings_count, color, ', '.join(available_sizes),
            ', '.join(details[0]), ', '.join(details[1]),
            ', '.join(details[2])
        ]
        print(index)
Product data extraction begins by looping through the list of PDP URLs.
During each iteration, we take the next PDP URL from the list, and the driver navigates to it, provided a URL was found for that product.
The get_soup function then uses BeautifulSoup and “lxml” to parse the HTML source code of the product page. The soup object is then used to extract all the necessary information about the product, using the functions defined for obtaining data about each aspect of the product.
After the data for a product has been extracted, its index is printed to show progress.
Now let's look into the part where data is extracted.
Extracting Product Information
Several functions are defined for extracting different information from product pages, including types, names, prices, ratings, the count of ratings, colors, available sizes, and various details. Here's an explanation of each of the functions:
extract_product_type(soup)
# Extract product type from the product page
def extract_product_type(soup):
    """
    the div with class pdp-mfe-1atmbpz contains the two a tags
    the first a tag contains the product section - men, women, boys,
    baby
    the second a tag contains the product type - jeans, t-shirts,
    shirts, etc.
    we need to extract the second a tag
    :param soup: BeautifulSoup object
    :return: product type which is a string
    """
    try:
        product_type_element = soup.find('div',
            class_='pdp-mfe-1atmbpz'
        )
        product_type_a = product_type_element.find_all('a')
        product_type = product_type_a[1].get_text()
    except:
        product_type = 'Not available'
    return product_type
This function extracts the product type from the product page by looking for a specific <div> element with the class “pdp-mfe-1atmbpz” and extracts the text from the second <a> element within, which corresponds to the product category or type.
extract_product_name(soup)
# Extract product name from the product page
def extract_product_name(soup):
    """
    the h1 tag which contains the product has a different class name
    for each product
    but every h1 tag has the class name starting with pdp-mfe-
    :param soup: BeautifulSoup object
    :return: product name which is a string
    """
    try:
        product_name_element = soup.select('h1[class^="pdp-mfe-"]')
        product_name = product_name_element[0].text
    except:
        product_name = 'Not available'
    return product_name
This function extracts the product's name from the product page. It searches for an <h1> element whose class attribute starts with “pdp-mfe-” and retrieves the text within it. This will provide the product's name.
We match on the “pdp-mfe-” prefix because the <h1> elements holding product names have different class names on different product pages; the “pdp-mfe-” prefix is the only part they share.
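The prefix match can be tried out on a small, made-up snippet. The class names below are invented, and the sketch uses Python's built-in "html.parser" instead of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Invented markup: one heading with a generated-looking class, one without.
html = """
<h1 class="pdp-mfe-abc123">Example Slim Jeans</h1>
<h1 class="site-title">Not a product name</h1>
"""
soup = BeautifulSoup(html, "html.parser")

# [class^="..."] selects elements whose class attribute starts with the text.
matches = soup.select('h1[class^="pdp-mfe-"]')
names = [h1.get_text(strip=True) for h1 in matches]

print(names)
```

As far as we understand, the “^=” operator compares against the whole class attribute, so it is likely to miss an element whose prefixed class is not listed first; “*=” (contains) is a looser option in that case.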
extract_prices(soup)
# Extract product prices from the product page
def extract_prices(soup):
    """
    the price is present in the div with class pdp-pricing pdp-mfe-1x0pbuu
    the price element can contain either a single price or two prices
    when the selling price and max retail price are different, then
    there are two prices in the price element
    selling price element exists only if the selling price and max
    retail price are different
    otherwise the price element contains only a single price and that
    is taken as the selling price
    max retail price element exists only if the selling price and max
    retail price are different
    otherwise the selling price is taken as the max retail price
    re library is used to remove any text within parentheses
    :param soup: BeautifulSoup object
    :return: selling price and max retail price which are strings
    """
    try:
        price_element = soup.find(
            'div',
            class_='pdp-pricing pdp-mfe-1x0pbuu'
        )
        selling_price_element = price_element.find(
            'span',
            class_='pdp-pricing--highlight pdp-pricing__selected pdp-mfe-1x0pbuu'
        )
        if selling_price_element:
            selling_price = selling_price_element.text.strip('$')
        else:
            selling_price = price_element.text.strip('$')
        selling_price = re.sub(r'\([^()]*\)', '', selling_price).strip()
        max_retail_price_element = price_element.find(
            'span',
            class_='product-price__strike pdp-mfe-eyzase'
        )
        if max_retail_price_element:
            max_retail_price = max_retail_price_element.text.strip('$')
        else:
            max_retail_price = selling_price
    except:
        selling_price = 'Not available'
        max_retail_price = 'Not available'
    return selling_price, max_retail_price
This function extracts the product's prices (both the selling price and maximum retail price) from the product page by searching for elements related to pricing and extracting the relevant text.
It also handles cases where only a single price is available and cases where price ranges are given.
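The same price logic can be illustrated on plain strings, without any HTML. This is a simplified sketch under assumptions: the function name split_prices and the sample price texts are invented, prices are assumed to have no thousands separators, and the lower amount is assumed to be the selling price.

```python
import re


def split_prices(price_text):
    """Return (selling_price, max_retail_price) parsed from raw price text."""
    # Remove parenthesised notes such as "(33% off)" before looking for amounts.
    cleaned = re.sub(r"\([^()]*\)", "", price_text)
    amounts = [float(a) for a in re.findall(r"\$(\d+(?:\.\d+)?)", cleaned)]
    if not amounts:
        return None, None
    # With a single amount, both values are the same, mirroring the tutorial's
    # fallback of using the selling price as the max retail price.
    return min(amounts), max(amounts)


print(split_prices("$59.95 $39.97 (33% off)"))
print(split_prices("$24.95"))
```

Returning numbers instead of strings makes later price analysis easier, at the cost of losing the original formatting.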
extract_star_value(soup)
# Extract product rating from the product page
def extract_star_value(soup):
    """
    the span with class pdp-mfe-3jhqep contains the star rating in the
    form - 5 stars, x are filled
    we need to extract the value of x
    :param soup: BeautifulSoup object
    :return: star value which is a string
    """
    try:
        star_value = soup.find('span', class_='pdp-mfe-3jhqep').text
        star_value = star_value.split(',')[1].split(' ')[1]
    except:
        star_value = 'Not available'
    return star_value
This function extracts the product's rating, i.e., the average rating, which is the star count, from the product page by looking for a <span> element with the class “pdp-mfe-3jhqep”, which contains the rating information. It extracts and cleans the rating value from the text.
extract_ratings_count(soup)
# Extract the number of product ratings from the product page
def extract_ratings_count(soup):
    """
    the div with class pdp-mfe-17iathi contains the number of ratings
    in the form - x ratings
    we need to extract the value of x
    :param soup: BeautifulSoup object
    :return: ratings count which is a string
    """
    try:
        ratings_count = soup.find('div', class_='pdp-mfe-17iathi').text
        ratings_count = ratings_count.split(' ')[0]
    except:
        ratings_count = 'Not available'
    return ratings_count
This function extracts the number of product ratings from the product page by searching for a <div> element with the class “pdp-mfe-17iathi” and extracts the first part of the text present within the element.
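The two rating helpers rely on splitting strings at fixed positions. A regex-based variant, sketched below, is one possible way to make the parsing a little more tolerant of spacing and of counts written with commas. The function names and sample strings are hypothetical; only the “5 stars, x are filled” and “x ratings” formats come from the docstrings above.

```python
import re


def parse_star_value(label):
    """Pull the filled-star value out of text like '5 stars, 4.5 are filled'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+are filled", label)
    return float(match.group(1)) if match else None


def parse_ratings_count(label):
    """Pull the leading number out of text like '1,204 ratings'."""
    match = re.match(r"\s*(\d[\d,]*)", label)
    return int(match.group(1).replace(",", "")) if match else None


print(parse_star_value("5 stars, 4.5 are filled"))
print(parse_ratings_count("1,204 ratings"))
```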
extract_color(soup)
# Extract product color from the product page
def extract_color(soup):
    """
    the span with class swatch-label__value contains the color of the
    product
    :param soup: BeautifulSoup object
    :return: color which is a string
    """
    try:
        color = soup.find('span', class_='swatch-label__value').text
    except:
        color = 'Not available'
    return color
This function extracts the product's color from the product page by looking for a <span> element with the class “swatch-label__value” and extracts the text within it.
extract_available_sizes(soup)
# Extract available sizes from the product page
def extract_available_sizes(soup):
    """
    the div with class
    pdp-mfe-17f6z2a pdp-dimension pdp-dimension--should-display-redesign-in-stock
    contains the available sizes
    the available sizes are stored into a list
    in cases where there is no size available, the div with class
    pdp-mfe-17f6z2a pdp-dimension pdp-dimension--should-display-redesign-in-stock
    is not present
    in such cases we return a list with 'Not applicable' as the only
    element
    this can be seen in case of accessories such as bags
    :param soup: BeautifulSoup object
    :return: available sizes which is a list
    """
    try:
        available_sizes_element = soup.find_all('div', class_='pdp-mfe-17f6z2a pdp-dimension pdp-dimension--should-display-redesign-in-stock')
        available_sizes = []
        for size in available_sizes_element:
            available_sizes.append(size.text)
    except:
        available_sizes = ['Not available']
    if not available_sizes:
        available_sizes = ['Not applicable']
    return available_sizes
This function extracts the available sizes for the product from the product page by searching for specific elements related to size information, which is a <div> element with a class “pdp-mfe-17f6z2a pdp-dimension pdp-dimension--should-display-redesign-in-stock” and extracts the text for each available size as a list of values.
Certain products, such as bags and caps, usually have no sizes. In such cases we use the value ‘Not applicable’.
extract_details(soup)
# Extract product details from the product page
def extract_details(soup):
    """
    the product details are present in the form of a list
    there are three sets of details - fit and sizing, product details,
    fabric and care
    each set of details is present in a ul tag with class name starting
    with product-information-item__list
    the text obtained is then normalized
    normalizing means converting the special characters to their normal
    form
    in our case we can particularly see zero width space characters
    (u200b) in the text
    :param soup: BeautifulSoup object
    :return: fit and sizing, product details, fabric and care which are
    lists
    """
    try:
        details_elements = soup.select('ul[class^="product-information-item__list"]')
        if len(details_elements) == 3:
            fit_sizing_element = details_elements[0].find_all('li')
            fit_sizing = []
            for detail in fit_sizing_element:
                if 'wearing' not in detail.text:
                    text = unicodedata.normalize(
                        "NFKD",
                        detail.text
                    ).rstrip('. ')
                    fit_sizing.append(text)
            product_details_element = details_elements[1].find_all('li')
            product_details = []
            for detail in product_details_element:
                if '#' not in detail.text and 'P.A.C.E.' not in detail.text and 'pace' not in detail.text:
                    text = unicodedata.normalize(
                        "NFKD",
                        detail.text
                    ).rstrip('.')
                    product_details.append(text)
            fabric_care_element = details_elements[2].find_all('li')
            fabric_care = []
            for detail in fabric_care_element:
                text = unicodedata.normalize(
                    "NFKD",
                    detail.text
                ).rstrip('. ')
                fabric_care.append(text)
        else:
            fit_sizing = ['Not applicable']
            product_details_element = details_elements[0].find_all('li')
            product_details = []
            for detail in product_details_element:
                if '#' not in detail.text and 'P.A.C.E.' not in detail.text and 'pace' not in detail.text:
                    text = unicodedata.normalize(
                        "NFKD",
                        detail.text
                    ).rstrip('.')
                    product_details.append(text)
            fabric_care_element = details_elements[1].find_all('li')
            fabric_care = []
            for detail in fabric_care_element:
                text = unicodedata.normalize(
                    "NFKD",
                    detail.text
                ).rstrip('. ')
                # append each fabric and care item once
                fabric_care.append(text)
    except:
        fit_sizing = ['Not available']
        product_details = ['Not available']
        fabric_care = ['Not available']
    return [fit_sizing, product_details, fabric_care]
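This function extracts three groups of details about the product: fit & sizing, product details, and fabric & care instructions. Each group is a ul element whose class attribute starts with “product-information-item__list”. When the page contains three such lists, they are assigned to the three groups in that order; otherwise, fit & sizing is treated as missing and the first two lists are read as product details and fabric & care. List items that mention ‘wearing’ (in fit & sizing) or contain ‘#’, ‘P.A.C.E.’ or ‘pace’ (in product details) are skipped.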
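The three-or-two split can be illustrated without any HTML. This is a minimal sketch under stated assumptions: assign_sections is a hypothetical helper that takes already-extracted lists of strings, and it treats any count other than two or three as an unexpected layout, whereas the function above only distinguishes "three" from "anything else" and relies on its exception handler when fewer than two lists exist.

```python
def assign_sections(sections):
    """Hypothetical helper: map the scraped lists onto the three detail groups."""
    if len(sections) == 3:
        fit_sizing, product_details, fabric_care = sections
    elif len(sections) == 2:
        # no fit & sizing list on the page (typical for accessories)
        fit_sizing = ['Not applicable']
        product_details, fabric_care = sections
    else:
        # unexpected layout: nothing can be assigned reliably
        fit_sizing, product_details, fabric_care = (
            ['Not available'], ['Not available'], ['Not available']
        )
    return [fit_sizing, product_details, fabric_care]


# a garment page with all three lists
print(assign_sections([['Relaxed fit'], ['Crewneck'], ['100% cotton']]))
# an accessory page with only product details and fabric & care
print(assign_sections([['Canvas tote'], ['Spot clean']]))
```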
Unicode normalization is performed on the obtained text so that characters which can be encoded in more than one way are stored in a single standard form.
We use the ‘NFKD’ normalization form, which decomposes characters into their base parts and replaces compatibility characters with their plain equivalents. In the case of é, it is decomposed into e followed by a separate combining acute accent, and a no-break space becomes an ordinary space.
In our case we encounter ‘\u200b’, which represents the “ZERO WIDTH SPACE” character, which isn’t visible and is primarily used for controlling line breaks and word boundaries. This character has no decomposition, so NFKD on its own leaves it in the text; removing it takes an explicit step such as text.replace('\u200b', '').
Most products belonging to the accessories category don’t have any fit and sizing information, and in such cases we use the value ‘Not applicable’.
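The behaviour of NFKD on these characters can be checked directly with the standard library. The snippet below is a small sketch; clean_text is a hypothetical helper name and the sample strings are made up for illustration, not taken from Gap.com.

```python
import unicodedata


def clean_text(raw):
    """Hypothetical helper: NFKD-normalize, drop zero width spaces, trim trailing '. '."""
    normalized = unicodedata.normalize("NFKD", raw)
    return normalized.replace("\u200b", "").rstrip(". ")


# é (U+00E9) decomposes into 'e' followed by the combining acute accent U+0301
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print([hex(ord(ch)) for ch in decomposed])

# a no-break space (U+00A0) becomes an ordinary space
print(unicodedata.normalize("NFKD", "100%\u00a0cotton") == "100% cotton")

# the zero width space has no decomposition, so NFKD leaves it in place
print("\u200b" in unicodedata.normalize("NFKD", "Slim\u200b fit"))

# combining normalization with an explicit replace removes it
print(clean_text("Slim\u200b fit. "))
```

If accents should be dropped entirely rather than stored as separate combining marks, the combining characters would have to be filtered out after normalization; the scraper shown here does not do that.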
Main Program Flow - data storage
This part also takes place in the main function.
After the data scraping is done, it has to be saved for proper analysis. For this, we store the data in a CSV file.
After obtaining the div element, we initialize a pandas DataFrame, a data structure used to work with tabular data.
df = pd.DataFrame(columns=['Product_URL', 'Product_Type', 'Product_Name', 'Selling_Price', 'Max_Retail_Price', 'Rating', 'Rating_Count', 'Color', 'Available_Sizes', 'Fit_Sizing', 'Product_Details', 'Fabric_Care'])
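The information obtained from each product page is then added to the initialized DataFrame as a new row during the corresponding iteration. The list-valued fields (available sizes and the three detail groups) are joined into comma-separated strings first.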
df.loc[index] = [pdp_url, product_type, product_name, selling_price, max_retail_price, star_value, ratings_count, color, ', '.join(available_sizes), ', '.join(details[0]), ', '.join(details[1]), ', '.join(details[2])]
After iterating through the 288 products, we write the DataFrame to the CSV file, print a confirmation message, and close the Chrome WebDriver instance using the driver.quit() command.
df.to_csv(CSV_FILENAME, index=False)
print(f"Data written to {CSV_FILENAME}")
driver.quit()
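To see what one stored row looks like without running the scraper, the same twelve-column layout can be reproduced with Python's built-in csv module. This is a sketch that swaps pandas for the standard library purely for illustration; build_row and rows_to_csv are hypothetical helpers, and the sample product values are placeholders rather than scraped data.

```python
import csv
import io

# column names taken from the DataFrame definition above
COLUMNS = ['Product_URL', 'Product_Type', 'Product_Name', 'Selling_Price',
           'Max_Retail_Price', 'Rating', 'Rating_Count', 'Color',
           'Available_Sizes', 'Fit_Sizing', 'Product_Details', 'Fabric_Care']


def build_row(scalars, available_sizes, details):
    """Hypothetical helper: flatten one product into a 12-column row.

    scalars holds the first eight single-valued fields in column order;
    available_sizes and each group in details are lists of strings.
    """
    joined_details = [', '.join(group) for group in details]
    return list(scalars) + [', '.join(available_sizes)] + joined_details


def rows_to_csv(rows):
    """Write the header and rows to an in-memory CSV and return its text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# placeholder values, not real Gap data
sample = build_row(
    ['https://example.com/product', 'Shirts', 'Sample Shirt', 29.99, 39.99, 4.5, 12, 'Blue'],
    ['S', 'M', 'L'],
    [['Relaxed fit'], ['Crewneck', 'Short sleeves'], ['100% cotton', 'Machine wash']],
)
csv_text = rows_to_csv([sample])
print(csv_text)
```

Because the joined cells contain commas, the writer wraps them in double quotes, so each list still occupies a single column when the file is read back.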
Want access to the code? See it in our GitHub repo: How to scrape product information from Gap
Conclusion
For individuals exploring web scraping tools or APIs to extract product information from Gap, this script serves as a valuable resource. The extracted data can be used for in-depth analysis of market trends and consumer preferences. Stay tuned for our next blog, where we analyze this data and share some exciting insights.
Whether you're focused on Gap or any other e-commerce platform, Datahut is here to streamline your web scraping needs. Reach out to us and explore how we can empower your business through data-driven insights!