Scraping Marks & Spencer Nightwear Products: Analysis and Visualization

Marks & Spencer is a renowned British retailer known for its diverse product range, which includes high-quality clothing, premium food and groceries, home and living products, beauty and cosmetics, and a commitment to sustainability and social responsibility. This unique combination of offerings has made M&S a popular choice for consumers in the UK and beyond.

In this blog post, we will explore Marks & Spencer's nightwear product range and gain an understanding of the following aspects:

The availability of different brands and the count of products in each brand.
Most trusted and reviewed product brands.
Top brands' average ratings with respect to their product counts.
The average discount on the brands that are on sale.

What is web scraping?

Web scraping is a sophisticated method for extracting valuable information from websites. While it's possible to collect data manually, the process is often laborious, time-consuming, and prone to errors. In contrast, web scraping offers a faster, more efficient, and highly accurate way to automate this task.

One of the most significant advantages of web scraping is its ability to capture non-tabular or poorly structured data from websites and transform it into a structured, usable format. This can include converting the extracted data into formats like .csv files or spreadsheets, making it readily accessible for analysis and other applications.

Web scraping is not only a means of data collection but also a powerful tool for archiving data and monitoring changes to online data sources. By automating the process, users can keep track of dynamic web content and ensure they always have up-to-date information at their fingertips.


Data collection

To acquire preliminary information regarding nightwear products, we utilized our proprietary data scraping platform to collect publicly available data from Marks and Spencer. Python and the BeautifulSoup library were used for extracting the data through web scraping.

The extracted data was saved to a CSV file using the Pandas library for easy manipulation and analysis.
The CSV file was integrated into an SQLite database to enhance data management.
Column field data types in the database were adjusted to match the respective columns.
The SQLite database was connected to Metabase, a business intelligence and analytics tool.
Metabase was utilized to create analytics charts by adding relevant fields and applying filters.
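
To make the CSV-to-SQLite step more concrete, here is a minimal sketch of one way it could be done with Pandas and Python's built-in sqlite3 module. The table name, column names, and sample rows are placeholders of ours rather than the actual schema, and an in-memory database stands in for the real database file.

```python
import sqlite3

import pandas as pd

# Hypothetical rows standing in for the scraped CSV; in practice this
# would come from something like pd.read_csv("mas.csv").
df = pd.DataFrame(
    {
        "brand": ["Brand A", "Brand B"],
        "selling_price": ["25.00", "18.50"],
        "reviews": ["12", "3"],
    }
)

# Adjust column types before loading so SQLite stores numbers, not text.
df["selling_price"] = df["selling_price"].astype(float)
df["reviews"] = df["reviews"].astype(int)

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
df.to_sql("products", conn, if_exists="replace", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(row_count)
```

Once the table exists in a database file, a tool such as Metabase can be pointed at that file to build charts.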

Attributes Scraped

The following attributes were extracted for each product in the Marks & Spencer women's lingerie and nightwear listings:

Brand Name: The name of the brand
Title: The product name, which indicates its category or type
Product URL: The URL of the product
Average rating: The average customer rating of the product
Product Code: Product code specified for each product
Reviews: Number of reviews each product received
Selling Price: The current selling price of the product
Original price: The maximum retail price of the product
Discount: Difference between the original price and the selling price
Sales status: Whether the product is currently on sale
Composition: The materials used to make the product

This Python script is designed to scrape lingerie and nightwear product data from the Marks & Spencer (M&S) website. It uses several libraries, including requests, BeautifulSoup, pandas, and regular expressions, to extract, clean up, and export product information from web pages.

Requests is a popular Python library for making HTTP requests to web pages. In the script, it is used to send HTTP GET requests to URLs, allowing the retrieval of the HTML content of web pages. Once the content is obtained, it can be parsed using BeautifulSoup, enabling the extraction of specific data or information from the web pages. This combination of Requests and BeautifulSoup is a common approach in web scraping and data extraction tasks in Python, making it a powerful tool for interacting with web resources.

Beautiful Soup is a versatile library for parsing HTML and XML documents. In the script, it plays a crucial role in handling the HTML content of web pages retrieved through Requests. By using BeautifulSoup, the script can effectively navigate and parse the structure of these web pages. This capability empowers the script to locate and extract specific elements, data, and information within the HTML, facilitating the extraction of valuable data from web resources. Whether it's scraping data from websites or performing web-related data processing tasks, BeautifulSoup is a valuable tool for developers working with HTML and XML documents in Python.
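
As a rough illustration of how Beautiful Soup locates elements, the sketch below parses a small hand-written HTML snippet. The tag names, class names, and the extract_text helper are invented for this example; the real M&S markup and the script's own helpers will differ.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a product page (hypothetical markup).
html = """
<div class="product">
  <p class="brand-title">Example Brand</p>
  <h1 class="product-name">Cotton Pyjama Set</h1>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_text(soup, tag, class_name, default="Not available"):
    """Return the stripped text of the first matching element, or a default."""
    element = soup.find(tag, class_=class_name)
    return element.get_text(strip=True) if element else default


brand = extract_text(soup, "p", "brand-title")
name = extract_text(soup, "h1", "product-name")
colour = extract_text(soup, "span", "colour")  # not present in the snippet

print(brand, "|", name, "|", colour)
```

Returning a default such as "Not available" when an element is missing keeps one incomplete page from stopping the whole run.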

Pandas is a data manipulation library for Python whose central data structure is the DataFrame. In the script, Pandas stores and organizes the scraped product data in a DataFrame, which simplifies cleaning, analysis, and transformation. Its ability to export data to common formats such as CSV makes the results easy to reuse in other tools.

The 're' module is Python's built-in library for regular expressions, which are powerful tools for pattern matching and text manipulation. In the script, 're' extracts specific pieces of information from text strings taken from web pages, such as the number of reviews embedded in a longer string. Regular expressions let the script define precisely the text pattern it needs and pull out just that value for further analysis and processing.
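
For instance, pulling a review count out of a label could be sketched as follows. The function name and the sample strings are our own assumptions about what such text might look like.

```python
import re


def parse_review_count(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop thousands separators such as the comma in "1,234".
    return int(match.group().replace(",", ""))


print(parse_review_count("(1,234 reviews)"))
print(parse_review_count("No reviews yet"))
```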

The 'os' library, an integral part of Python, serves as a valuable tool for interacting with the operating system, encompassing a wide array of file and directory operations. Within the script, 'os' takes on a pivotal role in various aspects, including the management of the CSV file housing the scraped data. It is used to check whether a file already exists, enabling the script to determine whether it needs to create a new file or append data to an existing one.

Additionally, 'os' is instrumental in handling file operations, such as creating, replacing, or appending to the CSV file as the script demands. This functionality enhances the script's ability to efficiently manage and organize the data it collects and makes it easier to work with the resulting datasets for further analysis or applications.
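
The existence check could look something like the sketch below. The prepare_output_file helper is hypothetical, and a temporary directory is used so that the example does not touch any real files.

```python
import os
import tempfile


def prepare_output_file(path):
    """Remove an existing file so a fresh one can be written; report the action."""
    if os.path.exists(path):
        os.remove(path)
        return "replaced"
    return "created"


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "mas.csv")

    first_run = prepare_output_file(csv_path)  # nothing there yet
    with open(csv_path, "w", encoding="utf-8") as handle:
        handle.write("brand,reviews\n")
    second_run = prepare_output_file(csv_path)  # file now exists

print(first_run, second_run)
```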

The 'RequestException' class, which is part of the 'requests.exceptions' module, is a specialized exception class offered by the requests library. Its primary purpose is to facilitate the handling of exceptions associated with HTTP requests. In the script, 'RequestException' is a key component in error management when making HTTP requests. It is used to catch and handle a range of potential exceptions that can arise during these requests, including issues like network errors or timeouts.

By employing 'RequestException,' the script is equipped to gracefully address and manage errors that might occur, ensuring that it can maintain robust execution even in the face of unexpected issues. This error-handling capability is essential for maintaining the script's reliability and ability to collect data from web sources effectively.
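
A retry loop built around RequestException might be sketched like this. It is a simplified stand-in rather than the script's actual function: it returns the raw HTML text instead of a parsed document, and the `get` and `delay` parameters are additions of ours that make the function easy to exercise without touching the network.

```python
import time

import requests
from requests.exceptions import RequestException


def fetch_with_retries(url, headers, max_retries=3, get=requests.get, delay=1.0):
    """Return the response body, or None if every attempt fails."""
    for attempt in range(1, max_retries + 1):
        try:
            response = get(url, headers=headers, timeout=10)
            response.raise_for_status()  # treat HTTP 4xx/5xx as failures
            return response.text
        except RequestException as exc:
            print(f"Attempt {attempt} of {max_retries} failed: {exc}")
            if attempt < max_retries:
                time.sleep(delay)  # brief pause before trying again
    return None
```

Because connection errors, timeouts, and HTTP errors raised by raise_for_status() all derive from RequestException, a single except clause covers them.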

In summary, these libraries are essential for various aspects of web scraping, data extraction, data manipulation, and error handling in the script.

Defining Data Storage

Several lists are initialized to store information about each product, including its URL, company, product name, product code, average rating, reviews, selling price, original price, saved price, color, sales status, styles, and composition.

Functions

The script defines several functions for specific tasks, such as extracting HTML content from a URL, extracting the number of pages to scrape, extracting product URLs from pages, and extracting various product details like company, product name, etc.

Main Function

The main function within the script serves as its central entry point, orchestrating the entire web scraping process. It begins by defining the essential elements: the base URL and the required headers for sending HTTP requests to the M&S website. Subsequently, it initiates the data collection process by first utilizing the extract_href_values function to retrieve the HTML content of the primary product page.

Once this initial data is obtained, the main function calculates the total number of available product pages and then generates the corresponding URLs for each page. With the list of page URLs in hand, it leverages the extract_product_url function to extract the individual product URLs from these pages.

Finally, the script advances to the core task of data extraction, using the fetch_product_details function to scrape and capture specific details of the products. In summary, the main function acts as the command center for the script, systematically executing each step of the web scraping process, ultimately resulting in the retrieval of product data from the M&S website.
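
The page-URL generation step might look like the following. The `?page=N` query pattern, the helper name, and the example base URL are assumptions made for illustration; the real site's pagination scheme may differ.

```python
def build_page_urls(base_url, total_pages):
    """Build one listing URL per page, numbered from 1 to total_pages."""
    return [f"{base_url}?page={page}" for page in range(1, total_pages + 1)]


page_urls = build_page_urls("https://example.com/l/nightwear", 3)
for url in page_urls:
    print(url)
```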

Scraping Product Details

Inside the fetch_product_details function, the script iterates through each product URL, sends an HTTP request to the product page, and extracts details such as the company, product name, product code, average rating, reviews, prices, color, sales status, and composition. These details are stored in the respective lists.

Exporting Data

Upon completion of the product data scraping process, the script utilizes the Pandas library to structure and organize the collected information within a DataFrame. Following this organization, the script performs a file operation check to ascertain if a file named 'mas.csv' already exists.

If such a file is present, it is replaced with the updated data; otherwise, the script creates a new CSV file. Subsequently, the final DataFrame, now containing the meticulously gathered product details, is promptly saved to the 'mas.csv' file. This process of exporting the data to a CSV file ensures that the collected information is neatly stored and readily accessible for further analysis, reporting, or any other data-related tasks.
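
A minimal sketch of the export step is shown below. The list contents and column names are placeholders, and the file is written to a temporary directory rather than the working directory so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

# Placeholder lists standing in for the values collected during scraping.
brands = ["Brand A", "Brand B"]
product_names = ["Cotton Pyjama Set", "Satin Nightdress"]
selling_prices = [25.0, 18.5]

df = pd.DataFrame(
    {
        "brand": brands,
        "product_name": product_names,
        "selling_price": selling_prices,
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "mas.csv")
    if os.path.exists(csv_path):
        os.remove(csv_path)  # replace any earlier export
    df.to_csv(csv_path, index=False)  # index=False omits the row-number column
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```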

Execution

The script checks whether it is being executed directly (not imported as a module) and, if so, calls the main function to start the scraping process.
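
This is the standard Python entry-point idiom. In the sketch below, main is only a placeholder that returns a status string; the real function runs the whole scraping workflow.

```python
def main():
    """Placeholder for the scraping workflow described above."""
    return "scrape finished"


# True only when the file is run directly, not when it is imported.
if __name__ == "__main__":
    print(main())
```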

In summary, this Python script scrapes information about lingerie and nightwear products from the M&S website, extracts various details about each product, and exports the data to a CSV file for further analysis or storage.

Extracting Product Information

A series of functions are defined within the script to extract various pieces of information from product pages. These functions work collectively to gather details such as brand name, product name, prices, ratings, the count of ratings, colors, and various other product specifics. Here's an explanation of each of these functions:

The extract_href_values(url, headers, max_retries=3) function accepts a URL, headers, and an optional maximum retry count. It sends a GET request to the URL with the specified headers and, if a network or request-related error occurs, retries up to max_retries times (three by default). It then parses the response with Beautiful Soup and returns the parsed HTML content.
The extract_number_of_pages(soup) function takes the parsed HTML content (soup) as input and seeks a specific element in the HTML that contains information about the number of pages. It extracts and returns the maximum number of pages as an integer.
The extract_product_url(baseUrl, pageUrls, headers) function requires a base URL, a list of page URLs, and headers as input. It iterates through each page URL, sends requests, and extracts product URLs from the HTML content of each page. It returns a list of product URLs.
For brand name extraction, the extract_company function takes the parsed HTML content (soup) as input and attempts to find the brand or company name using specific classes. It returns the name as a string if found, otherwise "Not available."
The extract_product_name function, working with the parsed HTML content (soup), searches for an element containing the product name. If found, it returns the name as a string; otherwise, it returns "Not available."
To extract the product code, the extract_product_code function takes the parsed HTML content as input, searching for an element that contains the product code. It returns the product code as a string if found, or "Not available" if not.
The extract_avg_rating function extracts the average rating from the parsed HTML content (soup). It returns the average rating as a float if found, otherwise None.
For the number of reviews, the extract_reviews function searches for an element in the parsed HTML content (soup) that contains the review count. It returns the number of reviews as an integer if found, or None if not.
The extract_selling_price function focuses on extracting the selling price, removing the currency symbol if present. It takes the parsed HTML content as input and returns the selling price as a float.
The extract_original_price function, working with the parsed HTML content, searches for the original price of a product and returns it as a float if found. In case of absence, it falls back to the selling price.
To extract information about the saved price (discount), the extract_saved_price function looks for the relevant element in the parsed HTML content and returns the saved price as a float if found; otherwise, it returns None.
The extract_color function, using the parsed HTML content, finds the element containing product color information. It returns the color as a string if found, or "Not available" if not.
For sales status, the extract_sales_status function searches for an element indicating the sales status, such as "In stock" or "Out of stock." It returns the sales status as a string if found, or "Not available" if not.
Finally, the extract_composition function, based on the parsed HTML content, seeks an element with information about the product's composition or materials. It returns the composition as a string if found, or "Not available" if not.
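
To illustrate the fallback behaviour described for the price functions, here is a hedged sketch. The function names follow the descriptions above, but the bodies, the parse_price helper, and the class names in the sample HTML are our own guesses, not the script's actual code or the site's real markup.

```python
from bs4 import BeautifulSoup


def parse_price(text):
    """Strip the currency symbol and separators, then convert to float."""
    return float(text.replace("£", "").replace(",", "").strip())


def extract_selling_price(soup):
    element = soup.find("span", class_="selling-price")
    return parse_price(element.get_text()) if element else None


def extract_original_price(soup):
    element = soup.find("span", class_="original-price")
    if element:
        return parse_price(element.get_text())
    # No original price on the page: fall back to the selling price.
    return extract_selling_price(soup)


on_sale = BeautifulSoup(
    '<span class="selling-price">£20.00</span>'
    '<span class="original-price">£25.00</span>',
    "html.parser",
)
full_price = BeautifulSoup('<span class="selling-price">£18.50</span>', "html.parser")

print(extract_original_price(on_sale), extract_original_price(full_price))
```

With this fallback, a product that is not discounted ends up with identical original and selling prices, so its computed discount is zero.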

Collectively, these functions enable the script to systematically extract a wide range of product details from web pages, facilitating comprehensive data collection and analysis.

Storing product information

As each product page is processed, the extracted values are appended to the lists described earlier. Once scraping finishes, the lists are combined into a Pandas DataFrame and exported to the 'mas.csv' file for further analysis or storage.

Exploring Insights from Extracted Data

Analysis of the brands based on product count

A comparison of the brands by product count reveals a clear hierarchy and considerable diversity.

"M&S Collection" leads the pack with an impressive count of 133 products, signifying Marks & Spencer's extensive offerings across various categories.
"Cyberjammies" closely follows with 120 products, highlighting its specialization in sleepwear and pajama products.
"Body" maintains a notable count of 42 products, indicating a focus on bodywear and related items.
The tie between "DKNY" and "M&S X GHOST," each with 21 products, reflects a balance between a globally recognized brand with diverse offerings and a unique collaborative brand with stylish, distinctive products.
"Seasalt Cornwall" offers a moderate selection of 10 products, likely tailored to a specialized theme.
Brands like "FatFace" and "Rosie" have equal counts of 8 products, catering to casual clothing and possibly lingerie or sleepwear, respectively.
"Nobody's Child" presents a smaller but notable presence with 7 products, suggesting unique offerings.
"Wacoal" specializes in lingerie and undergarments with 5 products.
Finally, brands like "Boutique," "Fantasie," and "Spencer Bear™" have lower counts, possibly indicating specialized or limited availability.
"Hotmilk," "Elomi," and "Triumph" each have 2 products, suggesting a niche presence, while "Autograph," "Kate Spade," and "Percy Pig™" with the lowest counts may signify limited availability or specialization in unique product categories.

This analysis showcases the dataset's brand diversity and product range, offering insight into how brands are positioned within M&S's nightwear listings.
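
The charts in this post were built in Metabase, but an equivalent count per brand can be sketched in Pandas. The column name and the tiny sample below are placeholders, not the scraped data.

```python
import pandas as pd

# Placeholder sample; the real dataset has one row per scraped product.
df = pd.DataFrame(
    {"brand": ["Brand A", "Brand A", "Brand B", "Brand A", "Brand C", "Brand B"]}
)

# value_counts() returns counts sorted from most to least frequent.
brand_counts = df["brand"].value_counts()
print(brand_counts)
```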

Most trusted and reviewed product brands

1. M&S Collection

M&S Collection stands out with the highest sum of reviews, indicating a significant level of customer engagement and popularity. This suggests that products under this brand have garnered substantial attention and feedback.

2. Body

Body follows with a respectable sum of reviews, indicating that the brand's products have also generated considerable customer feedback. This suggests a strong customer interest in bodywear and related items.

3. Cyberjammies

Cyberjammies has received a moderate number of reviews, suggesting a reasonable level of customer engagement. This is particularly significant for a brand specializing in sleepwear and pajamas.

4. Spencer Bear™

Spencer Bear™ has received noteworthy reviews, indicating an engaged audience. This might suggest that products associated with this brand have a loyal following or unique appeal.

5. Rosie

Rosie has received a moderate number of reviews, suggesting a level of interest and engagement, possibly related to its offerings, which may include lingerie or sleepwear.

6. Percy Pig™

Percy Pig™ has received a modest number of reviews. Given that it is not a clothing brand but likely represents a specific product line or collaboration, this level of feedback may be reasonable.

7. M&S X GHOST

M&S X GHOST has garnered a modest sum of reviews, considering that it represents a collaboration between M&S and the Ghost brand. This suggests a level of interest in the unique products associated with this brand.

8. Wacoal

Wacoal has received a relatively lower sum of reviews, indicating a moderate level of customer feedback for a brand specializing in lingerie and undergarments.

9. Boutique

Boutique has a modest sum of reviews, suggesting a level of engagement with its product offerings, which might be more specialized or limited.

In summary, the comparative analysis based on the sum of reviews highlights the varying levels of customer engagement and popularity among the brands. M&S Collection and Body have the highest and second-highest sums of reviews, respectively, indicating their strong presence and customer appeal. Other brands have received varying degrees of feedback, suggesting differences in customer interest and the types of products offered by each brand. This data can be valuable for assessing each brand's performance and customer engagement.
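
Since the data lives in SQLite, the sum of reviews per brand can also be expressed as a SQL query. The sketch below runs one through Python's sqlite3 module against an in-memory table; the table name, column names, and rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (brand TEXT, reviews INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Brand A", 40), ("Brand B", 15), ("Brand A", 25), ("Brand C", None)],
)

# COALESCE turns a brand with no recorded reviews into 0 instead of NULL.
rows = conn.execute(
    """
    SELECT brand, COALESCE(SUM(reviews), 0) AS total_reviews
    FROM products
    GROUP BY brand
    ORDER BY total_reviews DESC
    """
).fetchall()

for brand, total in rows:
    print(brand, total)
```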

Top brands' average ratings with respect to their product counts

Body stands out with the highest average rating, indicating strong customer satisfaction despite a moderate product count.
M&S Collection has the largest product range and a good average rating, suggesting a wide variety of products that generally meet customer expectations.
M&S X GHOST maintains a high average rating, indicating that their collaborative products are well-received.
Seasalt Cornwall offers a limited product range but maintains a respectable average rating.
Cyberjammies has a substantial product count but may need to work on improving customer satisfaction.
DKNY has a moderate product count but faces challenges with the lowest average rating, indicating areas for improvement in customer satisfaction.

This analysis can help companies understand their performance relative to their product range and customer feedback, guiding them in making strategic decisions for improvement or expansion.
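
Putting product count and average rating side by side is a single grouped aggregation. The sketch below uses Pandas with placeholder column names and made-up rows.

```python
import pandas as pd

# Made-up rows; a missing rating (None) is skipped when averaging.
df = pd.DataFrame(
    {
        "brand": ["Brand A", "Brand A", "Brand B", "Brand B", "Brand B"],
        "product_code": ["P1", "P2", "P3", "P4", "P5"],
        "avg_rating": [4.5, 4.0, 3.0, None, 4.0],
    }
)

summary = (
    df.groupby("brand")
    .agg(product_count=("product_code", "count"), avg_rating=("avg_rating", "mean"))
    .sort_values("product_count", ascending=False)
)
print(summary)
```

Reading the two columns together helps avoid over-interpreting a high average that rests on only a handful of products.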

The average discount on the brands that are on sale

Nobody's Child offers the highest average discount, making its discounted items particularly attractive to customers seeking significant savings.
Kate Spade has a single discounted product with a notable discount, so its average rests on limited data.
Cyberjammies provides a range of discounted products with a moderate average discount, likely catering to various preferences.
Boutique has the lowest average discount and a limited product range, suggesting smaller savings on its discounted products than other brands offer.

This analysis shows how discounting varies from brand to brand, catering to different customer preferences for savings and product variety.
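
The average discount per brand can be computed as the mean difference between original and selling price over discounted products only. In this Pandas sketch the column names and rows are placeholders, and a product counts as "on sale" when its discount is positive, which is our simplification of the scraped sales status field.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "brand": ["Brand A", "Brand A", "Brand B", "Brand C"],
        "original_price": [40.0, 30.0, 20.0, 25.0],
        "selling_price": [30.0, 24.0, 18.0, 25.0],
    }
)

# Discount = original price minus selling price.
df["discount"] = df["original_price"] - df["selling_price"]

# Keep only discounted products, then average per brand.
on_sale = df[df["discount"] > 0]
avg_discount = on_sale.groupby("brand")["discount"].mean().sort_values(ascending=False)
print(avg_discount)
```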

Wrapping up

The data analysis provides valuable insights into the pricing strategies, product offerings, customer satisfaction, and engagement levels of different brands. "Nobody's Child" stands out for offering the highest average discount, making its discounted items attractive to budget-conscious shoppers. "Kate Spade" offers a noteworthy discount despite a limited product range. "Cyberjammies" offers a wide range of products with moderate discounts, appealing to a diverse customer base. "Boutique" offers relatively smaller savings on a limited product range.

In terms of brands, "M&S Collection" leads with a vast product range and a respectable average rating, reflecting a diverse product offering with a high level of customer satisfaction. "Cyberjammies" has a significant product count but faces the challenge of lower customer satisfaction. "Body" excels in customer satisfaction, particularly in the bodywear category, with the highest average rating. Smaller brands like "M&S X GHOST," "Seasalt Cornwall," and "Nobody's Child" also contribute to the market with their unique positions and product offerings.

Customer engagement reveals that "M&S Collection" enjoys a strong presence with substantial reviews, while "Cyberjammies" and "Rosie" also engage customers effectively. Smaller brands like "Boutique" and "Wacoal" have fewer reviews, possibly due to their specialized product offerings. This data provides valuable insights for businesses to refine their pricing, product range, and customer relations strategies, enhancing their competitive edge in the market.

Looking to acquire such data insights from your competitors? Reach out to us at Datahut.
