Web scraping is often the first project people tackle after learning the basics of Python. Usually, they start with a static website such as Wikipedia or IMDb, which is a straightforward task.
Scraping data from a dynamic website, however, is harder. Many of our readers requested examples of how to scrape a dynamic website using Python. This is the second example in the tutorial series; you can see the first example here.
ULTA Beauty, America's largest beauty specialty retailer, is turning to technology to make shopping for beauty products a little simpler and more enjoyable. It is building an omnichannel approach to enhance the customer experience, whether customers are buying in person or online, a concept ULTA refers to as "connected beauty." Our goal in this tutorial is to help fragrance enthusiasts explore, discover, and enjoy the enthralling world of scent.
If you do this by manually searching the website, it will take you forever to find every detail about a fragrance. That's where web scraping comes in. Web scraping is a process by which you can extract data from websites and transform it into CSV/JSON files, which helps you analyze and understand each fragrance's scent, brand, price, reviews, and much more.
In this blog, we'll see how to scrape data from Ulta Beauty's website. We'll use Python to scrape the Women's Fragrance data from Ulta Beauty's website and save it as a CSV file. Then, we can analyze the data using Python or another program. We will be extracting the following data attributes from the individual product pages of Ulta Beauty.
Product URL - The URL that gets us to the target Women's Fragrance product page.
Product Name - The name of the Women's Fragrance product.
Brand - The brand of the Women's Fragrance product.
Number of Reviews - The number of reviews for the Women's Fragrance product.
Rating - The rating of the Women's Fragrance product.
Fragrance Description - The description of the Women's Fragrance product.
Details - The details of the Women's Fragrance product, including the Composition, Fragrance Family, Scent Type, Key Notes, and Features of each product.
Ingredients - The ingredients of the Women's Fragrance product.
Web scraping with Python Packages
In this tutorial, we will be using Python to extract data. There are several reasons why Python is a good choice for web scraping:
Python has a large and active community, which means many libraries and frameworks can help you with web scraping. For example, Beautiful Soup is a popular library for parsing HTML and XML documents.
Python is easy to learn and use, especially for those new to programming. The syntax is simple and readable, which makes it a good choice for prototyping and rapid development.
Python is efficient and fast. It can handle large amounts of data and, paired with the right tools, can scrape websites that are heavily loaded with JavaScript, CSS, and other resources.
Python has good support for handling different types of data, such as text, images, and videos. This makes it easy to extract and process data from websites containing various media types.
Python is a versatile language that can be used for many purposes beyond web scraping. This means you can use the skills you learn for web scraping in other areas of programming as well.
Ulta Beauty is a dynamic website whose contents are loaded dynamically, so we need a real browser to scrape the data. Selenium, the tool we use here, can drive Chrome as a headless web browser.
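The code in this tutorial launches a regular, visible Chrome window, but the same scraper can run headlessly. Here is a minimal sketch, assuming Selenium 4.6+ (which downloads a matching ChromeDriver automatically), of how you could create the driver in headless mode instead:
# Minimal headless-Chrome sketch (assumes Selenium 4.6+, which fetches ChromeDriver itself)
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")           # new headless mode; use "--headless" on older Chrome
options.add_argument("--window-size=1920,1080")  # a realistic viewport helps dynamic pages render fully
driver = webdriver.Chrome(options=options)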
Importing Libraries:
The first step is importing the required libraries. Here we use a mix of BeautifulSoup and Selenium to scrape the data, so we first import BeautifulSoup, the Selenium webdriver (along with its By locator and NoSuchElementException), lxml's ElementTree (etree), pandas, and standard-library helpers such as re and time.
We've also used the Beautiful Soup and etree libraries here. Beautiful Soup parses HTML into an easily machine-readable tree format so DOM elements can be extracted quickly; it lets you pull out, for example, a specific paragraph or table element with a given HTML ID or class. etree, from the lxml package, is a Python library for parsing and generating XML and HTML. It is a faster alternative to the standard ElementTree package and lets you parse, generate, validate, and otherwise manipulate documents, including querying them with XPath expressions.
Selenium is a tool designed to automate web browsers. It is very useful for web scraping because of automation capabilities such as clicking specific form buttons, inputting information into text fields, and extracting DOM elements from the browser's HTML.
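To make that division of labor concrete, here is a tiny self-contained illustration (the HTML snippet is made up) of the BeautifulSoup-to-etree bridge used later in this tutorial:
# Toy example: parse HTML with BeautifulSoup, then hand it to lxml for XPath queries
from bs4 import BeautifulSoup
from lxml import etree as et

html = "<p class='prod-desc'><a href='/product/1'>Perfume</a></p>"
soup = BeautifulSoup(html, 'html.parser')
dom = et.HTML(str(soup))
print(dom.xpath("//p[@class='prod-desc']/a/@href"))  # ['/product/1']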
These are the packages required to extract data from an HTML page.
import re
import time
import warnings
import pandas as pd
from typing import List
from lxml import etree as et
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
warnings.filterwarnings('ignore')
# Download a matching ChromeDriver (if needed) and start Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Here, ChromeDriver is handled automatically: the ChromeDriverManager library downloads and manages a ChromeDriver executable that matches your installed Chrome, and Selenium's webdriver uses it to control the browser.
To scrape or extract data, you first need to know where that data is located, so locating website elements is one of the essential skills of web scraping. There are a few standard ways to find a specific element on a page: you can search by tag name, filter by a specific HTML class or HTML ID, or use CSS selectors or XPath expressions. As usual, the easiest way to work out which locator to use is to open your Chrome dev tools and inspect the element you need.
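For illustration, here is what those locator strategies look like in Selenium; the class names, selectors, and XPath below are placeholders rather than Ulta's actual markup:
# Illustrative locator strategies; the selectors themselves are made up
heading = driver.find_element(By.TAG_NAME, 'h1')                 # by tag name
desc = driver.find_element(By.CLASS_NAME, 'prod-desc')           # by HTML class (hypothetical)
link = driver.find_element(By.CSS_SELECTOR, 'p.prod-desc > a')   # by CSS selector (hypothetical)
price = driver.find_element(By.XPATH, "//span[@class='price']")  # by XPath (hypothetical)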
Extraction of Page Links:
The second step is extracting the result page links. The Women's Fragrance products are spread across several pages of search results, so we need to go from one page to the next to see the remaining products.
So first, we want to collect the URLs of the different pages of search results. There are six result pages, so starting from the base URL we can scrape the URL of each page. A while loop is used to iterate through the search result pages. The loop starts by navigating to the current URL using the driver.get() method. It then obtains the page's HTML source code from driver.page_source and parses it using the BeautifulSoup library. In other words, we open the website with Selenium and parse the page's main content with BeautifulSoup.
# Collect the URL of every result page, starting from the base URL
def get_page_urls(url):
    page_urls = [url]
    while url:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        pager = soup.find('li', class_='next-prev floatl-span')
        next_page = pager.find('a', class_='next') if pager else None
        if next_page:
            url = "https://www.ulta.com" + next_page['href']
            page_urls.append(url)
        else:
            url = None
    # Keep the driver open; it is reused in the following steps
    return page_urls
# Ulta website link
url = "https://www.ulta.com/womens-fragrance?N=26wn"
page_urls = get_page_urls(url)
We need the URL of every page. After the while loop finishes executing, each page URL has been stored in the page_urls list. Here we used the HTML class to locate the elements.
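As a quick sanity check, we can print what we collected (the exact URLs depend on the live site):
# Quick sanity check; the actual URLs depend on the live site
for page_url in page_urls:
    print(page_url)
print(f"{len(page_urls)} result pages found.")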
Extraction of Product Links:
The next step is extracting product links from the result pages. Using the page links extracted in the second step, we can easily extract the product links from the corresponding pages. Here, the page_urls variable contains the list of page links from which we want to scrape product links. The code iterates through each page link in the list and uses the web driver to navigate to that page. It then uses BeautifulSoup to parse the HTML of the page and extract all product links.
# Fetching all resulting product links
def get_product_links(page_urls: List[str]) -> List[str]:
    product_links = []
    for url in page_urls:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        links = ["https://www.ulta.com" + row.a['href'] for row in soup.find_all('p', class_='prod-desc')]
        product_links.extend(links)
    return product_links

product_links = get_product_links(page_urls)
# Indicate scraping completion
print("Got All Product Links! There are {} products in total.".format(len(product_links)))
We need the URL of every Women's Fragrance product, so we loop over the result pages, collect each product URL, and store these product links in the product_links list. Here, too, we used the HTML class to locate the elements.
Creating Dataframe to Store the Data:
The next step is to create a dataframe to store the extracted data. Here we are creating a dataframe with nine columns: Product URL, Brand, Product Name, Number of Reviews, Rating, Price, Fragrance Description, Details, and Ingredients.
# Creating a dictionary of the required columns;
# Product_url is seeded with the links collected above,
# the other columns start empty and are filled in later
data = {
    'Product_url': product_links,
    'Brand': None,
    'Product_name': None,
    'Number_of_reviews': None,
    'Details': None,
    'Star_rating': None,
    'Price': None,
    'Fragrance Description': None,
    'Ingredients': None
}
# Creating a dataframe with those columns (one row per product)
df = pd.DataFrame(data)
Information Extraction:
In this step, we will identify the desired attributes on Ulta Beauty's website and extract the Product Name, Brand, Number of Reviews, Rating, Fragrance Description, Details, and Ingredients of each product.
def extract_content(url):
    driver.get(url)
    page_content = driver.page_source
    product_soup = BeautifulSoup(page_content, 'html.parser')
    dom = et.HTML(str(product_soup))
    return dom
The extract_content() function scrapes the content of the web page at the specified URL using the Selenium web driver. The content is then parsed with the BeautifulSoup library and returned as an lxml element object.
We pass the URL as an argument, load the page with the Selenium web driver, and store the page source in page_content. We create the product_soup variable by parsing page_content with BeautifulSoup, and then build the dom with lxml's etree. The function returns the dom, which we can use to extract specific elements from the page with methods like .xpath() and .cssselect().
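For example, once the dom comes back we can query it directly; the XPath below is purely illustrative, not Ulta's real markup:
# Illustrative usage; this XPath is a placeholder, not Ulta's real markup
dom = extract_content(product_links[0])
print(dom.xpath('//h1//text()'))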
Extraction of Brand of the Products:
Here is the function to extract the brand name from the lxml element using an XPath expression. We iterate through the products one by one; each time the loop picks up a URL, we use XPath to find the attribute. Once the attribute is extracted, the data is added to the corresponding column. Sometimes the data comes back in a format like ["Brand"], so we also remove those unwanted characters here.
def Brand(dom):
    brand = dom.xpath('//*[@id="92384e5c-2234-4e8f-bef7-e80391889cfc"]/h1/span[1]/a/text()')
    if not brand:
        brand = 'brand is not available'
    else:
        # Strip the list brackets and quotes left over from str(list)
        brand = re.sub(r'[\[\]\']', '', str(brand))
    df.loc[each_product, 'Brand'] = brand
    return brand
Extraction of Product Name:
Here is the function to extract the product name. It is similar to the Brand() function, but it extracts the product name from the lxml element using a different XPath expression. We iterate through the products one by one; each time the loop picks up a URL, we use XPath to find the attribute. Once the attribute is extracted, the data is added to the corresponding column. Sometimes the data comes back in a format like ["product name"], so we remove those unwanted characters here as well.
def Product_name(dom):
    product = dom.xpath('//*[@id="92384e5c-2234-4e8f-bef7-e80391889cfc"]/h1/span[2]/text()')
    if product:
        product = re.sub(r'[\[\]\'\"]', '', str(product))
        df.loc[each_product, "Product_name"] = product
    else:
        df.loc[each_product, "Product_name"] = "Product name is not available"
    return product
Similarly, we can extract the Number of Reviews, the Star Rating, and the Ingredients.
Number of Reviews of the Products:
def Reviews(dom):
    number_of_reviews = dom.xpath('//*[@id="92384e5c-2234-4e8f-bef7-e80391889cfc"]/div/span[2]/text()')
    if number_of_reviews:
        number_of_reviews = re.sub(r'[\,\(\)\[\]\'\"]', '', str(number_of_reviews))
        df.loc[each_product, "Number_of_reviews"] = number_of_reviews
    else:
        df.loc[each_product, "Number_of_reviews"] = "Number of reviews is not available"
    return number_of_reviews
Star Rating of the Products:
def Star_Rating(dom):
    star_rating = dom.xpath('//*[@id="92384e5c-2234-4e8f-bef7-e80391889cfc"]/div/a/span/text()')
    if star_rating:
        # Keep only digits and the decimal point (drops stray "Q & A" text)
        star_rating = re.sub(r'[^\d.]', '', str(star_rating))
        df.loc[each_product, "Star_rating"] = star_rating
    else:
        df.loc[each_product, "Star_rating"] = "Star rating is not available"
    return star_rating
Ingredients of the Products:
def Ingredients(dom):
    ingredients = dom.xpath("//*[@aria-controls='Ingredients']//p/text()")
    if ingredients:
        ingredients = re.sub(r'[\[\]\']', '', str(ingredients))
        df.loc[each_product, "Ingredients"] = ingredients
    else:
        df.loc[each_product, "Ingredients"] = "Ingredients are not available"
    return ingredients
In the next step we call these functions. The loop iterates over the rows of the dataframe, extracts the Product_url column for each row, and passes it to the extract_content() function to get the page content as an lxml element. It then calls the Brand(), Product_name(), Reviews(), Star_Rating(), and Ingredients() functions on the product_content object to extract the specific data from the page.
for each_product in range(len(df)):
    product_url = df['Product_url'].iloc[each_product]
    product_content = extract_content(product_url)
    Brand(product_content)
    Product_name(product_content)
    Reviews(product_content)
    Star_Rating(product_content)
    Ingredients(product_content)
Extraction of Price of the Products:
Here is the function to extract the price of a product from a web page. We iterate through the products one by one. Sometimes, when we try to extract data with BeautifulSoup alone, it cannot access dynamic content, because BeautifulSoup is just an HTML parser working on a static snapshot of the page. In those cases, we read the element with Selenium instead; Selenium works because it drives a full browser with a JavaScript engine. Each time the loop picks up a URL, we use XPath to find the attribute. Once the attribute is extracted, the data is added to the corresponding column. Sometimes the data comes back in a different format, so we remove the unwanted characters here.
def Price():
    try:
        prices = driver.find_element(By.XPATH, '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
        prices = re.sub(r'[\$\,\(\)\[\]\'\"]', '', prices)
    except NoSuchElementException:
        prices = "Price is not available"
    return prices
Similarly, we can extract the Fragrance Description and the Details.
Fragrance Description of the Products:
def Fragrance_Description():
    try:
        element = driver.find_element(By.XPATH, '//*[@id="b46bc3ad-9907-43a6-9a95-88c160f02d7f"]/p')
        description = re.sub(r'[\[\]]', '', element.text)
    except NoSuchElementException:
        description = "Fragrance description is not available"
    return description
Details of the Products:
def Detail():
    # The details panel is collapsed; click the '+' toggle first if present
    try:
        driver.find_element(By.XPATH, '//*[@id="Details"]').click()
    except NoSuchElementException:
        pass
    time.sleep(3)
    try:
        details = driver.find_element(By.XPATH, "//*[@aria-controls='Details']").text
    except NoSuchElementException:
        details = "Details are not available"
    return details
To extract the details, we first need to click on the ‘+’ button. We click this button with Selenium using XPath.
The ‘Details’ data includes information about the Composition, Fragrance Family, Scent Type, Key Notes, and Features of each Women's Fragrance product, so we can split that information out into separate columns if we need to; a sketch of this follows the extraction loop below.
In the next step we call these functions. This loop iterates over the rows of the dataframe and extracts data from the web page at the URL in the Product_url column of each row. The data is extracted using the Price(), Fragrance_Description(), and Detail() functions and is added to the corresponding columns of the dataframe.
for each_product in range(len(df)):
    driver.get(df['Product_url'].iloc[each_product])
    df.loc[each_product, "Price"] = Price()
    df.loc[each_product, 'Fragrance Description'] = Fragrance_Description()
    df.loc[each_product, 'Details'] = Detail()
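Here is a minimal sketch of the optional post-processing step mentioned above, splitting the raw Details text into its own columns. It assumes each field in the scraped text starts on its own line with its label; the real layout may differ, so adjust the parsing to what you actually scrape:
# Hedged sketch: split the raw Details text into labeled fields.
# Assumes each field starts on its own line with its label.
DETAIL_LABELS = ('Composition', 'Fragrance Family', 'Scent Type', 'Key Notes', 'Features')

def split_details(details_text):
    fields = {}
    for line in str(details_text).splitlines():
        line = line.strip()
        for label in DETAIL_LABELS:
            if line.startswith(label):
                fields[label] = line[len(label):].strip(' :')
    return fields

# Add each label as its own dataframe column
parsed = df['Details'].apply(split_details)
for label in DETAIL_LABELS:
    df[label] = parsed.apply(lambda d: d.get(label))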
Finally, we write the data for each Women's Fragrance product to a CSV file.
# Converting the data to a CSV file
df.to_csv("Ulta_Women_Fragrance_Data.csv", index=False)
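As a quick sanity check, we can read the CSV back and inspect the first few rows:
# Read the CSV back and inspect the first rows
check = pd.read_csv("Ulta_Women_Fragrance_Data.csv")
print(check.head())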
Conclusion
Python and Selenium are powerful tools for web scraping dynamic websites. Python is a popular and easy-to-learn programming language that offers a wide range of libraries and frameworks for handling different types of data. Selenium is a browser automation library that can simulate user interactions with websites, making it possible to scrape websites that use JavaScript and other dynamic features. Together, Python and Selenium can be used to effectively and efficiently scrape data from dynamic websites, making them a valuable choice for data extraction tasks.
Want to gain a competitive advantage by gathering product information through web scraping?
Unlock the power of data with our web scraping services. Don't let your competition stay ahead; contact us today and see how we can help you gain a competitive edge!