
How to Use Web Scraping for Real Estate Insights from Bayut

Ambily Biju



Introduction


Would you believe me if I told you that web scraping can give you a clear, comparative picture of the real estate market? As a data analyst or business researcher, you can now acquire precise and current information from Bayut, one of the top property sites in the UAE. This blog outlines the tools and techniques needed to collect and analyze real estate data from Bayut.


What Is Web Scraping?


Web scraping is the process of extracting information from a website in an automated manner. Instead of manually copying and pasting relevant data, users can utilize different tools, software, and scripts that will do everything for them automatically.


For real estate websites such as Bayut, web scraping can be used to extract:

- Property locations, specifications, and selling prices

- Mortgage information and other related financials

- Market trends to enhance competitive analysis and decision making


The reasons businesses adopt web scraping include:

- Keeping their databases current with fresh listings

- Tracking changes in competitors' offerings

- Conducting market research to gain actionable insights


For your real estate business, here is a detailed account of how to scrape Bayut.


This article explains how to get real estate data from Bayut in two simple steps. The first script collects property links from multiple pages on Bayut, using the requests and BeautifulSoup libraries in Python to fetch and parse HTML content. The main techniques include rotating user agents so that requests look like they come from real visitors, walking through the paginated listing pages to harvest links, and storing those links in a SQLite database.


The second script performs a deep crawl of the property details behind the gathered links. The Playwright library provides asynchronous browser automation: each page is scrolled so that dynamically loaded content is rendered, and the resulting HTML is parsed with BeautifulSoup. In this phase, the available property information, such as prices, locations, specifications, and mortgage details, is extracted. The structured data goes into the database, and faulty URLs are logged for retry. Project best practices include pausing between requests, rotating the user agent, and handling errors as robustly as possible. This makes it a good starting point for anyone learning real-world web scraping.


Libraries and Tools Used in Bayut Web Scraping


The Bayut web scraping project uses a range of Python libraries and tools to extract both static and dynamic content. Each of these libraries contributes in its own way to smooth data retrieval and processing.


The requests library is used to perform HTTP requests when gathering web content. It makes GET requests to any URL and supports customized headers and cookies. In this project, requests cycles through the pagination links of Bayut's website, changing the user-agent string on each request so that the traffic looks like it comes from a human visitor.
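
A minimal sketch of this pattern (the URL and User-Agent string below are placeholders, not the exact values used in the project):

import requests

# Fetch one listings page with a custom User-Agent header.
url = "https://www.bayut.com/for-sale/apartments/dubai/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}

response = requests.get(url, headers=headers, timeout=30)
if response.status_code == 200:
    html = response.text  # raw HTML, ready for parsing
else:
    print(f"Request failed with status {response.status_code}")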


The bs4 package provides BeautifulSoup, a powerful parsing tool for HTML and XML. It converts raw HTML into a structured format that is easy to navigate and extract data from. The tool is applied in both scripts: first to pull property links from the listing pages, and later to parse the property details using tag-based selectors.
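
As a small illustration (the HTML snippet and class name here are made up, not Bayut's real markup):

from bs4 import BeautifulSoup

# Parse a tiny HTML fragment and pull out the link inside each listing card.
html = '<div class="listing"><a href="/property/details-123.html">2 Bed Apartment</a></div>'
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", class_="listing"):
    a_tag = div.find("a", href=True)
    if a_tag:
        print(a_tag["href"], a_tag.get_text(strip=True))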


The sqlite3 library is a lightweight database engine with built-in SQL capabilities. It stores and tracks URLs and maintains a status flag to distinguish successfully processed links from failed ones, which enables efficient retries and prevents redundant processing of already scraped URLs.


Playwright is an advanced library for browser automation, which supports dynamic content rendering. Unlike static HTML libraries, it supports JavaScript-driven pages, allowing scrolling and clicking on buttons. In the second script, the content is dynamically loaded by Playwright to ensure that all the details of properties are visible before the data is extracted.
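
A minimal sketch of this idea, assuming a placeholder URL rather than a specific Bayut listing:

import asyncio
from playwright.async_api import async_playwright

async def fetch_rendered_html(url):
    # Launch a headless browser, open the page, and scroll so lazy content renders.
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(1000)  # give dynamic content a moment to load
        html = await page.content()
        await browser.close()
        return html

html = asyncio.run(fetch_rendered_html("https://example.com"))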


asyncio enables tasks to run concurrently. Asynchronous functions, declared with async def, make scraping more efficient because many page navigations can be in flight at the same time, which significantly reduces the overall runtime of the collection process.
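
A minimal sketch of the concurrency pattern, with a stand-in coroutine instead of real browser work:

import asyncio

async def scrape_one(url):
    await asyncio.sleep(0.1)  # placeholder for navigation and parsing
    return f"done: {url}"

async def scrape_many(urls):
    # All per-URL tasks run concurrently instead of one after another.
    return await asyncio.gather(*(scrape_one(u) for u in urls))

print(asyncio.run(scrape_many(["url-1", "url-2", "url-3"])))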


The time and random libraries introduce delays and randomness between requests so that the traffic resembles human browsing. In the code, time.sleep() pauses execution and random.uniform() produces variable delays, which makes the scraper harder for anti-bot systems to flag.
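
For example, a randomized pause between two requests can be as simple as:

import random
import time

# Sleep for a random 2-5 second interval so the request pattern is not uniform.
delay = random.uniform(2, 5)
print(f"Sleeping for {delay:.2f} seconds...")
time.sleep(delay)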


These libraries form a robust framework for scalable and ethical web scraping, combining automation, dynamic content handling, and data storage to optimize the Bayut project's performance and reliability.


The Role of User Agents in Web Scraping: Mimicking Real Browsing Behavior


User agents are strings that provide servers with information about the client responsible for a web request, including browser type, version, and operating system. A website tailors its content and layout to the user's device using this information. User agents matter in web scraping because they allow the scraper to mimic the behavior of real users and evade sites' anti-bot measures.

In this Bayut scraping project, user agents simulate requests coming from different browsers and devices. To avoid the repetitive patterns that many websites flag as bot activity, the scraper rotates user-agent strings across requests. That improves access stability and reduces the risk of an IP ban, ensuring continuous data collection. Combined with techniques such as random delays between requests, user-agent rotation makes the overall scraping process more efficient, reliable, and stealthy, which makes it a crucial tool for effective data harvesting.
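
A small sketch of the rotation idea (the user-agent strings below are placeholders; the project loads a longer list from a file):

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

def fetch(url):
    # A different user agent is chosen for every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)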


Efficient Progress Tracking and Recovery with SQLite


Interruptions may arise from network problems, server blocks, or script errors during scraping, leaving the data incomplete. This project handles such scenarios by using SQLite as a lightweight database to track progress and allow resumption from the point of failure. Every URL is stored with a status flag initially set to 0, meaning data extraction is pending. Whenever the scraper successfully processes a URL, it updates the status to 1. If the scraper stops unexpectedly, it can query the database for the URLs whose status is still 0 and continue without duplicating work or losing data already acquired.
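
A sketch of the resume pattern, using the table and column names described above:

import sqlite3

conn = sqlite3.connect("bayut_webscraping.db")
cursor = conn.cursor()

# Fetch only the URLs still marked as pending (status = 0).
cursor.execute("SELECT link FROM product_links WHERE status = 0")
pending_urls = [row[0] for row in cursor.fetchall()]

for url in pending_urls:
    # ... scrape the URL here ...
    cursor.execute("UPDATE product_links SET status = 1 WHERE link = ?", (url,))
    conn.commit()

conn.close()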


SQLite is particularly well suited to web scraping: it is simple, efficient, and ships with Python, so no separate database server needs to be configured. Its compact file-based storage handles medium-scale data with minimal overhead. Using SQLite lets the scraper's pipeline persist across runs, keep transactions safe, and remain easy to query, which makes it robust, scalable, and well suited to dynamic, large-scale extraction of web data.


STEP 1: Product URL Scraping from Bayut


Libraries Overview for Bayut Web Scraping

import requests
from bs4 import BeautifulSoup
import sqlite3
import random
import time

This step imports requests, BeautifulSoup, sqlite3, random, and time to handle HTTP requests, parse HTML, manage data storage, and introduce human-like delays; Playwright and asyncio come in later, in the second script, for browser automation and asynchronous scraping.


Defining Constants for Web Scraping Configuration

# Constants
BASE_URL = "https://www.bayut.com"
INITIAL_URL = f"{BASE_URL}/for-sale/apartments/dubai/?completion_status=ready"
USER_AGENTS_FILE = "/home/user/Documents/Datahut_Internship/bayut/data/user_agents.txt"
DB_FILE = "bayut_webscraping.db"
HEADERS_COOKIE = 'anonymous_session_id=698fd37f-ac65-45f6-957f-fd1f0fb0e4b2; device_id=m5c20a111ecppbt7f'

This section defines the constant values needed for scraping. BASE_URL is the main address of the Bayut website, used as the basis for building other URLs. INITIAL_URL defines the page from which scraping of ready-to-sell apartments in Dubai starts. USER_AGENTS_FILE points to a text file containing different user-agent strings that mimic real-user behavior. DB_FILE is the name of the SQLite database file where the scraped URLs will be stored. HEADERS_COOKIE carries session-related cookies used when communicating with the website, sustaining the connection during scraping. Centralizing these settings makes the code easier to manage and modify.

Loading User Agents from a File

# 1. Load user agents from file
def load_user_agents(file_path):
    """
    Load user agents from a specified file.

    Args:
        file_path (str): The path to the file containing a list of 
                         user agent strings, with one agent per line.

    Returns:
        list: A list of user agent strings loaded from the file. 
              Each string is stripped of leading and trailing 
              whitespace.
    """
    with open(file_path, "r") as file:
        return [line.strip() for line in file.readlines()]

The load_user_agents function reads a list of user-agent strings from a specified file so the scraper can simulate different web browsers. It takes file_path as an argument, the location of a file that contains one user-agent string per line. It opens the file, reads all lines, removes surrounding whitespace with strip(), and returns the list of user-agent strings. These user agents help requests appear as if they were sent by real people, making it less likely that the website blocks them.

Selecting a Random User Agent

# 2. Get a random user agent
def get_random_user_agent(user_agents):
    """
    Select a random user agent from a provided list.

    Args:
        user_agents (list): A list of user agent strings.

    Returns:
        str: A randomly selected user agent string from the list.
    """
    return random.choice(user_agents)

The get_random_user_agent function helps pick a random user-agent string from a provided list of user agents. It takes user_agents as an argument, which is a list of user-agent strings loaded earlier. The function uses random.choice() to randomly select one user agent from the list and returns it. This randomness makes each web request look like it's coming from a different browser, helping avoid detection and blocking by websites.

Initializing the Database and Creating a Table

# 3. Initialize database and create table
def initialize_database(db_file):
    """
    Initialize an SQLite database and create a table for storing 
    product links if it does not already exist.

    Args:
        db_file (str): The file path of the SQLite database.

    Returns:
        tuple: A tuple containing:
               - conn (sqlite3.Connection): The SQLite connection 
                 object.
               - cursor (sqlite3.Cursor): The SQLite cursor object 
                 for executing database operations.
    """
    conn = sqlite3.connect(db_file)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS product_links (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            link TEXT UNIQUE,
            status INTEGER DEFAULT 0
        )
    """)
    conn.commit()
    return conn, cursor

The initialize_database function sets up the SQLite database in which the product links are saved for web scraping. It receives db_file as an argument, the name or path of the file where the database is stored. The function establishes a connection to the database and creates a cursor for executing SQL commands. It then creates the product_links table if it does not already exist. The table has three columns: id (a unique identifier), link (the product URL, enforced as unique), and status (whether the link has been processed, defaulting to 0), which makes it easy to track progress and retry failed links. The function returns the connection and cursor for further database operations.

Fetching and Parsing a Webpage

# 4. Fetch and parse a webpage
def fetch_page(url, user_agents):
    """
    Fetch a webpage and parse its content using BeautifulSoup.

    This function sends an HTTP GET request to the specified URL using 
    randomized headers for the 'User-Agent' field to mimic browser 
    behavior. It also includes a predefined cookie header to bypass 
    potential session-based restrictions on the server. If the 
    request is successful (status code 200), the HTML content of the 
    page is parsed using BeautifulSoup and returned. Otherwise, it 
    returns None.

    Args:
        url (str): The URL of the webpage to fetch.
        user_agents (list): A list of user agent strings used to 
                            randomize requests for ethical and 
                            anti-bot compliance.

    Returns:
        BeautifulSoup or None: 
            - If the HTTP request is successful, returns a BeautifulSoup 
              object containing the parsed HTML content of the webpage.
            - If the HTTP request fails (non-200 status code), returns 
              None.
    """
    headers = {
        'User-Agent': get_random_user_agent(user_agents),
        'Cookie': HEADERS_COOKIE
    }
    print(f"Fetching: {url} with User-Agent: {headers['User-Agent']}")
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to fetch page: {url}")
        return None
    return BeautifulSoup(response.text, "html.parser")

The fetch_page function retrieves a webpage and prepares its content for data extraction with BeautifulSoup. It takes two arguments: the url to fetch and a list of user agents used to randomize the requests so they look more like those sent by a real browser; this randomness also helps avoid detection by anti-bot systems. The request also includes the predefined cookie header to manage the session. The function sends an HTTP GET request with the requests library. If the request succeeds with a status code of 200, the HTML content is parsed with BeautifulSoup, enabling the extraction of structured data. On failure, it prints an error message and returns None. This makes the scraping more reliable and less likely to be blocked.

Extracting Product Links from a Webpage

# 5. Extract product links from a page
def extract_product_links(soup):
    """
    Extract product links from the parsed HTML content of a webpage.

    This function searches the parsed HTML (BeautifulSoup object) for 
    all product containers and extracts the individual product links. 
    It specifically looks for `div` elements with the class `"dde89f38"`, 
    where each product link is typically stored in an `<a>` tag. It then 
    constructs the full URL for each product and adds it to a list.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object containing the parsed 
                              HTML content of the webpage.

    Returns:
        list: A list of full URLs pointing to individual product pages. 
              The URLs are constructed by appending the `href` attribute 
              of the `<a>` tag to the base URL (`BASE_URL`).
    """
    product_links = []
    product_divs = soup.find_all("div", class_="dde89f38")
    for div in product_divs:
        a_tag = div.find("a", href=True)
        if a_tag:
            full_link = BASE_URL + a_tag['href']
            product_links.append(full_link)
    return product_links

The extract_product_links function collects all product links from the parsed HTML content of a webpage. It uses the BeautifulSoup object to locate div elements with the class name "dde89f38", which wrap the listings, and finds the <a> tag inside each one, which holds the actual link. For every matching tag, the function builds a full URL by joining BASE_URL with the tag's href attribute and appends it to a list, which it returns. This simplifies the task of collecting many product-page URLs from a single listing page, making large extractions quicker.


Finding the Next Page in Pagination

# 6. Find the next page URL
def find_next_page(soup):
    """
    Find the URL of the next page in the pagination of a webpage.

    This function searches the parsed HTML (BeautifulSoup object) for 
    the pagination section of the page. It looks for a `div` element 
    with the class `"e27bf381"`, which typically contains navigation 
    links. Specifically, it checks for a link with the title `"Next"`, 
    which indicates the next page in the series. If found, it returns 
    the full URL for the next page.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object containing the parsed 
                              HTML content of the webpage.

    Returns:
        str or None: The full URL of the next page if a "Next" link is 
                     found. Returns `None` if there is no "Next" page or 
                     pagination section.
    """
    pagination_div = soup.find("div", class_="e27bf381")
    if not pagination_div:
        return None
    next_page = pagination_div.find("a", title="Next", href=True)
    if next_page:
        return BASE_URL + next_page['href']
    return None

The find_next_page function is used to navigate to the next page in a series of paginated search results. It takes a BeautifulSoup object containing the parsed HTML of the current page and searches for a div element with the class "e27bf381", which holds the pagination links. It then looks for an <a> tag with the title "Next". If such a link is found, the function builds the full URL by joining BASE_URL with the href attribute and returns it so the next page can be requested. Returning None means there is no "Next" link, so all pages have been exhausted and there is nothing left to scrape.


Saving Unique Product Links to the Database

# 7. Save unique links to the database
def save_links_to_db(cursor, links):
    """
    Save unique product links to an SQLite database.

    This function iterates through a list of product links and inserts 
    each link into the `product_links` table of the SQLite database, 
    ensuring that duplicate links are ignored. If an error occurs during 
    the insertion process, it logs the error without stopping the process.

    Args:
        cursor (sqlite3.Cursor): The SQLite cursor object used to execute 
                                  SQL queries on the database.
        links (list): A list of product URLs (strings) to be inserted into 
                      the database.

    Returns:
        None: This function does not return any value. It directly modifies 
              the database by inserting the links.
    """
    for link in links:
        try:
            cursor.execute("""
                INSERT OR IGNORE INTO product_links (link, status) 
                VALUES (?, 0)
            """, (link,))
        except Exception as e:
            print(f"Failed to insert link {link}: {e}")

The save_links_to_db function saves the list of product links scraped from a page into the SQLite database. It accepts two parameters: the cursor, used to interact with the database, and links, a list of product URLs. The function iterates over each link and tries to insert it into the product_links table. The INSERT OR IGNORE SQL command prevents duplicates: if a link is already present in the database, the insert is simply skipped. If a database error occurs while inserting a link, it is caught and printed, but the process continues, so occasional errors do not interrupt the scraping run.

Main Function to Control the Workflow: scrape_bayut()

# 8. Main function to control the workflow
def scrape_bayut():
    """
    Main function to control the entire web scraping workflow for Bayut.

    This function orchestrates the web scraping process by loading user 
    agents, initializing the database, fetching pages, extracting product 
    links, saving them to the database, and navigating through pagination. 
    It continuously scrapes until there are no more pages to process, 
    handling each page's data with a delay to avoid overwhelming the server.

    It performs the following steps:
        1. Loads a list of user agents from a specified file.
        2. Initializes the SQLite database and sets up the `product_links` table.
        3. Begins scraping from the initial URL.
        4. Iterates over each page, fetching its content and extracting product links.
        5. Removes duplicate links and saves them to the database.
        6. Identifies the next page to scrape, and repeats the process until no 
           further pages are found.
        7. Introduces a random delay between requests to mimic human behavior 
           and prevent getting blocked.
        8. Closes the database connection after completing the scraping process.

    Args:
        None: This function does not take any arguments directly.

    Returns:
        None: This function does not return any value. It performs the scraping 
              and stores the results in the database.
    """
    # Load user agents
    user_agents = load_user_agents(USER_AGENTS_FILE)

    # Initialize database
    conn, cursor = initialize_database(DB_FILE)

    # Scraping process
    url = INITIAL_URL

    while url:
        soup = fetch_page(url, user_agents)
        if not soup:
            break

        # Extract product links
        page_links = extract_product_links(soup)
        print(f"Found {len(page_links)} product links on this page.")

        # Remove duplicates and save to database
        unique_links = list(set(page_links))
        save_links_to_db(cursor, unique_links)
        conn.commit()
        print(f"Saved {len(unique_links)} unique product links to the database.")

        # Find the next page
        url = find_next_page(soup)

        # Add a delay between page visits
        if url:
            delay = random.uniform(2,5)
            print(f"Delaying for {delay:.2f} seconds before visiting the next page...")
            time.sleep(delay)

    conn.close()
    print(f"Scraping completed. Links saved to the database {DB_FILE} in table 'product_links'.")

The scrape_bayut() function is the heart of the process, coordinating the whole flow of scraping product links from the Bayut website. It handles all the steps involved: loading user agents, initializing the database, extracting links from pages, saving them to the database, and navigating through multiple pages. The function acts as a controller that ensures all tasks are performed in sequence.


The function starts by loading the list of user agents from a file. It then connects to the SQLite database and creates the table for product links. Scraping begins from the initial URL, whose content is fetched with fetch_page. After fetching the page, the product links are extracted, duplicates are removed, and each unique link is saved to the database.


It also takes care of pagination by checking if there is a "next" page to scrape. It then moves on to the next page and continues the process until there are no more pages left to scrape. To avoid overwhelming the server, the function introduces a random delay between requests, simulating human-like behavior while scraping.


Once scraping is done, the function closes the database connection and prints a message saying the links have been saved. It returns no value, but the database now holds all the links that were scraped.


Entry Point to Execute the Script

# Execute the script
if __name__ == "__main__":
    """
    Entry point to execute the Bayut web scraping script.

    This block checks if the script is being executed as the main module. 
    If it is, it calls the `scrape_bayut` function to start the web scraping 
    process. This ensures that the scraping process is initiated only when 
    the script is run directly, and not when it is imported as a module 
    into another script.

    The script initiates the scraping of product links from the Bayut website 
    and saves them into an SQLite database. The scraping process includes 
    fetching pages, extracting product links, saving the links to the database, 
    and navigating through paginated pages. 

    Usage:
        - The script can be run directly from the command line, after ensuring 
          all dependencies and configurations (e.g., user agents, database) 
          are in place.
        - It will automatically start the scraping process and store the 
          results in the database specified by `DB_FILE`.
    """
    scrape_bayut()

The entry point of the web scraper is the if __name__ == "__main__": block. It checks whether the script is being run directly or imported into another script. When run directly, it calls the scrape_bayut() function, which initiates the whole scraping process: fetching pages from Bayut, extracting the product links, saving them into the SQLite database, and handling pagination so that all available pages are scraped.


STEP 2: Extracting Detailed Product Information from Individual Pages


import asyncio
import sqlite3
import random
import time
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

Generating a Random User-Agent for Web Scraping

# Function to get a random user-agent from the file
def get_random_user_agent():
    """
    Reads a list of user-agents from a text file and returns 
    a randomly selected user-agent string.
    
    Returns:
        str: A user-agent string randomly chosen from 
        the 'data/user_agents.txt' file.
    """
    with open('data/user_agents.txt', 'r') as file:
        user_agents = file.readlines()
    return random.choice(user_agents).strip()

This section defines the helper function get_random_user_agent. It reads the user-agent strings stored in the data/user_agents.txt file, strips any surrounding whitespace or newline characters, and returns one of them at random using random.choice. This way, each request made during the scrape can carry a different user-agent, which promotes anonymity and lowers the chances of getting blocked. This dynamic approach is essential for responsible and efficient web scraping.

Establishing a Database Connection

# Function to connect to the SQLite database
def get_db_connection():
    """
    Establishes a connection to the SQLite database.

    Returns:
        sqlite3.Connection: A connection object to the 
        'bayut_webscraping.db' database.
    """
    conn = sqlite3.connect('bayut_webscraping.db')
    return conn

This section shows the get_db_connection function, which connects to the SQLite database used to store the scraped data. The function opens the bayut_webscraping.db file and returns the connection object used to interact with the database. This connection is the entry point for all SQL commands that insert data or create tables. A dedicated function for database connections keeps the code modular and readable and makes it easy to manage database interactions throughout the scraping process.

Creating Database Tables for Web Scraping

def create_tables():
    """
    Creates necessary tables in the SQLite database for storing 
    scraped data and failed URLs.

    Tables:
        - bayut_product_data: Stores property details such as 
          product URL, price, location, specifications, benefits, 
          description, property info, features, amenities, 
          and mortgage details.
          Columns:
            - id (INTEGER): Auto-incremented primary key.
            - product_url (TEXT): URL of the property.
            - price (TEXT): Price of the property.
            - location (TEXT): Location of the property.
            - specifications (TEXT): Property specifications.
            - benefits (TEXT): Benefits of the property.
            - description (TEXT): Property description.
            - property_info (TEXT): Key property information.
            - features_and_amenities (TEXT): Features and amenities.
            - mortgage_details (TEXT): Mortgage calculation details.
            - FOREIGN KEY (product_url): References 'link' from 
              product_links table.

        - failed_urls: Stores URLs that failed during scraping.
          Columns:
            - id (INTEGER): Auto-incremented primary key.
            - link (TEXT): URL that failed to scrape.
            - status (INTEGER): Scraping status (e.g., 0 for failure).
            - reason (TEXT): Reason for failure.

    Notes:
        - Call this function once before starting the scraping 
          process to ensure tables are created.
    """
    conn = get_db_connection()
    cursor = conn.cursor()

    # Create bayut_product_data table
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS bayut_product_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_url TEXT,
        price TEXT,
        location TEXT,
        specifications TEXT,
        benefits TEXT,
        description TEXT,
        property_info TEXT,
        features_and_amenities TEXT,
        mortgage_details TEXT,
        FOREIGN KEY (product_url) REFERENCES product_links(link)
    )
    """)

    # Create failed_urls table
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS failed_urls (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        link TEXT,
        status INTEGER,
        reason TEXT
    )
    """)

    conn.commit()
    conn.close()

# Call this function before running the scraping process
create_tables()

The create_tables function initializes the required tables in the SQLite database for storing scraped data and logging failed URLs. It defines two tables: bayut_product_data and failed_urls. The bayut_product_data table includes columns for storing property details, such as product_url, price, location, specifications, benefits, description, property_info, features_and_amenities, and mortgage_details. Each row has a unique identifier (id), and a foreign key constraint on product_url references the link column in the product_links table to maintain relational integrity. The failed_urls table captures URLs that failed to scrape, with columns for link, status (an integer representing scrape success or failure, where 0 indicates failure), and reason (describing the cause of failure). The use of CREATE TABLE IF NOT EXISTS ensures that re-running the function does not cause errors if the tables already exist. The function commits changes to save the schema and closes the connection to free resources. It should be called once before the scraping process to ensure the database structure is in place.

Updating URL Status in the product_links Table

# Function to update the status of a URL in the product_links table
def update_url_status(url, status):
    """
    Updates the scraping status of a URL in the product_links table.

    Args:
        url (str): The URL of the product link to update.
        status (int): The status value to set (e.g., 1 for scraped, 
                      0 for pending).

    Notes:
        - Assumes the 'product_links' table has a 'status' column 
          and a 'link' column.
        - Call this function to mark URLs as scraped or pending 
          during the scraping process.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute(
        "UPDATE product_links SET status = ? WHERE link = ?", 
        (status, url)
    )
    conn.commit()
    conn.close()

The update_url_status function updates the status of a particular URL in the product_links table according to its scraping state. It takes two arguments: url, the product link as a string, and status, an integer where 1 means the URL has been scraped successfully and 0 means it is still pending. The function connects to the SQLite database, updates the status column of the matching link in the product_links table, commits the change, and closes the connection to free resources. It assumes the product_links table has link and status columns, and it is used to mark URLs as processed or pending during scraping.


Inserting Scraped Data into bayut_product_data Table

# Function to insert scraped data into the bayut_product_data table
def insert_scraped_data(data):
    """
    Inserts the scraped property data into the bayut_product_data table.

    Args:
        data (dict): A dictionary containing the following keys:
            - product_url (str): URL of the property.
            - price (str): Price of the property.
            - location (str): Location of the property.
            - specifications (str): Property specifications.
            - benefits (str): Benefits of the property.
            - description (str): Property description.
            - property_info (dict): Key property information.
            - features_and_amenities (dict): Features and amenities.
            - mortgage_details (dict): Mortgage calculation details.

    Notes:
        - Converts 'property_info', 'features_and_amenities', 
          and 'mortgage_details' dictionaries to strings 
          before storing in the database.
        - Assumes 'bayut_product_data' table structure matches 
          the data being inserted.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute(
        """
        INSERT INTO bayut_product_data (
            product_url, price, location, specifications, benefits, 
            description, property_info, features_and_amenities, 
            mortgage_details
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            data['product_url'], data['price'], data['location'], 
            data['specifications'], data['benefits'], data['description'], 
            str(data['property_info']), 
            str(data['features_and_amenities']), 
            str(data['mortgage_details'])
        )
    )
    conn.commit()
    conn.close()

The insert_scraped_data function saves the scraped property information into the bayut_product_data table of the SQLite database. It receives a dictionary named data whose keys cover the product URL, price, location, specifications, benefits, description, property information, features and amenities, and mortgage details. The nested dictionaries for property_info, features_and_amenities, and mortgage_details are converted to strings before storage because the table columns hold text. The function populates the columns with a safe parameterized query, commits the transaction, and closes the database connection so resources are not left locked. This ensures the scraped data is stored correctly for later use or analysis.


Inserting Failed URLs into failed_urls Table

# Function to insert failed URLs into the failed_urls table with a status
def insert_failed_url(url, reason, status):
    """
    Inserts a failed URL into the failed_urls table with the specified status.

    Args:
        url (str): The URL that failed to scrape.
        reason (str): The reason for the failure (e.g., timeout, 
                      connection error).
        status (int): default 0 for URLs that need to be re-scraped.

    Notes:
        - The status column is used to track the state of URLs:
            - 0 indicates that the URL failed and needs to be re-scraped.
        - This function helps track errors and manage re-scraping of failed URLs.
        - The default value of 0 is inserted to indicate the URL needs to be retried.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute(
        """
        INSERT INTO failed_urls (link, status, reason) 
        VALUES (?, ?, ?)
        """, 
        (url, status, reason)
    )
    conn.commit()
    conn.close()

The insert_failed_url function adds a failed URL during the scraping process into the failed_urls table, which includes a reason for the failure and a status. The url parameter is the failed link, reason describes why the failure happened (for example, a timeout or connection error), and status is set to 0 by default, indicating that the URL needs to be retried. This function keeps track of scraping errors and manages attempts at re-scraping in an efficient manner. It commits the data to the database and closes the connection after inserting the record.


Scrolling a Web Page to Load Dynamic Content

# Scraping functions for various data
async def scroll_page(page, direction="down", delay=100):
    """
    Scrolls a web page in the specified direction to load dynamic content.

    Args:
        page (playwright.async_api.Page): The Playwright page object.
        direction (str): Direction to scroll, either "down" or "up". 
                         Defaults to "down".
        delay (int): Delay between scroll steps in milliseconds. 
                     Defaults to 100 ms.

    Notes:
        - Uses JavaScript to calculate and scroll to different 
          positions on the page.
        - Simulates smooth scrolling by pausing between steps.
        - Helps in loading content dynamically rendered during scrolling.
    """
    scroll_height = await page.evaluate("document.body.scrollHeight")
    step = 100
    if direction == "down":
        for position in range(0, scroll_height, step):
            await page.evaluate(
                f"window.scrollTo(0, {position})"
            )
            await asyncio.sleep(delay / 1000)
    elif direction == "up":
        for position in range(scroll_height, 0, -step):
            await page.evaluate(
                f"window.scrollTo(0, {position})"
            )
            await asyncio.sleep(delay / 1000)

The scroll_page function scrolls a web page up or down to help load dynamic content that only appears as you scroll. It uses a Playwright page object to run the JavaScript needed for scrolling. The direction parameter selects either "down" (the default) or "up", and delay sets the pause between scroll steps, defaulting to 100 milliseconds. This scrolling technique is useful in web scraping whenever data loads on scroll, because it ensures all content is present before extraction.


Extracting Price from Web Page Content

async def extract_price(soup):
    """
    Extracts the price from the parsed HTML content.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing the 
                              HTML content of the page.

    Returns:
        str: The extracted price if found, otherwise "Price not found".
    """
    price_div = soup.find(
        "div", 
        class_="_61c347da"
    )
    return (
        price_div.find("span", class_="_2d107f6e").text 
        if price_div else "Price not found"
    )

The extract_price function fetches the price of a property from the parsed HTML using a BeautifulSoup object. It looks for a <div> element with the class _61c347da and, inside it, a <span> element with the class _2d107f6e that holds the price text. If the price is found it is returned as a string; otherwise the function returns "Price not found". This simplifies the extraction of price information from structured page content.


Extracting Property Location from HTML Content

async def extract_location(soup):
    """
    Extracts the location from the parsed HTML content.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
                              the HTML content of the page.

    Returns:
        str: The extracted location if found, otherwise 
             "Location not found".
    """
    location_div = soup.find(
        "div", 
        class_="e4fd45f0"
    )
    return (
        location_div.text 
        if location_div else "Location not found"
    )

This extract_location function finds and returns the location information from a webpage using the BeautifulSoup object, representing the parsed HTML. It searches for a <div> element with the class name e4fd45f0. If it finds the location, the function returns the text of the location; otherwise, it returns "Location not found." This function simplifies the process of getting property location details from page content.


Extracting Property Specifications from HTML Content

async def extract_specifications(soup):
    """
    Extracts the specifications from the parsed HTML content.
    Searches for a div with class "_14f36d85" to retrieve the 
    specifications text.
    Uses `.stripped_strings` to extract and join the strings 
    without extra whitespace.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
                              the HTML content of the page.

    Returns:
        str: A comma-separated string of specifications if found, 
             otherwise "Specifications not found".
    """
    spec_div = soup.find(
        "div", 
        class_="_14f36d85"
    )
    return (
        ", ".join(spec_div.stripped_strings) 
        if spec_div else "Specifications not found"
    )

The extract_specifications function retrieves the property specifications from a webpage using a BeautifulSoup object that represents the parsed HTML. It searches for a <div> element with the class name _14f36d85. If found, it uses .stripped_strings to gather all text content without extra whitespace and returns it as a comma-separated string. If the specifications are not available, it returns "Specifications not found." This function helps in obtaining structured specification information efficiently from the page content.


Extracting Property Benefits from HTML Content

async def extract_benefits(soup):
    """
    Extracts the benefits from the parsed HTML content.
    Searches for a div with the class "_34032b68 _656393c5 _701d0fe0"
    and then looks for an h1 element with the class "d8b96890 fontCompensation".
    If the benefits section is found, it returns the text content.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
                              the HTML content of the page.

    Returns:
        str: The extracted benefits if found, otherwise "Benefits not found".
    """
    overview_div = soup.find(
        "div", 
        class_="_34032b68 _656393c5 _701d0fe0"
    )
    benefits_h1 = (
        overview_div.find(
            "h1", 
            class_="d8b96890 fontCompensation"
        ) if overview_div else None
    )
    return (
        benefits_h1.text.strip() 
        if benefits_h1 else "Benefits not found"
    )

The extract_benefits function extracts the benefits section from a webpage using a BeautifulSoup object that represents the parsed HTML content. It first looks for a <div> element with the class _34032b68 _656393c5 _701d0fe0, which contains the relevant information. Inside this div, it searches for an <h1> element with the class d8b96890 fontCompensation. If both elements are found, it extracts and returns the text content after stripping extra spaces. If not, the function returns "Benefits not found."



Extracting Description from a Webpage

async def extract_description(soup):
    """
    Extracts the description from the parsed HTML content.
    Searches for a span element with the class "_3547dac9" to extract the 
    description text.
    Strips any extra whitespace from the description before returning.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
                              the HTML content of the page.

    Returns:
        str: The extracted description if found, otherwise "Description not found".
    """
    description_span = soup.find(
        "span", 
        class_="_3547dac9"
    )
    return (
        description_span.get_text(strip=True) 
        if description_span else "Description not found"
    )

The extract_description function retrieves the property description from a webpage's HTML content using BeautifulSoup. It searches for a <span> element with the class _3547dac9 and extracts its text content, removing extra spaces with get_text(strip=True). If the element is found, the description is returned; otherwise, the function returns "Description not found."


Extracting Property Information from a Webpage

async def extract_property_info(soup):
    """
    Extracts property information from the parsed HTML content.
    Searches for a <ul> element with the class "_3dc8d08d" to find the property details.
    Each <li> element contains a key-value pair, where the key is the property name and 
    the value is the property detail.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing 
                              the HTML content of the page.

    Returns:
        dict: A dictionary containing property information, with keys as 
              the property names and values as the corresponding details.
    """
    property_info = {}
    property_info_list = soup.find(
        "ul", 
        class_="_3dc8d08d"
    )
    if property_info_list:
        for item in property_info_list.find_all("li"):
            key = item.find(
                "span", 
                class_="ed0db22a"
            )
            value = item.find(
                "span", 
                class_="_2fdf7fc5"
            )
            if key and value:
                property_info[key.text.strip()] = value.text.strip()
    return property_info

The extract_property_info function retrieves key property information from the parsed HTML using BeautifulSoup. It searches for a <ul> element with the class _3dc8d08d, whose <li> elements each hold a property name and its value. The function builds a dictionary with the property names as keys and their details as values. If the elements exist, it adds them to the dictionary; otherwise, it returns an empty dictionary.


Extracting Features and Amenities from a Webpage

async def extract_features_and_amenities(soup):
    """
    Extracts features and amenities information from the parsed HTML content.
    
    This function retrieves two types of data:
    1. Features and amenities listed under sections with the class `da8f482a`.
    2. A list of amenities found under elements with the class `e3c6da98`.
    
    The data is structured in a dictionary where each key represents a section name, and
    the corresponding value is a list of items within that section. The function also 
    adds a special 'features' key to store general amenities not tied to a specific 
    section.

    Args:
        soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content 
                              of the product page. It is assumed that this object is 
                              already created from the page's HTML source using 
                              BeautifulSoup.

    Returns:
        dict: A dictionary containing the extracted features and amenities. 
              The dictionary has the following structure:
              {
                  'Section Name': ['item1', 'item2', ...],
                  'features': ['amenity1', 'amenity2', ...]
              }
              Where 'features' is a key holding general amenities, and other keys 
              represent different sections of features/amenities with their respective 
              list of items.
              
    Notes:
        - The function first searches for sections within the HTML that are marked with 
          the `da8f482a` class, which are assumed to represent feature groups.
        - Each section may contain multiple items, which are stored as a list under that 
          section's name.
        - Then, the function searches for additional features or amenities found under the 
          `e3c6da98`  class. These are typically individual items that are collected and 
          added under  the 'features' key.
        - If no items are found in a section or feature group, it will be omitted from the 
          final dictionary.
        - The function ensures that no empty lists are added to the dictionary.
    """
    features_and_amenities = {}

    # Condition 1: Extract features and amenities using 'da8f482a' class
    features_sections = soup.find_all(
        "div", 
        class_="da8f482a"
    )
    for section in features_sections:
        section_title = section.find(
            "div", 
            class_="_1c78af3b"
        )
        if section_title:
            section_name = section_title.text.strip()
            section_items = section.find_all(
                "div", 
                class_="_682538c2"
            )
            items = [
                item.find(
                    "span", 
                    class_="_7181e5ac"
                ).text.strip() 
                for item in section_items 
                if item.find("span", class_="_7181e5ac")
            ]
            if items:
                features_and_amenities[section_name] = items

    # Condition 2: Extract features and amenities using 'e3c6da98' class
    amenities = []
    feature_divs = soup.find_all(
        'div', 
        class_='e3c6da98'
    )
    for feature in feature_divs:
        amenity_name = feature.find(
            'span', 
            class_='_7181e5ac'
        ).get_text(strip=True)
        amenities.append(amenity_name)

    # Add amenities to the 'features' key if they exist
    if amenities:
        features_and_amenities['features'] = amenities

    return features_and_amenities

The extract_features_and_amenities function gathers data about features and amenities from a webpage. It looks for sections with specific classes that indicate the different types of features. Data is then sorted into a dictionary, where every key represents a section, and every value is a list of items under that section. The function first scrapes items under sections with class da8f482a, then it will look for common features under class e3c6da98. If no items are found under a section, that section will be excluded from the dictionary.

Extracting Mortgage Details from a Real Estate Product Page

async def extract_mortgage_details(page):
    """
    Extracts mortgage details from a real estate product page using Playwright.

    This function retrieves various mortgage-related information, including:
    1. Total price of the property
    2. Loan period (duration of the mortgage)
    3. Down payment amount
    4. Interest rate for the loan
    5. Monthly payment amount
    6. Total loan amount

    The data is extracted from specific elements on the page, including sliders and input 
    fields,using Playwright’s ability to get attributes and evaluate JavaScript 
    expressions.

    Args:
        page (playwright.async_api.Page): The Playwright Page object representing the real 
                                          estate product page from which mortgage details 
                                          will be extracted.

    Returns:
        dict: A dictionary containing the mortgage details, with the following keys:
              - 'total_price': The total price of the property.
              - 'loan_period': The duration of the mortgage loan (in years).
              - 'down_payment': The amount to be paid upfront as the down payment.
              - 'interest_rate': The interest rate applied to the loan.
              - 'monthly_payment': The amount to be paid monthly towards the loan.
              - 'total_loan_amount': The total amount to be borrowed after accounting for 
                                     the down payment.
              
    Notes:
        - The function uses Playwright’s `get_attribute()` method to fetch values from 
          specific HTML elements that represent the sliders for total price, loan period,
          and down payment.
        - The `evaluate()` method is used to extract the text content of sibling elements 
          for monthly payment and total loan amount, as these values are dynamically 
          rendered and not directly available as attributes.
        - The returned values are in string format and may need conversion depending on 
          the further  processing or calculations required.

    Example:
        Given a property page, the extracted mortgage details might look like:
        {
            "total_price": "500,000",
            "loan_period": "25",
            "down_payment": "50,000",
            "interest_rate": "3.5%",
            "monthly_payment": "2,000",
            "total_loan_amount": "450,000"
        }
    """
    total_price = await page.get_attribute(
        "div.rheostat-horizontal.d6de0973 button.rheostat-handle", 
        "aria-valuenow"
    )
    loan_period = await page.get_attribute(
        'div.rheostat-horizontal.d6de0973 button[aria-valuemax="30"]', 
        "aria-valuenow"
    )
    down_payment = await page.get_attribute(
        "div._817e5a85 input.e950d463", 
        "value"
    )
    interest_rate = await page.get_attribute(
        "div._555ea2e5 input[disabled]", 
        "value"
    )
    monthly_payment = await page.locator(
        "div.ba772d6e div.f4ade747"
    ).evaluate(
        "node => node.nextSibling.textContent.trim()"
    )
    total_loan_amount = await page.locator(
        "div.fdba19ff span._65628231 span.f4ade747"
    ).evaluate(
        "node => node.nextSibling.textContent.trim()"
    )

    return {
        "total_price": total_price,
        "loan_period": loan_period,
        "down_payment": down_payment,
        "interest_rate": interest_rate,
        "monthly_payment": monthly_payment,
        "total_loan_amount": total_loan_amount,
    }

The extract_mortgage_details function gathers mortgage-related information from a real estate page using Playwright. It retrieves key details such as the total price of the property, loan period, down payment, interest rate, monthly payment, and total loan amount. The function extracts these details by interacting with specific elements on the page, such as sliders and input fields, using Playwright’s ability to get attributes and evaluate JavaScript expressions. The data is returned as a dictionary, where each key corresponds to a mortgage-related detail.
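
Because the mortgage values come back as display strings (for example "500,000" or "3.5%"), they usually need to be normalised before any arithmetic. The helper below is a minimal sketch of one way to do that; the name parse_amount and the exact cleaning rules are assumptions, not part of the original script.

def parse_amount(value):
    """Convert a display string such as '500,000' or '3.5%' to a float, or None."""
    if value is None:
        return None
    cleaned = value.replace("AED", "").replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

# Using the sample values from the docstring above:
details = {"total_price": "500,000", "interest_rate": "3.5%", "loan_period": "25"}
numeric = {key: parse_amount(value) for key, value in details.items()}
# {'total_price': 500000.0, 'interest_rate': 3.5, 'loan_period': 25.0}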


Scraping Property Details from a Real Estate Listing

# Scraping each property detail
async def scrape_property_details(url):
    """
    Scrapes detailed information from a property listing page.

    This function navigates to a property URL and extracts various details about the 
    property, including price, location, specifications, benefits, description, property
    info, features and amenities, and mortgage details. It performs the scraping using 
    Playwright and parses the HTML content with BeautifulSoup. It also handles scrolling 
    the page to ensure dynamic content is fully loaded before extracting data.

    The scraped data is then inserted into a SQLite database and the status of the URL is 
    updated in the database. If an error occurs during scraping, the URL and error details 
    are saved into a separate table of failed URLs.

    Args:
        url (str): The URL of the property listing page to scrape.

    Returns:
        None

    Notes:
        - The function utilizes Playwright to launch a headless browser and interact with 
          the web page.
        - Scrolling is performed in both directions (down and up) to ensure dynamic 
          content is fully loaded.
        - BeautifulSoup is used for parsing the page content and extracting specific 
          information.
        - The database update functions (`insert_scraped_data` and `insert_failed_url`) 
          are used to store the results of the scraping process.
        - The `update_url_status` function updates the URL's status in the database to 
          indicate it is being scraped.

    Example:
        Calling `scrape_property_details('https://www.bayut.com/property/details-10399872.html')`
        will:
        1. Visit the URL `https://www.bayut.com/property/details-10399872.html`
        2. Extract property details such as price, location, and features.
        3. Insert the scraped data into the database.
        4. If scraping fails, log the error into the `failed_urls` table.

    Error Handling:
        - If any error occurs during the scraping process, the error message and the URL
          are logged in the `failed_urls` table with a status of 0.
    """
    headers = {
        'User-Agent': get_random_user_agent(),
        'Cookie': (
            'anonymous_session_id=5a135d15-3bf8-4766-9d41-46e9931130e3; '
            'device_id=m5c20a111ecppbt7f'
        )
    }

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Apply the rotated user agent and session cookie to all requests in this context
        context = await browser.new_context(
            user_agent=headers['User-Agent'],
            extra_http_headers={'Cookie': headers['Cookie']}
        )
        page = await context.new_page()
        try:
            # Update the status to 1 when the URL is taken for scraping
            update_url_status(url, 1)

            await page.goto(url, timeout=35000)
            await page.wait_for_timeout(1000)
            await scroll_page(page, direction="down", delay=100)
            await scroll_page(page, direction="up", delay=100)
            await page.wait_for_timeout(1000)

            content = await page.content()
            soup = BeautifulSoup(content, "html.parser")

            price = await extract_price(soup)
            location = await extract_location(soup)
            specifications = await extract_specifications(soup)
            benefits = await extract_benefits(soup)
            description = await extract_description(soup)
            property_info = await extract_property_info(soup)
            features_and_amenities = await extract_features_and_amenities(soup)
            mortgage_details = await extract_mortgage_details(page)

            # Prepare scraped data
            data = {
                'product_url': url,
                'price': price,
                'location': location,
                'specifications': specifications,
                'benefits': benefits,
                'description': description,
                'property_info': property_info,
                'features_and_amenities': features_and_amenities,
                'mortgage_details': mortgage_details
            }

            insert_scraped_data(data)

        except Exception as e:
            insert_failed_url(url, str(e), 0)
            print(f"Failed to scrape {url}: {e}")

        await browser.close()

The scrape_property_details function scrapes detailed information from a property listing page, such as price, location, specifications, benefits, description, property info, features, amenities, and mortgage details. It uses Playwright for browser automation and BeautifulSoup for parsing HTML content. The function also handles dynamic content by scrolling the page in both directions.


Once the data is extracted, it is stored in a SQLite database. If any error occurs during the scraping process, the URL and error details are logged in a separate table for failed URLs. The function updates the URL's status in the database to track the scraping process.
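
Because failed URLs are stored together with their error details, a retry pass can be layered on top of the existing helpers. The snippet below is only a rough sketch: it assumes the failed_urls table exposes url and status columns, which may differ from the schema actually created by insert_failed_url, so adjust the query to match.

import asyncio

async def retry_failed_urls():
    """Re-attempt URLs recorded in the failed_urls table (sketch only).

    The column names 'url' and 'status' are assumptions; change them to match
    the schema created elsewhere in the script.
    """
    conn = get_db_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM failed_urls WHERE status = 0")
    failed = [row[0] for row in cursor.fetchall()]
    conn.close()

    for url in failed:
        await scrape_property_details(url)  # reuse the existing scraping routine
        await asyncio.sleep(2)              # brief pause between retries

# asyncio.run(retry_failed_urls())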


Continuous Property Scraping Execution

# Main execution for continuous scraping
async def main():
    """
    Main execution function for continuous scraping of property details.

    This function continuously scrapes property details by fetching URLs from the 
    `product_links` table in the SQLite database. It processes URLs that have a `status`
    of 0, indicating that they are ready to be scraped. The function enters an infinite
    loop, where it fetches a single URL at a time, calls the `scrape_property_details`
    function to scrape the data, and updates the status of the URL in the database to
    indicate it is being processed.

    If no URLs with status 0 are found, the function exits the loop and prints a message 
    indicating that there are no more URLs to scrape.

    The `scrape_property_details` function is called to extract detailed information for
    each property (such as price, location, specifications, etc.) and store the results in 
    the database. 

    The loop continues indefinitely until no URLs are left to scrape.

    Args:
        None

    Returns:
        None

    Notes:
        - The function operates in an infinite loop, constantly fetching and scraping URLs 
          with status 0.
        - The loop exits when no URLs with status 0 are found in the `product_links` 
          table.
        - The database connection and cursor are managed within the function to fetch URLs
          for scraping.
        - `scrape_property_details` is called to scrape the data for each URL.
    """
    while True:
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT link FROM product_links WHERE status = 0 LIMIT 1")
        url_tuple = cursor.fetchone()
        conn.close()

        if url_tuple:
            url = url_tuple[0]
            await scrape_property_details(url)
        else:
            print("No URLs with status 0 found. Exiting.")
            break

# Run the scraping process
asyncio.run(main())

The main function continuously scrapes property details from URLs stored in the database. It fetches URLs that are marked with a status of 0, indicating they are ready to be scraped. The function runs in an infinite loop, scraping one URL at a time using the scrape_property_details function, which collects data about the property (e.g., price, location, specifications, etc.) and stores it in the database.


The loop continues until no URLs with status 0 are found, at which point it exits and prints a message indicating there are no more URLs to scrape. The database connection and cursor are managed within the function to fetch the URLs for processing.
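
If throughput ever becomes a concern, the same loop can be adapted to process a small batch of URLs concurrently while still pausing between batches. The variation below is a sketch built on the existing get_db_connection and scrape_property_details functions; the batch size and delay values are arbitrary assumptions and not part of the original script.

import asyncio
import random

async def main_concurrent(batch_size=3):
    """Scrape pending URLs in small concurrent batches (sketch, not the original loop)."""
    while True:
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute(
            "SELECT link FROM product_links WHERE status = 0 LIMIT ?", (batch_size,)
        )
        urls = [row[0] for row in cursor.fetchall()]
        conn.close()

        if not urls:
            print("No URLs with status 0 found. Exiting.")
            break

        # Scrape the batch concurrently, then pause briefly before the next batch.
        await asyncio.gather(*(scrape_property_details(url) for url in urls))
        await asyncio.sleep(random.uniform(1, 3))

# asyncio.run(main_concurrent())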


Conclusion


The Bayut web scraping project demonstrates a comprehensive and efficient approach to gathering real estate data using Python. The two-step process - first collecting property links and then deep crawling for detailed information - showcases a robust methodology for handling both static and dynamic web content. In particular, the project:

- Implemented ethical scraping practices with built-in delays and user agent rotation
- Created a resilient system with error handling and retry mechanisms
- Developed a scalable database structure for storing property details
- Handled dynamic content loading through page scrolling
- Extracted comprehensive property information, including prices, specifications, mortgage details, and amenities


FAQ SECTION


1. What kind of real estate insights can be gathered from Bayut using web scraping?

Web scraping can help extract data on property prices, rental trends, listing durations, demand patterns, agent details, and location-based market trends, enabling data-driven real estate decisions.


2. How often should real estate data be scraped for accurate insights?

The frequency depends on your use case. Daily or weekly scraping is ideal for tracking price fluctuations and market trends, while monthly updates may be sufficient for long-term investment analysis.


3. Can web scraping help in real estate investment decision-making?

Yes! Scraped data allows investors to analyze pricing trends, identify undervalued properties, compare neighborhoods, and predict market movements for better investment strategies.


4. What are the challenges of scraping real estate data from Bayut?

Challenges include website structure changes, anti-scraping measures like CAPTCHAs, data volume management, and ensuring compliance with legal and ethical guidelines.



Do you want to offload the dull, complex, and labour-intensive web scraping task to an expert?
