How to Build a Web Crawler in Python from Scratch

Bhagyeshwari Chauhan
Aug 12, 2020

Updated: Jun 26


Every business today relies on web data — whether it’s tracking competitor prices, monitoring trends, or collecting product information at scale.

Behind all of this sits one core engine: a web crawler. In this guide, we’ll walk you through how crawlers work, the different types, and how you can build one in Python — from a simple beginner script to a modern Playwright-based crawler that works on JavaScript-heavy sites.

Web scraping and crawling are effective ways to capture specific information from a website for further analysis and processing. If you are new to the topic, this guide aims to help you build a web crawler in Python for your own customized use.

But first, let us cover the basics of a web scraper or a web crawler.

Demystifying the terms ‘Web Scraper’ and ‘Web Crawler’

A web scraper is a program that extracts specific, well-defined data about a topic. For instance, if you need the prices of products on an e-commerce website, you can design a custom scraper to pull exactly that information from the right pages.

A web crawler, also known as a ‘spider’, takes a more generic approach. You can define a web crawler as a bot that systematically scans the Internet to index and pull content. It follows the links it finds on web pages. In general, a crawler navigates web pages on its own, at times even without a clearly defined end goal.

Hence, crawling is more like an exploratory search of the content on the Web. Search engines such as Google and Bing employ web crawlers to fetch the content at a URL, collect the links on that page, and follow those links to further pages.

Web scraping and crawling are not mutually exclusive activities, though. Web crawling creates a copy of the content, while web scraping extracts specific data for analysis or to create something new. To scrape data from the web, you first have to do some crawling to find and index the pages that hold the information you need. Conversely, crawling also involves a degree of scraping, such as saving the keywords, images, and URLs of each web page.


Types of Web Crawlers

At its core, a web crawler is a short program that works as an Internet bot whose task is to index the contents of websites. Most web pages are built and described with HTML tags and attributes. Thus, if you can specify the category of content you need, for instance a particular HTML tag, the crawler can look for that tag and collect every piece of information that matches it.
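As a minimal sketch of the idea of targeting one HTML tag, the snippet below collects the text of every occurrence of a chosen tag. It uses only Python's standard library so that it runs without extra installs; the class name TagTextCollector, the helper extract_tag_text, and the sample HTML are illustrative assumptions, not part of any particular crawler.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one HTML tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0       # greater than 0 while we are inside the target tag
        self.current = []    # text fragments of the element being read
        self.results = []    # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_tag_text(html, tag):
    """Return the text of every <tag> element found in the HTML string."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical page fragment used only for this demonstration
sample_html = """
<html><body>
  <h2>Phone A</h2><p>Price: 100</p>
  <h2>Phone B</h2><p>Price: 200</p>
</body></html>
"""
print(extract_tag_text(sample_html, "h2"))  # ['Phone A', 'Phone B']
```

Changing the second argument from "h2" to "p" would collect the price lines instead, which is the kind of attribute-based targeting described above.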

You can write this code in any programming language to collect data from the internet automatically. You can also customize the same bot for multiple sites, as long as those sites allow crawling and you stay within the applicable legal limits.
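One common way for a site to state which paths bots may visit is its robots.txt file. The sketch below shows how a crawler might check a URL against such rules using Python's standard urllib.robotparser module. The robots.txt content, the "my-crawler" user-agent name, and the example.com URLs are hypothetical, and passing this check does not by itself settle the legal questions around a crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download the file
# from the target site (RobotFileParser.set_url followed by read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


robots = build_robots_checker(ROBOTS_TXT)
print(robots.can_fetch("my-crawler", "https://example.com/products/"))     # True
print(robots.can_fetch("my-crawler", "https://example.com/private/data"))  # False
```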

There are multiple types of web crawlers. These categories are defined by the application scenarios of the web crawlers. Let us go through each of them and cover them in some detail.

1. General-Purpose Web Crawler

A general-purpose web crawler, as the name suggests, gathers as many pages as it can, starting from a particular set of URLs, to collect data at large scale. Running one requires a fast internet connection and a large amount of storage. Primarily, it is built to collect massive amounts of data for search engines and web service providers.

2. Focused Web Crawler

A focused web crawler is characterized by a focused search criterion or topic. It selectively crawls pages related to pre-defined topics. Hence, while a general-purpose web crawler would search and index all the pages and URLs on a site, the focused crawler only needs to crawl the pages related to the pre-defined topics, for instance, the product information on an e-commerce website. Thus, you can run this crawler with less storage space and a slower internet connection. Topic-specific (vertical) search services typically use this kind of crawler, whereas general search engines such as Google, Yahoo, and Baidu rely mainly on general-purpose crawling.
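The selective behaviour of a focused crawler can be sketched as a relevance filter applied to links before they are queued. This is a deliberately simple keyword match, assuming a crawl about mobile phones; the keyword set, the function names, and the shop.example.com links are all hypothetical, and real focused crawlers often use more elaborate relevance scoring.

```python
from urllib.parse import urlparse

# Hypothetical topic keywords for a crawl focused on mobile phones
TOPIC_KEYWORDS = {"phone", "mobile", "smartphone"}


def is_relevant(url, anchor_text, keywords=TOPIC_KEYWORDS):
    """Return True if the link's path or anchor text mentions a topic keyword."""
    haystack = (urlparse(url).path + " " + anchor_text).lower()
    return any(keyword in haystack for keyword in keywords)


def filter_links(links, keywords=TOPIC_KEYWORDS):
    """Keep only the (url, anchor_text) pairs worth crawling for this topic."""
    return [(url, text) for url, text in links if is_relevant(url, text, keywords)]


links = [
    ("https://shop.example.com/mobile-phones/model-x", "Model X"),
    ("https://shop.example.com/careers", "Work with us"),
    ("https://shop.example.com/p/123", "Budget smartphone deals"),
]
for url, text in filter_links(links):
    print(url, "-", text)
```

Only the first and third links pass the filter, so the careers page is never fetched.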

3. Incremental Web Crawler

Imagine you have been crawling a particular page regularly and want to search, index, and update your existing information repository with the newly updated information on the site. Would you crawl the entire site every time you want to update the information? That sounds like an unwanted extra cost of computation, time, and memory on your machine. The alternative is to use an incremental web crawler.

An incremental web crawler crawls only newly generated or updated information on web pages. It does not re-download content that has not changed since the previous crawl, so it saves crawling time and storage space.
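One simple way to detect changed pages is to store a fingerprint (hash) of each page from the previous crawl and compare it with the current one. The sketch below assumes the pages have already been fetched, so it only avoids re-processing and re-storing unchanged content; avoiding the download itself would typically rely on conditional HTTP requests (for example the ETag or Last-Modified headers), which are not shown here. The function names and URLs are hypothetical.

```python
import hashlib


def content_fingerprint(html):
    """Return a stable hash of the page content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def pages_to_update(fetched_pages, seen_fingerprints):
    """Return the URLs whose content is new or changed.

    fetched_pages maps url -> html from the current crawl.
    seen_fingerprints maps url -> fingerprint from earlier crawls and is
    updated in place so that it can be reused on the next run.
    """
    changed = []
    for url, html in fetched_pages.items():
        fingerprint = content_fingerprint(html)
        if seen_fingerprints.get(url) != fingerprint:
            changed.append(url)
            seen_fingerprints[url] = fingerprint
    return changed


store = {}
first_run = pages_to_update(
    {"https://example.com/a": "<p>v1</p>", "https://example.com/b": "<p>v1</p>"},
    store,
)
second_run = pages_to_update(
    {"https://example.com/a": "<p>v1</p>", "https://example.com/b": "<p>v2</p>"},
    store,
)
print(first_run)   # both URLs are new on the first run
print(second_run)  # only page b changed: ['https://example.com/b']
```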

4. Deep Web Crawler

Web pages can broadly be divided into the Surface Web and the Deep Web (also called the Invisible Web or Hidden Web). A surface page can be indexed by a traditional search engine; it is basically a static page that can be reached through a hyperlink.

Web pages in the Deep Web contain content that cannot be reached through static links. This content is hidden behind a search form or a login: users cannot see it without submitting certain keywords, and some pages are visible only to registered users. A deep web crawler helps us collect information from these invisible web pages.
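To reach content that sits behind a search form, a crawler has to submit the form itself rather than follow a link. The sketch below builds, but deliberately does not send, such a form submission with the standard library. The form URL and the field names "q" and "page" are hypothetical; a real form's action URL and field names would have to be read from the page's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(form_url, fields):
    """Build (but do not send) a POST request that submits a search form."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        form_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_search_request(
    "https://example.com/search",         # hypothetical form action URL
    {"q": "lithium battery", "page": 1},  # hypothetical form field names
)
print(request.get_method(), request.full_url)  # POST https://example.com/search
print(request.data)                            # b'q=lithium+battery&page=1'

# Sending the request would look like this:
# from urllib.request import urlopen
# html = urlopen(request, timeout=10).read()
```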


When do you need a web crawler?

From the above sections, we can infer that a web crawler imitates the actions of a human who searches the web and collects content from it. Using a web crawler, you can gather all the content you need. You might need to build a web crawler in one of these two scenarios:

1. Replicating the Action of a Search Engine: Search Action

Most search engines, and the general search function on any portal site, rely on web crawlers for their underlying operations. The crawler visits websites and reads their pages and other information to create entries for a search engine index, which in turn helps the search engine locate the pages most relevant to a searched topic.

To replicate the search function as in the case of a search engine, a web crawler helps:

Provide users with relevant and valid content
Create a copy of all the visited pages for further processing
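To make the idea of "entries for a search engine index" concrete, here is a toy sketch that maps each word to the pages containing it and answers simple queries. It assumes the crawler has already produced a dictionary of URL to page text; the function names and sample pages are hypothetical, and production search indexes are far more sophisticated than this.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical crawled pages: url -> extracted text
pages = {
    "https://example.com/a": "Lithium battery basics",
    "https://example.com/b": "Battery recycling guide",
}
index = build_index(pages)
print(search(index, "battery"))          # both pages
print(search(index, "lithium battery"))  # ['https://example.com/a']
```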

2. Aggregating Data for Further Actions: Content Monitoring

You can also use a web crawler for content monitoring. You can then use it to aggregate datasets for research, business, and other operational purposes. Some obvious use-cases are:

Collect information about customers, marketing data, campaigns and use this data to make more effective marketing decisions.
Collect relevant subject information from the web and use it for research and academic study.
Search information on macro-economic factors and market trends to make effective operational decisions for a company.
Use a web crawler to extract data on real-time changes and competitor trends.

Basic Workflow of a General Web Crawler

Obtain the Initial URL

The crawling process starts from an initial URL, often called the seed URL, which is usually a known address provided by the user. It serves as the crawler's entry point and determines where the crawl begins. For example, if you're crawling a news website, the initial URL might be the homepage of that site.

Fetch and Parse HTML Content

Once it has the initial URL, the crawler sends a request to the web server to retrieve the HTML content of the page. It then parses the HTML, which means it analyzes the structure of the page to extract useful information. During this parsing phase, the crawler identifies all hyperlinks (URLs) in the HTML. These links represent other web pages that the crawler can visit next. For example, on a news homepage, these links might point to individual news articles, category pages, or external websites.

Queue the URLs

The URLs extracted from the HTML content are added to a queue, a data structure that holds the URLs waiting to be crawled. The queue operates on a first-in, first-out (FIFO) basis, so the URLs added first are processed first. It lets the crawler manage its workload and visit each discovered URL systematically. In large-scale crawling, managing the queue well is crucial, because the number of discovered URLs can quickly outgrow what the crawler can process.

Crawl the URLs

The crawler then enters a loop where it processes each URL in the queue one by one. For each URL, the crawler performs the same steps: fetching the HTML content, parsing the page to extract new URLs, and adding these new URLs back into the queue. To avoid fetching the same page twice, the crawler also keeps track of the URLs it has already visited. This loop allows the crawler to explore the web, expanding its reach by following links from one page to another. As the crawler moves through the web, it accumulates data from each visited page, which can be stored, analyzed, or used for various purposes, such as building a search engine index or monitoring website changes.

Check Stop Condition

The crawler continues this looping process until a predefined stop condition is met. Stop conditions can vary depending on the purpose of the crawl. Common stop conditions include reaching a maximum number of pages crawled, running out of URLs in the queue, or hitting a time limit. If no stop condition is set, the crawler will continue indefinitely, which may lead to issues such as endless loops or excessive resource consumption. Properly defining and monitoring stop conditions is essential to ensure that the crawling process is efficient and doesn't run uncontrollably.
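The five steps above can be sketched as a short breadth-first crawl loop. This is a minimal illustration using only the standard library, not a production crawler: it has no politeness delay, no robots.txt check, and no restriction to a single domain. The names LinkParser, extract_links, fetch, crawl, and the FAKE_SITE dictionary are hypothetical; the fake site lets the loop run without any network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Parse the HTML and return absolute http(s) URLs without #fragments."""
    parser = LinkParser()
    parser.feed(html)
    absolute = (urldefrag(urljoin(base_url, href)).url for href in parser.links)
    return [url for url in absolute if url.startswith(("http://", "https://"))]


def fetch(url):
    """Download a page over HTTP (default fetcher, needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from seed_url; returns a dict of url -> html."""
    queue = deque([seed_url])   # FIFO queue of URLs waiting to be crawled
    visited = set()             # URLs already taken from the queue
    pages = {}
    # Stop conditions: the queue is empty or max_pages pages were collected
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except Exception:
            continue            # skip pages that fail to download
        pages[url] = html
        for link in extract_links(url, html):
            if link not in visited:
                queue.append(link)
    return pages


# Hypothetical in-memory "website" so the example runs offline
FAKE_SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">Home</a> <a href="/c">C</a>',
    "https://example.com/b": "<p>No links here</p>",
    "https://example.com/c": '<a href="/a#top">Back to A</a>',
}
crawled = crawl("https://example.com/", fetch_page=FAKE_SITE.__getitem__, max_pages=3)
print(list(crawled))
```

With max_pages=3 the crawl stops after the homepage, /a, and /b, even though /c is still waiting in the queue. Calling crawl with only a seed URL would use the real fetch function instead of the fake site.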

How can you build a Web Crawler from scratch?

There are many open-source and paid web crawlers on the market. You can also write your own in any programming language, and Python is one of the most widely used choices. Let us look at a few examples.

Building a Web Crawler using Python

Python has a concise syntax and a rich ecosystem of libraries, which is why it is often used to build web scrapers and crawlers. A library commonly used for this purpose is the ‘scrapy’ package. Let us look at a basic example.

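import scrapy


class spider1(scrapy.Spider):
    # Name used to identify this spider
    name = "Wikipedia"
    # URLs the crawl starts from
    start_urls = ["https://en.wikipedia.org/wiki/Battery_(electricity)"]

    def parse(self, response):
        # Called with each downloaded page; extraction logic goes here
        pass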

The above class consists of the following components:

a name for identifying the spider or the crawler, “Wikipedia” in the above example.
a start_urls variable containing a list of URLs to begin crawling from. Here we specify the URL of the Wikipedia page on electric batteries.
a parse() method which will be used to process the webpage to extract the relevant and necessary content.

You can run the spider with a simple command: ‘scrapy runspider spider1.py’. The output looks something like this.

The above output contains all the links and the information (text content) on the website in a wrapped format. A more focused web crawler to pull product information and links from an e-commerce website looks something like this:

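import requests
from bs4 import BeautifulSoup


def web(page, web_url):
    # 'page' is kept from the original signature; nothing runs unless it is positive
    if page > 0:
        response = requests.get(web_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # Product links on the listing page carried this CSS class when the
        # article was written; the selector may need updating
        for link in soup.find_all("a", {"class": "s-access-detail-page"}):
            print(link.get("title"))  # product name
            print(link.get("href"))   # product URL


web(1, "https://www.amazon.in/mobile-phones/b?ie=UTF8&node=1389401031&ref_=nav_shopall_sbc_mobcomp_all_mobiles")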

This snippet gives the output in the following format.

The output lists all the product names and their respective links. This is more specific information than the first crawler pulled.

Crawling a Dynamic Web Page

The requests library (used in the example above) only downloads the initial HTML. Many modern websites load their content with JavaScript after the page loads. If you use requests on these sites, you'll often get little more than a loading script, not the data.

2020 Method: requests + BeautifulSoup
2026 Upgrade: Playwright or Selenium

Instead of just requesting a page, you need to run it in a real browser. Tools like Playwright (or Selenium) control a headless (invisible) browser, letting all the JavaScript execute. You can then scrape the fully-rendered content.

See the sample code below (updated on 17/11/2025):

# The modern way with Playwright
from playwright.sync_api import sync_playwright


def get_dynamic_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait for a specific element to be visible
        # This ensures the JavaScript has loaded the content
        page.wait_for_selector('h1')
        content = page.content()  # Get the fully rendered HTML
        browser.close()
        return content


# You would then pass this 'content' to BeautifulSoup
# soup = BeautifulSoup(content, 'html.parser')


Other crawlers in the market

There are multiple open-source crawlers in the market that can help you collect/mine data from the Internet. You can conduct your due research and use the best possible tool for collecting information from the web. A lot of these crawlers are written in different languages like Java, PHP, Node, etc.

While some of these crawlers work across multiple operating systems, others are tailor-made for specific platforms such as Linux. Examples include GNU Wget (written in C), PHP-Crawler (PHP), and JSpider (Java), among many others.

To choose the right crawler for your use, you must consider factors like the simplicity of the program, speed of the crawler, ability to crawl over various websites (flexibility), and memory usage of these tools before you make your final choice.

Web Crawling with Datahut

While there are multiple open-source crawlers, they might not be able to crawl complicated web pages and sites at large scale. You will need to tweak the underlying code so that it works for your target page. Moreover, as mentioned earlier, a given crawler might not run on every operating system in your ecosystem. Speed and computational requirements can be another hassle.

To overcome these difficulties, Datahut can crawl multiple pages irrespective of your platforms, devices, or the code language and store the content in simple readable file formats like .csv or even in database systems. Datahut has a simple and transparent process of mining data from the web.

You can read more about our process and the multiple use-cases we have helped solve with data mining from the web. Get in touch with Datahut for your web scraping and crawling needs.

Dealing with Advanced Anti-Scraping Measures

Websites are much better at detecting and blocking crawlers than they used to be. By 2026, this is expected to be the main bottleneck. The simple requests-based script shown earlier is likely to be blocked quickly on well-protected sites.

Here are the essential upgrades:

Proxy Rotation

Problem: Sending many requests from a single IP address is the easiest way to get banned.
Solution: Use a rotating proxy service (datacenter or residential). Each request then appears to come from a different IP address, which makes your crawler much harder to track and block.
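A rough sketch of proxy rotation follows. It cycles through a pool of proxy addresses and produces the mapping format that the requests library accepts through its proxies argument. The proxy addresses are hypothetical placeholders; a real list would come from your proxy provider, and many providers expose a single rotating endpoint instead.

```python
from itertools import cycle

# Hypothetical proxy endpoints used only for illustration
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxy mappings, cycling through the pool forever."""
    for proxy_url in cycle(proxy_urls):
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXIES)
for _ in range(4):
    proxies = next(rotation)
    print(proxies["https"])  # proxy1, proxy2, proxy3, then proxy1 again
    # With the requests library, the mapping would be passed like this:
    # requests.get(url, proxies=proxies, timeout=10)
```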

User-Agents and Request Headers

Problem: Libraries like requests and Scrapy have a default "user-agent" (like python-requests/2.28.0) that immediately identifies them as a bot.
Solution: You must rotate user-agents and mimic other browser headers. Use a list of real user-agents (e.g., from Chrome, Firefox, Safari on mobile and desktop) and pick one randomly for each request.
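The following sketch picks a random user-agent for each request and adds a couple of browser-like headers. The user-agent strings are illustrative examples that may be outdated by the time you read this, and the function name random_headers is hypothetical; in practice you would maintain a current list.

```python
import random

# Example user-agent strings (illustrative; keep such a list up to date)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers():
    """Return browser-like request headers with a randomly chosen user-agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


headers = random_headers()
print(headers["User-Agent"])
# With the requests library, the headers would be passed like this:
# requests.get(url, headers=random_headers(), timeout=10)
```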

"Stealth" Browser Automation

Problem: Even headless browsers like Playwright can be detected. Anti-bot systems check for properties specific to automated browsers.
Solution: Use stealth plugins (like playwright-stealth) that patch the browser to make it appear like a normal, human-controlled browser.

CAPTCHAs

Problem: You will eventually run into CAPTCHAs ("I am not a robot" checks).
Solution: This is the hardest part. The solution is to integrate a third-party CAPTCHA-solving service. These services have APIs where you send them the CAPTCHA, and they send you back the solution.

Frequently Asked Questions

1. What is a web crawler?

A web crawler is a bot that systematically browses and collects information from web pages. It’s commonly used for indexing content (like Google), data scraping, and monitoring websites.

2. Do I need coding knowledge to build a web crawler?

Yes, some coding knowledge—especially in Python or another scripting language—is essential to build an effective crawler. You'll need to understand HTTP requests, parsing HTML, and handling site structures.

3. Which programming language is best for building a web crawler?

Python is the most popular choice due to its rich ecosystem of libraries like Requests, BeautifulSoup, Scrapy, and Playwright. It’s beginner-friendly and widely used in the web scraping community.

4. Is it legal to build and use a web crawler?

It depends. Crawling publicly available websites is usually legal, but scraping data may violate terms of service or privacy laws. Always review a website’s robots.txt and applicable legal policies before crawling.

5. What are the key components of a basic web crawler?

A basic crawler includes a URL queue, an HTTP request handler, an HTML parser, a data extractor, and a mechanism to follow links recursively or up to a set depth.
