Choosing a Web Scraping Service? Think About the Total Cost of Data Ownership
The total cost of ownership of data (in a ready-to-consume form) is a critical metric that buyers of web scraping solutions often overlook or simply don’t know about. By textbook definition, the total cost of ownership (TCO) is the sum of initial capital expenditures (CapEx) plus ongoing and long-term operational expenditures (OpEx). 

Identifying and weighing the TCO variables when defining the scope of a web scraping project is hard even for experienced practitioners. A simple miscalculation or oversight can cost companies hundreds of dollars every year. 

For the calculations below, assume the scope of the project is extracting data from 250 e-commerce websites every week. 

Let us look at the types of web scraping solutions available and how the TCO variables fit differently into each of those options. 

Data Acquisition methods

  1. Code it yourself / in-house web scraping team In layman’s terms, coding it yourself is like building a car by buying its parts online and assembling them: you build the car with a team of mechanics and appoint a driver to run it. You can build web scrapers using open-source scraping frameworks available for free. A few useful open-source frameworks are the following:
    1. Scrapy
    2. Nokogiri
    3. Apache Nutch
  2. DIY tools Self-service tools make it possible for non-technical people to get data from websites in a structured format. In theory, anyone should be able to configure a self-service tool. In reality, most companies end up hiring a developer to clean the data and write scripts to handle complex cases. 
  3. Data as a Service Data as a Service (DaaS) is like Uber: you request a ride when you need it and pay for the distance travelled. You don’t worry about the technicalities of data extraction. DaaS helps organizations get data in a ready-to-use format while the vendor manages everything. 
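To make the "code it yourself" option concrete, here is a minimal sketch of a hand-rolled extractor using only Python’s standard library. The HTML snippet and field names are hypothetical; a real project would fetch pages over HTTP, handle pagination, retries, and proxies, and would more likely use a framework such as Scrapy.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup; stands in for a fetched page.
SAMPLE_HTML = """
<div class="product"><span class="name">Widget</span><span class="price">19.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
"""

class ProductParser(HTMLParser):
    """Collects {name, price} records from 'product' blocks."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # which span class we are currently inside

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "name":
            self.products.append({"name": data.strip()})
        elif self._field == "price":
            self.products[-1]["price"] = float(data.strip())
        self._field = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.products)
```

Even this toy version hints at the hidden cost: every one of the 250 sites needs its own parsing rules, and each rule breaks whenever the site’s markup changes.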

Now let us see what the TCO variables for a web scraping project are and how they vary for each type of solution.

1. The crawling infrastructure

You need a robust crawling infrastructure to get data reliably. From deploying scrapers to monitoring pattern changes, the infrastructure has a lot to do. The cost of a robust crawling infrastructure is a key TCO metric. 

Code it yourself: You either need to build a platform from the ground up or deploy and maintain an open-source platform. To build a platform for scraping data from 250 websites, you’ll need full-time developers.

Self-service tool: You don’t have to worry about the primary infrastructure because the vendor has already built it. However, self-service tools can’t scrape complex websites or handle cases that require customization. For those, you will need a separate setup to scrape the websites the self-service tool can’t. Depending on the capability of the platform, you may need additional full-time developers. 

Data as a service: The vendor is responsible for the complete data delivery, not the customer, and no resources are required here.


2. The volume of data

The volume of data is a crucial TCO metric data procurement teams should consider. As the volume of data increases, the price also increases. 

Code it yourself: The cost is not linear: as the volume and the number of websites increase, you’ll need different types of data-handling infrastructure. If there are 150 websites to scrape, you’ll need at least one very good full-time developer to handle and scale the infrastructure. 

Self-service tool: Any cloud-based setup can easily be connected to the self-service tool. However, you’ll need a person to monitor its health and its interoperability with other services. 

DaaS: DaaS providers give you the data, and you can easily connect it with Amazon S3 or any similar cloud platform. Direct integration with databases is also possible. No additional resources are required. 

3. Frequency of crawls

Frequency of crawls is another key TCO metric you have to consider. 

Code it yourself: You either have to build a smart scheduling mechanism or trigger crawls manually. If the data requirement is weekly and you need data from 250 websites, you’ll need two full-time developers for the deployment. 
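The scheduling mechanism mentioned above can start very simply. Here is a minimal sketch for the weekly, 250-site scenario, using only the standard library; the site names, batch size, and start time are hypothetical, and a production system would add persistence, retries, and monitoring.

```python
from datetime import datetime, timedelta

def next_weekly_run(last_run: datetime) -> datetime:
    """The next crawl is exactly one week after the previous run."""
    return last_run + timedelta(weeks=1)

def batches(sites, batch_size):
    """Split the site list into batches so only a few crawlers run at once."""
    for i in range(0, len(sites), batch_size):
        yield sites[i:i + batch_size]

# Hypothetical inputs: 250 sites crawled weekly, 25 at a time.
sites = [f"site-{n}.example.com" for n in range(250)]
last_run = datetime(2020, 1, 6, 2, 0)  # a Monday at 02:00
print(next_weekly_run(last_run))
print(sum(1 for _ in batches(sites, 25)))
```

Batching matters because firing 250 crawlers simultaneously spikes infrastructure load; spreading them out is one of the many small design decisions that add to the in-house cost.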

Self-service tool: You can schedule the data extraction in a self-service tool. However, if there is a pattern change, you’ll have to deal with it manually by reconfiguring the tool and redeploying the agent. 

Data as a service: The vendor is responsible for the scheduling and the complete data delivery, not the customer, and no resources are required here. 

4. Scraper Maintenance 

Code it yourself: Websites change their patterns very frequently. You will need full-time resources to maintain 250 web scrapers built using an open-source framework. 

Self-service tool: If the website changes its pattern, you need to reconfigure the agent and maintain the scrapers. You’ll need at least one full-time resource for that. 

Data as a service: The vendor is responsible for scraper maintenance, and there is no extra resource required. 

5. Quality assurance

Quality assurance is another TCO metric people tend to underestimate. Your data needs to pass strict QA checks before it is used in a production-level application. The QA team constantly needs to monitor data quality and make sure the data from the web scraping service is ready to use. 

Code it yourself: Setting up rule-based QA protocols along with manual testers is the usual way of doing things in-house. You need at least two people for QA. 
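Rule-based QA protocols like those mentioned above can be as simple as a set of validators run over every record before it reaches production. The field names and rules below are hypothetical examples of the kind of checks an in-house team would write.

```python
def validate_record(record):
    """Return a list of rule violations for one scraped product record."""
    errors = []
    # Rule 1: required fields must be present and non-empty.
    for field in ("name", "price", "url"):
        if not record.get(field):
            errors.append(f"missing field: {field}")
    # Rule 2: price must be a positive number.
    price = record.get("price")
    if price is not None and (not isinstance(price, (int, float)) or price <= 0):
        errors.append("invalid price")
    return errors

# Example: one clean record and one that trips both rules.
good = {"name": "Widget", "price": 19.99, "url": "https://example.com/w"}
bad = {"name": "", "price": -5, "url": "https://example.com/x"}
print(validate_record(good))
print(validate_record(bad))
```

The rules themselves are cheap to write; the ongoing cost is keeping them in sync with 250 sources whose data shapes keep drifting, which is why QA ends up as recurring head-count rather than a one-off task.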

Self-service tool: Most self-service tools have basic QA templates that need to be reconfigured in some cases. You also need to adjust the data structure based on pattern changes in the website. Depending on the sources and data types, you’ll need full-time resources to get data from 250 websites. 

Data as a service: The vendor is responsible for quality assurance, and there is no extra resource required.

6. Customer support

Customer support is a TCO metric many people underestimate while defining the scope of a large-scale project. 

Code it yourself: Even in an in-house setup, you need a dedicated person to take care of problems in data delivery, maintenance, etc. This person should act as a bridge between different stakeholders; otherwise, the project won’t run smoothly. 

Self-service tool: The most common support you get with a self-service tool is email/phone support. For a large-scale project, you need a dedicated one-on-one support person. Period! Ask your vendor if it is included in the pricing; if not, ask for it. 

Data as a service: The vendor is responsible for scraper maintenance, and one-on-one support usually comes with Data as a Service. However, you have to ask the vendor whether it is part of the deal. 

Conclusion: 

At the end of the day, your worry should be about the total cost of ownership, not the monthly subscription. Make sure you do a thorough evaluation of TCO before signing the web scraping service contract. 

Need a free consultation? Get in touch with us today. 
