What web scraping can and can’t do for you

Big data is disrupting a lot of industries. That is the good news. The bad news is that many people don’t really understand how big data works. This is a real problem for sales and marketing teams at big data companies: they spend a lot of time educating prospects, and sometimes the prospects still don’t get it. At Datahut we do web scraping and face this challenge every day, which is what inspired this blog post.

 

Let’s assume that you have a basic idea of what a website is and what the terms HTML, CSS and JavaScript mean. If you don’t understand those terms, please visit Codecademy.com. They have some simple introductory courses on the basics of programming.

 

How does web scraping work?

Every web page is written in HTML, and the HTML structure of a page follows certain patterns. A computer program can use those patterns to extract data from the page. A program that extracts data this way is called a web scraper or web spider. Every website follows a different pattern, so each one needs its own scraping logic.
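To make this concrete, here is a minimal sketch of a scraper written in Python using only the standard library. The HTML fragment, the class names (product, name, price) and the function names are all made up for illustration; a real site will have its own structure, and the parsing logic would have to be written to match it.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real sites use their own tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Blue Widget</span><span class="price">$19.99</span></li>
  <li class="product"><span class="name">Red Widget</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field whose text we are currently inside
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})  # start a new record
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field and self.products:
            self.products[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape_products(html):
    """Run the parser over one page and return the extracted records."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


products = scrape_products(SAMPLE_HTML)
print(products)
```

Notice that the logic is tied to this one layout. A site that puts prices in a table cell instead of a span would need a different parser.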

 

Web scraping is mostly about transforming unstructured data on the web into structured data that can be stored and analyzed in a central database or spreadsheet. Uses of web scraping include online price comparison, lead generation, market research, event aggregation and reputation monitoring.
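The "unstructured to structured" step might look something like the sketch below. It assumes the scraper has already pulled out raw strings; the sample records, the price_usd column name and the helper functions are hypothetical, and real data usually needs more cleaning rules than this.

```python
import csv
import io

# Hypothetical raw values, as they might come out of a scraper.
raw_records = [
    {"name": "  Blue Widget ", "price": "$19.99"},
    {"name": "Red Widget", "price": "$1,024.50"},
]


def clean_record(raw):
    """Normalize one scraped record: trim text, convert the price to a number."""
    price_text = raw["price"].replace("$", "").replace(",", "")
    return {"name": raw["name"].strip(), "price_usd": float(price_text)}


def to_csv(records):
    """Serialize cleaned records as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "price_usd"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


cleaned = [clean_record(r) for r in raw_records]
csv_text = to_csv(cleaned)
print(csv_text)
```

Once the data is in rows and columns like this, it can be loaded into a database or spreadsheet and analyzed with ordinary tools.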

 

Then it should be easy to scrape any website, right?

Not really. The points below explain why:

  • Websites change their patterns, and the new pattern conflicts with the scraping logic. As a result, the scraper stops working.
  • Some websites rely heavily on JavaScript, which makes scraping difficult.
  • Some websites use anti-scraping mechanisms that prevent data extraction.
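The first point is easy to demonstrate. The sketch below uses a regular expression purely for brevity (it is not a robust way to parse HTML), and the markup, class names and function names are invented for the example. The scraper is written for the old layout, so a simple class rename makes it return nothing even though the price is still on the page.

```python
import re

# Scraping logic written against the site's ORIGINAL markup (hypothetical).
PRICE_PATTERN = re.compile(r'<span class="price">([^<]+)</span>')

old_page = '<div><span class="price">$19.99</span></div>'
# After a redesign the site renames the class; the data is still there.
new_page = '<div><span class="product-price">$19.99</span></div>'


def extract_prices(html):
    """Return every price matched by the pattern the scraper was built for."""
    return PRICE_PATTERN.findall(html)


def extract_prices_checked(html):
    """Same extraction, but fail loudly instead of silently returning nothing."""
    prices = extract_prices(html)
    if not prices:
        raise ValueError("no prices found; the page layout may have changed")
    return prices


print(extract_prices(old_page))  # the original layout still matches
print(extract_prices(new_page))  # the redesigned layout yields an empty list
```

A check like extract_prices_checked does not fix the scraper, but it turns a silent gap in the data into an error that someone can notice and act on.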

What web scraping can’t do

Instead of giving a long, boring lecture, I think it is better to give some examples of what can’t be done.

1) It isn’t possible to scrape data from all the e-commerce websites in the US on an affordable budget.

2) It is not possible to build a single scraper that scrapes all websites.

3) It is not possible to crawl the whole web to retrieve only startup data.

Why can’t these things be done? The answer is simple: a computer can’t tell one type of data from another unless the difference is spelled out in programming logic. There is simply no programming logic that solves these problems.

What web scraping can do

1) A web scraper can extract data from a site by following a predefined path and logic.

2) A web scraper can turn unstructured web data into a structured format.
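As a rough illustration of "following a predefined path", the sketch below walks a listing page by page through its "next" links and collects one row per title. Everything here is hypothetical: the URLs, the markup, and the fetch function, which reads from an in-memory dictionary so the example runs without network access. Regular expressions are again used only to keep the example short.

```python
import re

# Hypothetical site, held in memory so the sketch runs without network access.
PAGES = {
    "/listings?page=1": (
        '<h2 class="title">Alpha</h2><h2 class="title">Beta</h2>'
        '<a rel="next" href="/listings?page=2">Next</a>'
    ),
    "/listings?page=2": '<h2 class="title">Gamma</h2>',
}

TITLE = re.compile(r'<h2 class="title">([^<]+)</h2>')
NEXT_LINK = re.compile(r'<a rel="next" href="([^"]+)">')


def fetch(url):
    """Stand-in for an HTTP request; a real scraper would download the page."""
    return PAGES[url]


def crawl(start_url, max_pages=10):
    """Follow 'next' links from start_url and collect one row per title."""
    rows, url, visited = [], start_url, 0
    while url is not None and visited < max_pages:
        html = fetch(url)
        visited += 1
        rows.extend({"page": url, "title": t} for t in TITLE.findall(html))
        match = NEXT_LINK.search(html)
        url = match.group(1) if match else None  # stop when there is no next page
    return rows


rows = crawl("/listings?page=1")
print(rows)
```

The path (start page, then each "next" link) and the extraction rule (every title heading) are both fixed in advance, which is exactly the kind of job a scraper handles well.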

 

Takeaway:

“Always listen to experts. They’ll tell you what can’t be done, and why. Then do it.”

― Robert A. Heinlein, Time Enough for Love

 

Thanks for reading this blog post. Datahut offers affordable data extraction services (DaaS). If you need help with your web scraping projects, let us know and we will be glad to help.

About Datahut

Datahut helps companies get structured data feeds from websites.