Tony Paul
- May 26, 2017
- 3 min read

Scrape The Internet To Get Training Data For Your Machine Learning Model

Updated: Feb 12, 2021

Have you seen the latest season of the Silicon Valley series? In it, Erlich Bachman asks Jian Yang to scrape the internet to train his classifier for a food app. He literally asked Jian to do it; however, Jian refused, and his food discovery app idea was crushed.

The moral of the story is, if you need scraped data to train your classifier, better contact someone like Datahut. Be smart and don’t be like Erlich Bachman.

Machine learning is a member of the family of buzzwords in tech.

For those who don’t know what machine learning is, here is a simple explanation.

We feed machines data, help them reason logically using algorithms, and then let them generalize what they’ve learned to a new data set. Generally, the more good-quality data you have to feed in, the better the output will be.

If you know what machine learning is, you probably know why training data is important for machine learning projects. Training a machine learning model involves providing the learning algorithm with a set of training data to learn from.

Assume you want to train a machine learning model to predict whether the content of a website is age appropriate for 10-year-old kids or not. You provide the algorithm with training data that contains both age-appropriate and age-inappropriate content, each with a label. The algorithm learns from this data, resulting in a model that attempts to predict whether new content is age appropriate or not.
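To make the idea concrete, here is a minimal sketch of learning from labeled text, using a tiny naive Bayes classifier written with only the Python standard library. Naive Bayes is just one possible algorithm, chosen here because it is short. The four training sentences, the label names, and the function names are all made up for illustration; a real project would need far more data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of training documents
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up, hand-labeled examples purely for illustration.
training_data = [
    ("fun science experiments and puzzles for kids", "appropriate"),
    ("storybook animals and colouring games", "appropriate"),
    ("graphic violence and explicit adult content", "inappropriate"),
    ("explicit adult videos gambling and betting", "inappropriate"),
]

model = train(training_data)
print(predict(model, "colouring puzzles for kids"))
print(predict(model, "explicit adult gambling"))
```

With only four examples the model is a toy, which is exactly why the amount and quality of training data matters so much.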

Getting accurate training data is a big pain. Companies will search heaven and earth for the data; however, one source many people ignore is the internet. There is a ton of data available on the internet that could be used to train machine learning applications, like the one in the example above.

Here are a few things you need to know about getting data from the internet:

Find websites which have training data

For the example above (age-appropriate content), one good source of age-appropriate data would be websites like Curiousworld.

Sources of age-inappropriate content would be websites on sexual health or porn.

Extract and Structure the data

Extracting and structuring the data is a pain. You need an excellent platform that can support you in getting training data at scale. Here are a few problems with extracting data at scale.
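As a rough illustration of what "extract and structure" means for a single page, the sketch below pulls the visible text out of an HTML string with Python's built-in html.parser and stores it as a labeled record. The sample HTML, the example.com URL, the record layout, and the names TextExtractor and to_record are assumptions made for this example, not a description of any particular scraping platform.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def to_record(url, html, label):
    """Turn raw HTML into one structured, labeled training record."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": " ".join(parser.parts), "label": label}

# Hypothetical page content standing in for a downloaded page.
sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Volcano facts</h1><p>Lava is hot.</p>"
    "<script>track()</script></body></html>"
)

record = to_record("https://example.com/volcano", sample_html, "appropriate")
print(record)
```

Doing this for one page is easy. The problems below show up when you do it for thousands of pages across many sites.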

Pattern Changes

Every website changes its design now and then, and web scrapers have to change with it. Scrapers usually need adjustments every few weeks, because even a minor change on the target website that affects the fields you scrape can either give you incomplete data or crash the scraper, depending on the scraper’s logic. Bad training data is the last thing you need to feed into your algorithm.
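One way to notice a pattern change early is to have the scraper report which expected fields it failed to find, instead of silently returning incomplete records. The sketch below is deliberately simplistic and assumes that each field lives in an element with a known class name. The class names, the two sample layouts, and the names FieldParser and scrape are hypothetical.

```python
from html.parser import HTMLParser

class FieldParser(HTMLParser):
    """Capture the first text found inside elements with a watched class name."""

    def __init__(self, field_classes):
        super().__init__()
        self.field_classes = set(field_classes)
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.field_classes:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            # Keep only the first piece of text seen for each field.
            self.fields.setdefault(self._current, data.strip())

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self._current = None

def scrape(html, required):
    """Return the extracted fields and a sorted list of missing field names."""
    parser = FieldParser(required)
    parser.feed(html)
    parser.close()
    missing = sorted(set(required) - set(parser.fields))
    return parser.fields, missing

REQUIRED = ["title", "body"]
old_layout = '<h1 class="title">Volcano facts</h1><div class="body">Lava is hot.</div>'
# The site renamed the "title" class, as happens in a redesign.
new_layout = '<h1 class="headline">Volcano facts</h1><div class="body">Lava is hot.</div>'

for html in (old_layout, new_layout):
    fields, missing = scrape(html, REQUIRED)
    if missing:
        print("layout may have changed, missing fields:", missing)
    else:
        print("ok:", fields)
```

A check like this will not fix the scraper for you, but it keeps half-empty records out of the training set until someone updates the extraction rules.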

Quality of data

Records that do not meet the quality guidelines will affect the overall integrity of the data. Making sure the data meets the quality guidelines while crawling is difficult because the checks need to be performed in real time. Faulty data can cause serious problems if you are using any ML or AI technologies on top of the data.
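A simple form of quality checking is to validate each record as it comes out of the crawler and set aside the ones that fail. The rules below (a well-formed URL, a known label, a minimum word count) are example guidelines invented for this sketch, as are the names ALLOWED_LABELS, check_record, and split_by_quality. Real guidelines depend on the project.

```python
ALLOWED_LABELS = {"appropriate", "inappropriate"}

def check_record(record, min_words=5):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    url = record.get("url") or ""
    text = (record.get("text") or "").strip()
    if not url.startswith(("http://", "https://")):
        problems.append("bad url")
    if record.get("label") not in ALLOWED_LABELS:
        problems.append("unknown label")
    if len(text.split()) < min_words:
        problems.append("text too short")
    return problems

def split_by_quality(records):
    """Separate records into accepted ones and (record, problems) rejects."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected

# Made-up records standing in for crawler output.
scraped = [
    {"url": "https://example.com/a",
     "text": "Volcanoes erupt when magma reaches the surface.",
     "label": "appropriate"},
    {"url": "https://example.com/b", "text": "", "label": "appropriate"},
    {"url": "example.com/c",
     "text": "A long enough piece of text for the check.",
     "label": "maybe"},
]

accepted, rejected = split_by_quality(scraped)
print(len(accepted), "accepted")
for record, problems in rejected:
    print(record["url"], "rejected:", problems)
```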

Extracting Data repeatedly

Your machine learning algorithms will always be hungry for new data. One of the best ways to improve the accuracy of a machine learning system is to feed it better training data. You should keep getting updated data from existing and new sources.
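When you scrape the same sources again and again, much of what comes back is content you already have. One possible way to keep only the new material is to fingerprint each record's text and skip fingerprints seen before. This is a minimal sketch: the record layout and the names fingerprint and merge_new are assumptions, and hashing normalized text only catches exact repeats, not lightly edited pages.

```python
import hashlib

def fingerprint(text):
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def merge_new(seen, dataset, fresh_records):
    """Append unseen records to the dataset and return how many were added."""
    added = 0
    for record in fresh_records:
        key = fingerprint(record["text"])
        if key not in seen:
            seen.add(key)
            dataset.append(record)
            added += 1
    return added

seen, dataset = set(), []

# Made-up results of two scraping runs against the same source.
first_run = [
    {"text": "Lava is hot.", "label": "appropriate"},
    {"text": "Bees make honey.", "label": "appropriate"},
]
second_run = [
    {"text": "lava  is HOT.", "label": "appropriate"},   # repeat of an old page
    {"text": "Owls hunt at night.", "label": "appropriate"},
]

added_first = merge_new(seen, dataset, first_run)
added_second = merge_new(seen, dataset, second_run)
print(added_first, added_second, len(dataset))
```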

Clearly, web data can help make machine learning systems efficient. However, if you don’t have something that can deal with these challenges at scale, better get help from someone who already has that technology. Datahut has helped companies get data for their machine learning projects.

Get in touch if you need help.

Or here is the app demo from the Silicon Valley series.

#machinelearning #webscraping