What is web scraping? Is it legal?
Things to consider before diving into web scraping.
Forget Kaggle, forget the open datasets, forget the APIs. Web scraping is where the fun is now. There seems to be a way to extract data from almost any website on the internet. But web scraping also pushes the hazy boundaries of ethics, and in many cases it can be considered illegal, or at least feels that way.
What is Web Scraping?
The process of automatically extracting data from a website using bots is called web scraping. There are three common ways to do it:
- Third-Party Tools
- Browser Automation Tools
- Creating custom scrapers
Third-party tools, such as Chrome extensions and paid services like ParseHub and Diffbot, are widely used by organizations and individuals to extract data from the web.
The con? It can get pretty expensive if a lot of scraping needs to be done.
Browser automation tools like Selenium can also be used to extract data from a website. The script drives a real browser, posing as an individual user and quite literally typing in the URL and clicking buttons, so you can extract whatever is present on that web page.
The con? This method is slower than the others, and websites that look for automated browsers can detect and block it.
Creating custom scrapers with a programming language such as Python and a library like BeautifulSoup is usually the cheapest, fastest, and most reliable way to scrape data. This is the most widely used method, and it is the one where knowing your boundaries matters most.
Let us dive into how this works, using Python and BeautifulSoup as the example.
BeautifulSoup is a Python library that parses HTML and XML documents and lets you extract data from them. Combined with the Requests library, which downloads the page, it can get you the data from almost any website.
Both Requests and BeautifulSoup have their limitations, but there are ways to work around them.
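As a rough, dependency-free sketch of the parsing step, the snippet below pulls post titles out of a small HTML string. It makes several assumptions for illustration: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, the HTML is a made-up inline sample rather than a page downloaded with Requests, and the `post-title` class and the helper names are hypothetical.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for the body of a downloaded page (assumption).
SAMPLE_HTML = """
<html><body>
  <h2 class="post-title">First post</h2>
  <p>Intro text</p>
  <h2 class="post-title">Second post</h2>
  <h2 class="sidebar">Ignore me</h2>
</body></html>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="post-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and dict(attrs).get("class") == "post-title":
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h2>.
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def extract_titles(html):
    """Return the stripped text of each post title found in the HTML."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))  # ['First post', 'Second post']
```

In a real scraper the HTML would typically come from something like `requests.get(url).text`, and with BeautifulSoup the whole parser class would shrink to roughly `BeautifulSoup(html, "html.parser").find_all("h2", class_="post-title")`.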
Requests from a single IP address can easily be banned by a website, but IP rotation services can solve that problem if one is willing to go that far. For BeautifulSoup, pagination and lazy loading are a big issue, because it only sees the HTML it is given and does not run JavaScript. Calling the unprotected API endpoints that the page itself loads its data from is one possible way around this.
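To make the pagination point concrete, here is a minimal sketch of walking through a paginated JSON endpoint until it runs out of pages. Everything about the endpoint is an assumption: the response shape (`items` and `next_page` keys) is invented, and `fake_fetch_page` is an in-memory stand-in for a real HTTP call so the example runs offline.

```python
import time


def collect_all_items(fetch_page, delay_seconds=0.0, max_pages=100):
    """Follow 'next_page' pointers and gather every item.

    fetch_page is any callable that takes a page number and returns a dict
    shaped like {"items": [...], "next_page": <int or None>} (hypothetical).
    """
    items = []
    page = 1
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        payload = fetch_page(page)
        items.extend(payload["items"])
        page = payload.get("next_page")
        if page is None:
            break
        time.sleep(delay_seconds)  # pause between calls to go easy on the server
    return items


# In-memory stand-in for a paginated API (made-up data).
FAKE_API = {
    1: {"items": ["a", "b"], "next_page": 2},
    2: {"items": ["c"], "next_page": 3},
    3: {"items": ["d", "e"], "next_page": None},
}


def fake_fetch_page(page):
    return FAKE_API[page]


print(collect_all_items(fake_fetch_page))  # ['a', 'b', 'c', 'd', 'e']
```

Against a real site, `fetch_page` might wrap something like `requests.get(api_url, params={"page": page}).json()`, with the parameter and key names adjusted to whatever that site actually uses, and only where the site's terms allow it.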
With enough dedication, one can send a request, fetch the data from the website automatically, create a custom parser for that particular website to extract information using BeautifulSoup, and do whatever they want with the collected data.
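As for what happens to the collected data, a common next step is simply saving it in a tabular format. The sketch below assumes the parser has already produced a list of dictionaries; the field names and sample records are made up for illustration.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical output of a custom parser.
records = [
    {"title": "First post", "url": "https://example.com/first"},
    {"title": "Second post", "url": "https://example.com/second"},
]

print(records_to_csv(records, ["title", "url"]))
```

Writing to an in-memory buffer keeps the example side-effect free; swapping `io.StringIO()` for `open("posts.csv", "w", newline="")` would write a file instead.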
“Well.. web scraping sounds fun. Why shouldn’t I do it?”
Because websites don’t want you to.
No, seriously. Most websites do not have any protection against web scraping, but they don't want bots messing around with their traffic and stealing the data.
I say ‘stealing’ because the data on the website is essentially the property of the owner of the website, and they publish policies explicitly asking users not to use any automation tools on their platform.
BUT... since the data is publicly available, it is generally NOT ILLEGAL to scrape it. If a website has a bunch of blog posts, it is legal to scrape them, provided that you are not breaking any copyright law or causing the website any monetary harm. You can scrape the data of a public directory, analyze it, and use it in your business model. You can scrape the data from a real estate website to figure out where to buy your next house. You can also scrape real-time bitcoin prices if you wish to.
The problem arises if you start using that data to replicate what the original source of the data is meant to do. You can fairly easily duplicate the structure of a website using the HTML that you scraped and fill it up with the data that you extracted. Publishing that data directly on your own platform can land you in trouble, and it has for others. Check out this interesting appeal/warning from Zomato to Burrp, India’s oldest food tech startup.
In a landmark ruling, a US appeals court found that scraping publicly available data is likely not hacking under the Computer Fraud and Abuse Act. The case arose after LinkedIn sent a cease-and-desist letter to hiQ Labs, a company that uses public data to analyze employee attrition, and hiQ took the dispute to court. LinkedIn claimed that the mass web scraping done by hiQ Labs amounted to hacking (?) and hence violated a bunch of laws.
Even though there is no clear legal definition of "stealing" publicly available data with an automation script or tool, and few laws written specifically to prevent it, web scraping still has limits, set by ethics and by the company policies you agree to before scraping data. Besides, not EVERY website wants you to keep your bots away. So go ahead and give it a try, but make sure you stick to the terms and conditions of the website.
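One small, mechanical way to respect a site's wishes is to check its `robots.txt` before fetching a page. Note that `robots.txt` is not discussed above and is no substitute for reading the terms and conditions; this is only a sketch, using Python's standard-library `urllib.robotparser` with a made-up policy and a hypothetical bot name.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one lives at https://<site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a queryable policy object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(SAMPLE_ROBOTS_TXT)

# "my-bot" is a hypothetical user agent name.
print(policy.can_fetch("my-bot", "https://example.com/blog/post-1"))   # True
print(policy.can_fetch("my-bot", "https://example.com/private/data"))  # False
print(policy.crawl_delay("my-bot"))  # 5
```

A scraper could call `can_fetch` before each request and sleep for the crawl delay between requests, skipping any URL the policy disallows.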
Disclaimer: This article is written only for sharing information and should not be considered legal advice.