Web Scraping Tutorial And Use Case
In a nutshell, web scraping is the process of extracting data from websites. The entire job is carried out by a piece of code called a “scraper”. First, it sends a “GET” request to a specific website. Then, it parses the HTML document it receives in response. Finally, the scraper searches the document for the data you need and converts it into the format you specify. (A minimal sketch of the GET step follows the list below.)
The data can be the following:
- product items;
- images;
- videos;
- text;
- contact information, e.g. emails, phone numbers, etc.
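To make that first step concrete, here is a minimal sketch of the “GET” part in Ruby, using only the standard library (the URL is just a placeholder):

```ruby
require 'net/http'
require 'uri'

# Send a GET request and receive the raw HTML document to parse later.
# The URL below is a placeholder, not a real target.
uri      = URI.parse('https://example.com/some-page')
response = Net::HTTP.get_response(uri)

if response.is_a?(Net::HTTPSuccess)
  html = response.body        # this string is what the scraper will parse next
  puts html[0, 200]           # peek at the first characters to verify the fetch
else
  warn "Request failed: #{response.code} #{response.message}"
end
```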
“What do I use for web scraping?”
- Standalone services that work through an API or have a web interface (Embedly, Diffbot, etc.)
- Various open-source projects implemented in different programming languages (Python: Goose, Scrapy; PHP: Goutte; Ruby: Readability, Morph, etc.).
Also, you can always make your own web scraping tool. Luckily, there are plenty of libraries available. For example, you can use the Nokogiri library to make a Ruby-based scraper.
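To give a rough idea of what such a DIY scraper looks like, here is a small Nokogiri-based sketch that pulls out a few of the data types listed earlier (it assumes the nokogiri gem is installed; the URL and selectors are illustrative):

```ruby
require 'nokogiri'
require 'open-uri'

# Fetch and parse the page, then extract a few kinds of data.
# Real websites need their own selectors; these are only examples.
doc = Nokogiri::HTML(URI.open('https://example.com/some-page'))

title  = doc.at_css('title')&.text&.strip
images = doc.css('img').map { |img| img['src'] }.compact
emails = doc.text.scan(/[\w.+-]+@[\w-]+\.[\w.]+/).uniq

puts title
puts images.first(5)
puts emails
```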
“Are there any challenges I may want to know about?”
Yes, there are. Drawing on our extensive web scraping experience, we’ve put together a list of things that can prevent you from taking full advantage of web scrapers.
- Most websites simply differ in layout.
- Whether amateurs or pros, not all web developers follow style guides. As a result, their markup often contains mistakes that make it very hard for scrapers to read.
- Many websites are built with HTML5, where any element can be unique, so there is no universal structure to rely on.
- Content copy protection, e.g. multi-level layouts, JavaScript-based content rendering, user-agent validation, etc.
- Some websites change their layouts depending on the season or on the subject of the content itself. Keeping up with these changes requires a lot of time and effort.
- An abundance of ads, floods of comments, too many navigation elements, etc.
- The page markup can contain links to the same image in different sizes, e.g. a preview thumbnail next to the full-size version.
- Since many websites pick the page language based on your location, the content may not always be displayed in English.
- Websites can use their own character encoding, which cannot simply be negotiated via the request and has to be detected and handled on your side (this issue and the user-agent checks above are illustrated in the sketch below).
All these factors directly affect the quality of the content, lowering it by an unacceptable 10% or even 20%. (As if the Internet weren’t already full of pranks and inaccurate information =/)
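Two of the issues above, user-agent validation and custom encodings, can at least be partly worked around at fetch and parse time. A minimal sketch (the header value and the encoding name are arbitrary examples):

```ruby
require 'net/http'
require 'uri'
require 'nokogiri'

uri = URI.parse('https://example.com/some-page')

# Some sites check the User-Agent header, so send a browser-like one.
# The value below is only an example.
request = Net::HTTP::Get.new(uri)
request['User-Agent'] = 'Mozilla/5.0 (compatible; MyScraper/1.0)'

response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(request)
end

# If the site uses a non-standard encoding, pass it to the parser explicitly.
# 'windows-1251' is just an example of a non-UTF-8 encoding.
doc = Nokogiri::HTML(response.body, nil, 'windows-1251')
puts doc.at_css('title')&.text
```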
“…But I’m dying to scrape some websites! What should I do?”
Basically, it all boils down to the following options:
- If the number of websites you’re going to scrape data from is small, it’s better to write your own scraper and customize it for each specific website. In this case, the quality of the output content should be 100%.
- If the number of websites to scrape goes beyond “small”, we suggest a more comprehensive approach. In this case, the output content quality should be close to 95%.
“What does that look like in practice?”
First, you need to create a mechanism that fetches the HTML code with a GET request. Next, inspect the DOM structure of the website to identify the nodes containing the target data. After that, create a node processor that outputs the data in a normalized format. The choice of format is usually based on either your client’s requirements or your data processing preferences; we use JSON, for example. And that’s pretty much it: with these pieces in place, you can assemble the scraping system.
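As an illustration, a node processor that turns one page into a normalized JSON record could look roughly like this (the field names are our own example, not a fixed schema):

```ruby
require 'nokogiri'
require 'open-uri'
require 'json'

# Turn one page into a normalized record; the field names are illustrative.
def scrape(url)
  doc = Nokogiri::HTML(URI.open(url))
  {
    url:        url,
    title:      doc.at_css('title')&.text&.strip,
    paragraphs: doc.css('p').map { |p| p.text.strip }.reject(&:empty?),
    images:     doc.css('img').map { |img| img['src'] }.compact
  }
end

puts JSON.pretty_generate(scrape('https://example.com/some-page'))
```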
We’ve called ours “Duck System”. Why? Well, we just thought life’s way too short to come up with fancy names for a piece of code =)
Now, let’s break it down. The system receives a URL as input and outputs normalized data. Upon receiving the URL, the system decides which reader should process it. Priority goes to the highest-quality reader, the one with proper customizations for that site. If there is no such reader, the URL is forwarded to the default reader, which is usually either the most stable generic scraper or some third-party service.
As you can see, there is another scraper on the right side of the scheme. It comes into play only when the default reader fails to read an incoming URL. To keep this working, the reader database is constantly updated, either by the developers or by the system admin.
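A stripped-down sketch of that dispatch step is below; the reader classes are hypothetical stand-ins, not the actual Duck System code:

```ruby
require 'uri'

# Hypothetical readers standing in for the ones described above.
class CustomReader                  # high-quality reader customized for one site
  def read(url)
    "custom result for #{url}"
  end
end

class DefaultReader                 # the most stable generic scraper
  def read(url)
    "default result for #{url}"
  end
end

class FallbackReader                # e.g. a third-party service
  def read(url)
    "fallback result for #{url}"
  end
end

# Per-host readers take priority; unknown hosts go to the default reader.
READERS = { 'news.example.com' => CustomReader.new }.freeze

def process(url)
  reader = READERS.fetch(URI.parse(url).host, DefaultReader.new)
  begin
    reader.read(url)
  rescue StandardError              # the default reader failed to read the URL
    FallbackReader.new.read(url)
  end
end

puts process('https://news.example.com/article/1')
puts process('https://unknown.example.org/post/2')
```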
We also recommend implementing some sort of feedback mechanism so you can promptly receive complaints about low-quality content, if any.
Using such a system enables us to achieve the highest content quality. However, everything has its price. In this case, the downsides are the increased processing time and server resources, as well as the fact that the third-party service requires a paid subscription. Also, it’s quite possible that these expenses will exceed what you’ve spent on the server infrastructure and the developers’ work.
The main advantage of going for your own solution to scrape a small number of websites is the processing speed (roughly 7 ms per web page). But what about bandwidth and upload file size limitations? We’ve solved this problem by downloading the media and the main content asynchronously in the background. As a result, files of up to 100 MB can be downloaded in the blink of an eye, while the quality of the output content remains at 100%.
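A rough sketch of that background-download idea using plain Ruby threads; the size cap, directory name, and URL are illustrative:

```ruby
require 'open-uri'
require 'fileutils'

MAX_BYTES = 100 * 1024 * 1024       # illustrative 100 MB cap per file

# Download media files in background threads so the main content can be
# returned immediately; a failed download must not break the main flow.
def download_media_async(urls, dir = 'media')
  FileUtils.mkdir_p(dir)
  urls.map do |url|
    Thread.new do
      begin
        data = URI.open(url).read
        next if data.bytesize > MAX_BYTES
        File.binwrite(File.join(dir, File.basename(URI.parse(url).path)), data)
      rescue StandardError
        nil                         # log and move on in a real system
      end
    end
  end
end

threads = download_media_async(['https://example.com/media/photo.jpg'])
# ... return the parsed main content right away ...
threads.each(&:join)                # join later only if you need to wait for the files
```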
Wrap-up
If you’re going to develop your own web scraping system, we’ll gladly share more of our experience. After all, implementing an efficient scraping system is a much more challenging task than it may seem at first, which is why it’s important to consider all the possible issues and pitfalls from the very beginning. Anyway, if you have any related questions, feel free to ask our experts in the comment section below! Also, if you liked the article, don’t forget to share it with your friends by clicking the share buttons! =)