The Most Effective Web Scraping Methods
Hey, guys. The clock is ticking, and new technologies appear every single day. As Fred Thompson once said: “Information is power and those who have access to it are powerful.” And he’s absolutely right! Remember, we’ve already talked about data scraping tools. In this episode, we’ll look at the most effective methods of web scraping. But before we learn how to collect data effectively, let’s find out why people started doing web scraping in the first place.
The main reason behind the internet’s problem, and behind the appearance of web scraping, is the abundance of choice. Web optimization for various mobile platforms, growing internet speeds, new technological and software solutions, and changing design trends have transformed the Internet beyond recognition. Today’s internet is the wellspring of the world’s knowledge, where you can find the answer to any question. Nevertheless, this variety of techniques and methods used to build websites sometimes becomes a problem rather than a feature.
Usually, there are two ways to gain access to the content you’re interested in:
- The first one is to use an API. In this case, owners grant access to users because they have a vested interest in subscriptions, newsletters, and so on.
- The second one is to do without an API. This is where web scraping comes into play.
The API-based method depends entirely on the consumer’s financial resources. However, it’s not a big deal in terms of technical implementation. All the information you obtain through an API will be both well structured and normalized, for example in XML or JSON format.
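To give you an idea of how little processing the API route requires, here is a minimal sketch in Python. The endpoint, API key, and field names are hypothetical placeholders, not a real service:

```python
# A minimal sketch of the API approach: the endpoint, key, and field names
# below are hypothetical placeholders, not a real service.
import requests

API_URL = "https://api.example.com/v1/products"  # hypothetical endpoint
API_KEY = "your-api-key"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"category": "laptops", "limit": 50},
    timeout=10,
)
response.raise_for_status()

# The API returns data that is already structured and normalized (JSON here),
# so no HTML parsing is needed at all.
for product in response.json().get("items", []):
    print(product.get("name"), product.get("price"))
```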
The second method is where web scraping actually comes in. It becomes a tough challenge for developers and mathematicians. Just try to grasp the scope of this technology: during scraping, text is automatically processed using both Artificial Intelligence and semantic interpretation. Isn’t that a technological breakthrough?
So the question is, “Which method is better?” That actually depends on both your preferences and financial resources. In a few minutes, we’ll look at the pros and cons of these methods so you’ll be able to understand which one meets your needs.
Copy/Paste Method
Delicious copypasta. This is probably one of the simplest methods. You manually search for the information you need and copy it (into your own document, a publication on your resource, a database, and so on). This method is widely used for small blogs or shops with a limited selection.
Pros:
- High-quality content, adapted to the consumer’s needs.
- High-speed search.
Cons:
- It requires certain knowledge and skills in Internet search, as well as an understanding of your target area.
- Humans are vulnerable to psychological and physical strain, which can negatively affect both the stability and the cost of their work.
- This scraping method yields only a limited quantity of quality results (up to a few hundred per day).
Regular expressions and match capture in the text
This is a simple but at the same time highly effective method of retrieving information from the Internet. It becomes even more effective when combined with UNIX command-line tools (‘curl’, for example). Regular expressions are available in most programming languages. In our case, we implemented this web scraping method for several projects in Python and Ruby.
Choose this method if you’re monitoring several information providers yourself and scraping separate pieces of data (a product name, its price, phone numbers, emails, and so on). Using this method, you’ll spend about an hour per website, provided you don’t run into pitfalls such as JS rendering.
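Here is a minimal sketch of what this looks like in Python, combining ‘curl’ with regular expressions. The URL and the patterns are placeholder examples, not taken from a real project:

```python
# A minimal sketch of the regex approach combined with `curl`.
# The URL and patterns are placeholder examples.
import re
import subprocess

# Fetch the raw HTML with curl (any HTTP client would do instead).
html = subprocess.run(
    ["curl", "-s", "https://example.com/catalog"],
    capture_output=True, text=True, check=True,
).stdout

# Capture prices like "$1,299.00" and e-mail addresses from the page.
prices = re.findall(r"\$\d[\d,]*(?:\.\d{2})?", html)
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)

# Strip leftover HTML tags from a captured fragment ("clearing the
# remainders of HTML code", as mentioned in the pros below).
clean_title = re.sub(r"<[^>]+>", "", "<b>Acme Laptop 15</b>")

print(prices, emails, clean_title)
```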
Pros:
- If you already have experience with regular expressions, you won’t spend much time on implementation.
- Regular expressions let you remove minor noise from the result without damaging the main content (for example, clearing out the remainders of HTML code).
- Regular expressions are supported by the majority of programming languages, and what’s really cool is that their syntax remains almost unchanged from one language to another. This makes it possible to migrate projects to programming languages with higher performance and clearer code (for example, from PHP to Ruby).
Cons:
- Regular expressions may become a tough challenge for people who have never worked with them. In that case, it’s better to use the services of a specialist. As usual, you may face difficulties when integrating solutions from one language into another or when migrating projects to other programming languages.
- Regular expressions are hard to read and analyze.
- If the destination resource changes its HTML code or adds a tag, you’ll have to change the regular expression.
HTTP requests (HTML code review)
Using this method, you’ll be able to get dynamic and static pages by sending HTTP requests to remote servers. It relies on socket programming and sorts all the responses using pre-defined information about the target elements (their classes and ids).
This method is a good choice for almost any project. Although it’s more complicated to implement, it lets you get more data in a shorter period of time.
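As a minimal sketch of this approach in Python, the snippet below sends a request and then sorts the response using pre-defined classes of the target elements. The URL, class names, and the use of `requests` plus BeautifulSoup are illustrative assumptions, not the only possible implementation:

```python
# A minimal sketch of the HTTP-request approach: fetch a page, then sort the
# response using pre-defined classes/ids of the target elements.
# URL and class names are hypothetical.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

response = requests.get(
    "https://example.com/catalog?page=1",
    headers={"User-Agent": "Mozilla/5.0 (compatible; demo-scraper/0.1)"},
    timeout=10,
)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Pick out elements by their known class names.
for card in soup.find_all("div", class_="product-card"):
    name = card.find("h2", class_="product-name")
    price = card.find("span", class_="product-price")
    print(name.get_text(strip=True) if name else "?",
          price.get_text(strip=True) if price else "?")
```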
Pros:
- This method gives you the page source in the form of HTTP responses.
- A tremendous number of responses, limited only by the server’s resources and your Internet speed.
Cons:
- Responses need further processing, as the results may contain superfluous data.
- Most websites have protection against such “robots”. You can add extra service data to the HTTP request headers, though not every website can be fooled that way (see the sketch after this list).
- If your interest in a resource looks suspiciously repetitive, there is a high probability that you’ll be banned.
- A remote server may go down while a request is being sent, so expect a large number of timeout-like errors.
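The sketch below shows one way to soften the last two cons: extra header data so the request looks less like a “robot”, plus a timeout and simple retries for servers that go down mid-request. All header values and the retry policy are illustrative assumptions:

```python
# A minimal sketch: extra request headers plus a timeout and retries with
# a growing delay. All values here are illustrative.
import time
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}

def fetch(url, retries=3, timeout=10):
    """Fetch a URL, retrying with a growing delay on timeouts and errors."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # back off to look less suspicious

html = fetch("https://example.com/catalog")
```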
Analyzing the DOM structure based on screen scraping
Dynamic content is one of the main challenges of web scraping. How do you solve it? In this case, we’ll need a web browser capable of rendering dynamic content and running scripts on the client side. Additionally, you may use various plugins. They are pretty good but not that effective. On the other hand, with such plugins you can forget about cookies, regular expressions, HTTP, and so on.
Analyzing the DOM structure based on screen scraping is a good choice for both small and large-scale projects. Nevertheless, automating such a project is quite challenging from a technical perspective. That didn’t prevent our team from solving all the challenges and successfully implementing this method: to achieve the goal, we wrote a browser emulator and a “virtual screen” handler with intelligent node search in the DOM structure.
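For illustration, here is a minimal sketch of the same idea using a headless browser. It uses Selenium as one common way to render client-side scripts and search the resulting DOM; it is not the in-house emulator described above, and the URL and selectors are placeholders:

```python
# A minimal sketch of screen scraping with a headless browser (Selenium).
# URL and CSS selectors are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # render without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/catalog")
    # Give client-side JavaScript time to fill the DOM, then search the nodes.
    driver.implicitly_wait(10)
    for card in driver.find_elements(By.CSS_SELECTOR, "div.product-card"):
        print(card.find_element(By.CSS_SELECTOR, ".product-name").text,
              card.find_element(By.CSS_SELECTOR, ".product-price").text)
finally:
    driver.quit()
```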
Pros:
- The possibility to get dynamic content.
- Automation, which lets you get quality content in large amounts.
- The possibility to use commercial solutions. With bought or rented software, you also get vendor support for solving your challenges.
Cons:
- Implementation complexity and the server’s workload during automation make this scraping method quite resource-intensive in terms of development. It will be especially challenging for beginners: for a proper implementation, a specialist needs a solid understanding of “hardware”, the basics of web programming, and knowledge of at least one server-side programming language.
- This method is really expensive.
Methods of artificial intelligence and ontologies
Let’s assume that you need to scrape hundreds, no, thousands and thousands of various websites. All these sites are written in different languages and frameworks with different layouts. In such situations, it’s more reasonable to invest in developing artificial intelligence and ontology methods. This approach is based on the idea that all websites can be divided into classes and groups with similar structures and technology stacks.
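To make that idea a little more concrete, here is a toy sketch of the underlying principle only: describe each page by its tag structure and group structurally similar pages into classes. A production “engine” is far more involved; all data and library choices here are illustrative assumptions:

```python
# A toy sketch: represent pages by their tag structure and cluster similar
# pages into classes. Illustrative only; not a production engine.
from collections import Counter

from bs4 import BeautifulSoup  # pip install beautifulsoup4
from sklearn.cluster import KMeans  # pip install scikit-learn
from sklearn.feature_extraction import DictVectorizer

def tag_profile(html: str) -> dict:
    """Represent a page as counts of its HTML tags."""
    soup = BeautifulSoup(html, "html.parser")
    return Counter(tag.name for tag in soup.find_all(True))

# In reality these would be thousands of downloaded pages.
pages = [
    "<html><body><div class='shop'><ul><li>a</li><li>b</li></ul></div></body></html>",
    "<html><body><div><ul><li>x</li><li>y</li><li>z</li></ul></div></body></html>",
    "<html><body><article><h1>Post</h1><p>text</p><p>more</p></article></body></html>",
]

features = DictVectorizer(sparse=False).fit_transform([tag_profile(p) for p in pages])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)  # pages with similar structure fall into the same class
```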
Pros:
- Such a system will allow you to get the highest-quality content from a great number of domains. If pages change, the intelligent system will correct the deficiencies itself. Quality control across 150 thousand domains ranges from 75% to 93% (based on our own research).
- This method lets you normalize the results obtained from all the resources according to your database structure.
- Although the system needs constant support (in terms of monitoring), it rarely requires significant interference in the code.
Cons:
- Implementing such a complicated “engine” requires in-depth knowledge of math, statistics, and fuzzy logic.
- High development costs.
- High spending on both supporting and training the system.
- Subscriptions for commercial projects, which usually means a restricted number of requests at a high cost.
- The need to account for bug-tracking tools, data validation, and backup proxy servers to bypass website blacklists.
The Bottom Line
In this episode, we looked at some of the most effective web scraping methods. The majority of IT companies, including ours, actively use them in accordance with their goals and preferences. Choose your method carefully, taking into account your area of expertise as well as your information needs, and enjoy the results of this magic technology.