Data Scraping Tools Review: Goose Extractor

JetRuby Agency
Feb 1, 2017

Also known as web harvesters or web data extractors, scraping tools are developed specifically for collecting massive amounts of information from websites. Web scrapers require neither monotonous typing nor copy-pasting, which makes them extremely handy. There are a lot of tools available, but today we are going to elaborate on a specific open-source project called “Goose”.

Background

Originally written in pure Java, Goose was converted into a Scala-based project in August 2011. After that, open-source activity slowly faded until it stopped altogether in 2012. However, thanks to the efforts of Xavier Grangier, Goose was granted a second chance, this time in Python. The community responded really positively and set about rebuilding the project almost instantly. As a result, Goose’s library has grown a lot and now comprises more than three hundred active forks. Our team was quite impressed by the work done, so we started to contribute as well.

Goose can be used both as a library and as a separate application. Its major purpose, however, is to extract data from web pages. This includes:

  • Meta tags.
  • Meta descriptions.
  • Body text of an article and its title.
  • The main image of an article.
  • Any embedded YouTube/Vimeo videos.

The scraper’s code is also available on GitHub.

Installation and configuration

To install Goose, run the following commands (make sure you have Python installed on your machine first):
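The original commands didn’t survive the repost. Assuming installation from PyPI, where the Python port is published under the name goose-extractor, the step would be roughly:

```shell
# Install the Python port of Goose from PyPI
# (package name assumed: goose-extractor)
pip install goose-extractor
```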

Goose is now ready to work. You can try it in the Python console:
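The console snippet didn’t survive either. Based on the library’s documented API, a minimal session would look roughly like this (the URL is a placeholder, and the goose-extractor package must be installed):

```python
# Basic python-goose usage, following the API from the project's README.
# Requires the goose-extractor package; the URL below is a placeholder.
from goose import Goose

g = Goose()
article = g.extract(url='http://example.com/some-article')

print(article.title)               # the article's title
print(article.meta_description)    # contents of the meta description tag
print(article.cleaned_text[:150])  # first 150 characters of the body text
```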

Generally, the out-of-the-box code is great and suits most scraper requirements. That said, we decided to make it even better by implementing a few improvements of our own. For example:

This tweak makes it possible to retrieve the full list of images on a webpage.
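The patch itself wasn’t preserved in the repost. As a self-contained sketch of the idea (standard library only, not the actual Goose modification), collecting every image URL from a page might look like:

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag on a page."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

def all_images(html):
    """Return a list of all image URLs found in the given HTML."""
    parser = ImageCollector()
    parser.feed(html)
    return parser.images

html = '<html><body><img src="a.png"><p>text</p><img src="b.jpg" alt="x"></body></html>'
print(all_images(html))  # ['a.png', 'b.jpg']
```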
Here is another one:

This one returns normalized, clean HTML that is ready to republish.
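Again, the original code is missing. A standard-library sketch of the same idea (not the agency’s actual tweak) could normalize HTML by dropping script/style blocks and comments and collapsing runs of whitespace:

```python
from html.parser import HTMLParser

class HTMLCleaner(HTMLParser):
    """Re-emits HTML with <script>/<style> blocks and comments removed
    and runs of whitespace collapsed."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skipping = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skipping += 1
            return
        if not self._skipping:
            rendered = "".join(' %s="%s"' % (k, v) for k, v in attrs if v is not None)
            self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skipping = max(0, self._skipping - 1)
            return
        if not self._skipping:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Comments are dropped automatically (handle_comment is a no-op).
        if not self._skipping:
            self.out.append(" ".join(data.split()) or " ")

def clean_html(html):
    """Return normalized HTML with scripts, styles, and comments stripped."""
    cleaner = HTMLCleaner()
    cleaner.feed(html)
    return "".join(cleaner.out).strip()

dirty = '<div>  Hello   <script>alert(1)</script> <b>world</b>  </div>'
print(clean_html(dirty))
```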

Pros

One of the reasons we’ve been using Goose in commercial projects is the quality of the processed content at the output; compared to other solutions, it’s really good. If you only plan on scraping regular websites, the default version is ready to go “as is”. At the same time, thanks to its open-source nature, Goose can be customized and adjusted almost without limit.

Keep in mind, however, that changing the scraper’s code (e.g. adding a regular expression) requires knowledge of Python as well as some general programming experience.

Cons

Wide use of Goose’s library has revealed a number of downsides, the biggest of which concerns how content is fetched over HTTP. Here is a list of the pitfalls we bumped into:

  • Goose was unable to track errors in requests. If something went wrong, it simply returned an empty result.
  • The scraper seemed to have real problems with encoding detection. If a server returned an HTTP header with no encoding, Goose was unable to detect it.
  • It didn’t support proxies.
  • Outgoing requests weren’t limited in any way, so you could easily flood a website with traffic.
  • The out-of-the-box version didn’t allow for scraping dynamic content; enabling it required substantial code changes or external modules.
  • There was no mechanism to explicitly assign rules for certain domains.
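Several of these gaps (error reporting, encoding detection, proxies) can be closed by fetching the HTML outside Goose. As an illustrative standard-library sketch, not JetRuby’s actual module, a fetch wrapper with explicit error reporting, a timeout, and optional proxy support might look like:

```python
import urllib.request
import urllib.error

def fetch(url, timeout=10, proxy=None):
    """Fetch a URL and return (html, error). Unlike stock Goose,
    failures are reported instead of silently yielding an empty result."""
    handlers = []
    if proxy:  # e.g. "http://127.0.0.1:8080"
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout) as resp:
            # Respect the charset declared in the HTTP headers,
            # falling back to UTF-8 when the server omits it.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace"), None
    except (urllib.error.URLError, ValueError) as exc:
        return None, str(exc)

html, err = fetch("not-a-valid-url")
print(html is None, bool(err))  # True True
```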

Our solution

Firstly, we removed the piece of code responsible for HTTP requests and replaced it with a separate module that solved all the issues above. Secondly, we created a set of rules for extracting content from the most frequently scraped websites. Adding a new website takes just a few clicks and doesn’t require a developer to be involved. As a result, Goose now receives ready-to-process HTML content along with a set of rules, if any. This has greatly improved both the extraction procedure and the quality of the content received.
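A per-domain rule registry of this kind can be sketched as a simple mapping from hostname to extraction settings. The keys and fields below are illustrative, not the actual schema:

```python
from urllib.parse import urlparse

# Hypothetical per-domain extraction rules, keyed by hostname.
DOMAIN_RULES = {
    "example.com": {"body_selector": "div.article-body", "strip_tags": ["aside"]},
    "news.example.org": {"body_selector": "main#story", "strip_tags": ["figure"]},
}

DEFAULT_RULES = {"body_selector": "body", "strip_tags": []}

def rules_for(url):
    """Return the extraction rules for a URL's domain, or the defaults."""
    host = urlparse(url).hostname or ""
    return DOMAIN_RULES.get(host, DEFAULT_RULES)

print(rules_for("https://example.com/post/1")["body_selector"])  # div.article-body
print(rules_for("https://unknown.site/x")["body_selector"])      # body
```

Keeping the rules in plain data like this is what makes it possible to add a new website without involving a developer.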

In conclusion

Despite the cons, we still believe Goose to be a good choice for scraping websites. It is not perfect, but it is arguably the best solution currently available in Python. We would also like to say a massive thank-you to Xavier Grangier for his great contribution to the open-source community.


JetRuby is a digital agency that doesn’t stop moving. We write about subjects ranging from mobile app development to disruptive technologies.