Thursday, January 24, 2013

Cloudify your browser testing (and scraping) with Sauce!

For quite some time now, along with our DEiXTo scraping software, we have been using Selenium, which is perhaps the best web browser automation tool currently available. It's really great and has helped us a lot in a variety of web data extraction cases (we published another post about it recently). We have tried it locally as well as on remote GNU/Linux servers, and we have written code for a couple of automated tests and scraping tasks. However, it was not that easy to set everything up and get things running; we came across various difficulties, ranging from installation problems to stability issues (e.g. sporadic timeout errors), although we were eventually able to overcome most of them.
    Wouldn't it be great, though, if there were a robust framework that provided you with the necessary infrastructure and all possible browser/OS combinations, and allowed you to run your Selenium tests in the cloud? You would not have to worry about setting things up, installing updates, managing machines, maintenance, etc. Well, there is! And it offers a whole lot more. Its name is Sauce Labs and it provides an amazing set of tools and features. Admittedly, they have done awesome work and they bring great products to software developers. Moreover, their team seems to share some great values: pursuit of excellence, innovation and an open source culture (among others).
    They offer a variety of pricing plans (a bit expensive in my opinion, though), while the free account includes 100 automated code minutes for Windows, Linux and Android, 40 automated code minutes for Mac and iOS, and 30 minutes of manual testing. And if you contribute to an open source project that needs testing support, the Open Sauce plan is just for you (unlimited minutes at no cost!). Please note that the Selenium project is sponsored by Sauce Labs.

    Being a Perl programmer, I could not resist signing up and writing some Perl code to run a test on their infrastructure! I was already familiar with the WWW::Selenium CPAN module, so it was quite easy and straightforward. It should be noted that they provide useful guidelines and various examples online for multiple languages, e.g. Python, Java, PHP and others. Overall my test script worked pretty well, but it was a bit slow compared to running the same code locally. However, one could improve speed by deploying lots of processes in parallel (if the use case is suitable) and by disabling video (the script's execution and browser activity are recorded for easier debugging). Furthermore, Sauce's big advantage is that it scales, which is especially suited to complex cases with heavy requirements.
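To give an idea of what such a remote test involves, here is a minimal sketch in Python (one of the languages Sauce documents alongside Perl). It only builds the hub URL and desired capabilities for a remote session; the username and access key are placeholders, and the browser/OS combination and the video-disabling flag shown are illustrative.

```python
# Sketch of a Sauce Labs remote session configuration.
# SAUCE_USERNAME / SAUCE_ACCESS_KEY are placeholders for real account credentials.
SAUCE_USERNAME = "your-username"
SAUCE_ACCESS_KEY = "your-access-key"

def sauce_hub_url(user, key):
    """Build the Sauce Labs Selenium hub URL, with credentials embedded."""
    return "http://%s:%s@ondemand.saucelabs.com:80/wd/hub" % (user, key)

# Desired browser/OS combination for the test, plus a flag to disable
# video recording, which can noticeably speed runs up.
capabilities = {
    "browserName": "firefox",
    "platform": "Windows 7",
    "name": "example scraping test",
    "recordVideo": False,
}

# With the Selenium bindings installed, the session would then be opened
# roughly like this:
#   from selenium import webdriver
#   driver = webdriver.Remote(sauce_hub_url(SAUCE_USERNAME, SAUCE_ACCESS_KEY),
#                             desired_capabilities=capabilities)
#   driver.get("http://example.com")
#   ...
#   driver.quit()
```

The same idea maps directly onto WWW::Selenium in Perl: point the client at the Sauce hub instead of a local server and pass the browser/OS combination you want.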
     The bottom line is that the Selenium-Sauce Labs pair is remarkable and can be very useful for a wide range of cases and purposes. Sauce in particular offers developers an exciting way to cloudify and manage their automated browser testing (although we personally focus more on the scraping capabilities that these tools provide). Combining them with DEiXTo extraction patterns could definitely be very fertile and open up interesting new possibilities. In conclusion, the uses and applications of web scraping are limitless and Selenium turns out to be a powerful tool in our quiver!

Sunday, January 13, 2013

Scraping PDF files

While puttering around on the Internet, I recently stumbled upon the website of the National Printing House ("Εθνικό Τυπογραφείο" in Greek) which is the public service responsible for the dissemination of Greek law. It publishes and distributes the government gazette and its website provides free access to all series of the Official Journal of the Hellenic Republic (ΦΕΚ).

    So, at its search page I noticed a section with the most popular issues. The most-viewed one, as of 13 Jan 2013, with 351,595 views, was this: ΦΕΚ A 226 - 27.10.2011. Out of curiosity I decided to download it to take a quick look and see what it was all about. It was available in PDF format and turned out to be an issue about the Economic Adjustment Programme for Greece, aiming to reduce its macroeconomic and fiscal imbalances. However, I was quite surprised to find that the text was contained in images: you could not perform any keyword search in it, nor could you copy-paste its textual content! Presumably the document's pages had been scanned and converted to digital images.
    This instantly brought to my mind once again the difficulties that PDF scraping involves. In our web scraping experience there are many cases where the data is "locked" in PDF files, e.g. in a .pdf brochure. Getting the data of interest out is not an easy task, but quite a few tools (pdftotext is one of them) have popped up over the years to ease the pain. One of the best tools I have encountered so far is Tesseract, a pretty accurate open source OCR engine currently maintained by Google.

    So, I thought it would be nice to put Tesseract into action and check its efficiency against the PDF document (one that so dramatically affects the lives of all Greeks...). It worked quite well, although not perfectly (probably because of the Greek language), and a few minutes later (after converting the PDF to a TIFF image through Ghostscript) I had the full text, or at least most of it, in my hands. The output text file generated can be found here. The truth is that I could not do much with it (and the austerity measures were harsh...), but at least I was happy to have extracted the largest part of the text.
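The PDF-to-TIFF-to-text pipeline above can be sketched as a small Python helper. This assumes Ghostscript and Tesseract are installed on the system; the filenames, the resolution, and the use of the Greek language pack ("ell") are illustrative, and the exact Ghostscript device you pick may vary with the document.

```python
# Sketch of the OCR pipeline: Ghostscript renders the PDF to a TIFF,
# then Tesseract OCRs the TIFF into plain text. Each function only
# builds the command line; running it is left to subprocess.
def gs_command(pdf_in, tiff_out, dpi=300):
    """Ghostscript command converting a PDF to a multi-page TIFF."""
    return ["gs", "-dNOPAUSE", "-dBATCH",
            "-sDEVICE=tiffg4",            # bilevel TIFF, a common choice for OCR input
            "-r%d" % dpi,                 # rendering resolution
            "-sOutputFile=%s" % tiff_out,
            pdf_in]

def tesseract_command(tiff_in, text_out_base, lang="ell"):
    """Tesseract command OCR-ing the TIFF into <text_out_base>.txt ("ell" = Greek)."""
    return ["tesseract", tiff_in, text_out_base, "-l", lang]

# To actually run the pipeline:
#   import subprocess
#   subprocess.check_call(gs_command("issue.pdf", "issue.tif"))
#   subprocess.check_call(tesseract_command("issue.tif", "issue"))
#   # -> OCR output lands in issue.txt
```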
    Of course this is just an example; there are numerous PDF files out there containing rich, inaccessible data that could potentially be processed and further utilised, e.g. to create a full-text search index. DEiXTo, our beloved web data extraction tool, can scrape only HTML pages; it cannot deal with PDF files residing on the Web. However, we do have the tools and the knowledge to parse those as well, find the bits of interest and unleash their value!

Friday, January 11, 2013

Selenium: a web browser automation companion for DEiXTo

Selenium is probably the best web browser automation tool we have come across so far. It is primarily intended for automated testing of web applications, but it's certainly not limited to that; it provides a suite of free software tools to automate web browsers across many platforms. The range of its use cases is really wide and its usefulness is hard to overstate.
    However, as scraping experts, we inevitably focus on using Selenium for web data extraction purposes. Its functionality-rich client API can be used to launch browser instances (e.g. Firefox processes) and simulate, through the proper commands, almost everything a user could do on a website or page. Thus, it allows you to deploy a fully-fledged web browser and overcome the difficulties that arise from heavy JavaScript/AJAX use. Moreover, via the X virtual framebuffer server (Xvfb), one can automate browsers without the need for an actual display and create scripts/services running periodically or on demand on a headless server, e.g. a remote GNU/Linux machine. Therefore, Selenium can successfully be used in combination with DEiXToBot, our beloved Mechanize-based scraping module.
    For example, the Selenium-automated browser could fetch a target page after a couple of steps (like clicking a button or hyperlink, selecting an item from a drop-down list, submitting a form, etc.) and then pass it to DEiXToBot (which lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens up new potential for DEiXTo wrappers.
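The "browser fetches, scraper parses" handoff described above can be sketched as follows, in Python for brevity (DEiXToBot itself is Perl). The `TitleExtractor` class is a toy stand-in for a DOM-based DEiXTo extraction pattern, and the commented-out driver calls show where a real Selenium browser would supply the rendered HTML.

```python
# Sketch of the Selenium -> scraper handoff: the browser deals with
# JavaScript and user-like navigation, then hands its rendered HTML
# to a DOM-based extraction step (stdlib HTMLParser used here as a
# stand-in for a DEiXTo tree pattern).
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Toy extraction pattern: collect the text of every <h2> element."""
    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())

def scrape_rendered_page(html):
    """Parse the already-rendered HTML handed over by the browser."""
    pattern = TitleExtractor()
    pattern.feed(html)
    return pattern.titles

# With a real browser in front, the handoff would look roughly like:
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("http://example.com/listing")
#   driver.find_element_by_link_text("Next").click()   # simulate a user step
#   titles = scrape_rendered_page(driver.page_source)  # scraper takes over
#   driver.quit()
```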
    The Selenium Server component (formerly the Selenium RC Server), as well as the client drivers that allow you to write scripts that interact with it, can be found here. We have used it quite a few times for various cases and the results were great. In conclusion, Selenium is an amazing "weapon" added to our arsenal and we strongly believe that, along with DEiXTo, it boosts our scraping capabilities. If you have an idea or project that involves web browser automation and/or web data extraction, we would be more than glad to hear from you!