deixto.com/blog: JavaScript

Friday, January 11, 2013

Selenium: a web browser automation companion for DEiXTo

Selenium is probably the best web browser automation tool we have come across so far. It is primarily intended for automated testing of web applications, but it is certainly not limited to that: it provides a suite of free software tools for automating web browsers across many platforms. The range of its use cases is really wide, which makes it useful well beyond testing.

However, as scraping experts, we inevitably focus on using Selenium for web data extraction. Its feature-rich client API can launch browser instances (e.g. Firefox processes) and simulate, through the proper commands, almost everything a user could do on a website or page. It thus lets you deploy a fully-fledged web browser and overcome the difficulties that arise from heavy JavaScript/AJAX use. Moreover, with the X virtual framebuffer server (Xvfb), one can automate browsers without an actual display and create scripts or services that run periodically or on demand on a headless server, e.g. a remote GNU/Linux machine. Therefore, Selenium can successfully be used in combination with DEiXToBot, our beloved Mechanize scraping module.

For example, the Selenium-automated browser could reach a target page after a couple of steps (like clicking a button or hyperlink, selecting an item from a drop-down list, submitting a form, etc.) and then pass it to DEiXToBot (which lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens new potential for DEiXTo wrappers.
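The hand-off described above could be sketched roughly as follows. This is only an illustration under assumptions: it uses Selenium's Python client bindings (one of several client drivers), and `save_rendered_page` and `run_with_firefox` are hypothetical helper names of ours, as are the URL and file name you would pass in. Only `get`, `page_source` and `quit` are calls from the WebDriver API itself.

```python
from pathlib import Path


def save_rendered_page(driver, url, out_path):
    """Load `url` in the automated browser and write the rendered HTML to disk.

    `driver` is assumed to be a Selenium WebDriver instance, or any object
    exposing `get(url)` and a `page_source` attribute.
    """
    driver.get(url)  # navigate the real browser to the page
    # Any extra steps (clicks, drop-down selections, form submissions)
    # would go here, before the HTML is captured.
    html = driver.page_source  # the page as the browser currently holds it
    out = Path(out_path)
    out.write_text(html, encoding="utf-8")
    return out


def run_with_firefox(url, out_path):
    """Hypothetical driver script; assumes `selenium` and Firefox are installed."""
    from selenium import webdriver  # imported lazily so the helper above has no hard dependency

    browser = webdriver.Firefox()
    try:
        return save_rendered_page(browser, url, out_path)
    finally:
        browser.quit()  # always close the browser process
```

The saved file is then an ordinary static HTML page, so a JavaScript-unaware scraper such as DEiXToBot can work on it. Keeping the capture step separate from the browser start-up also makes it easy to swap Firefox for another browser.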

The Selenium Server component (formerly the Selenium RC Server), as well as the client drivers that let you write scripts that interact with it, are available from the Selenium project website. We have used it quite a few times for various cases and the results were great. In conclusion, Selenium is an amazing "weapon" added to our arsenal, and we strongly believe that along with DEiXTo it boosts our scraping capabilities. If you have an idea or project that involves web browser automation and/or web data extraction, we would be more than glad to hear from you!

Saturday, September 17, 2011

DEiXToBot & Lack of JavaScript Support

Perhaps the major drawback of DEiXToBot (the Perl browser emulator object capable of executing GUI DEiXTo generated patterns) is the lack of JavaScript support, which derives from the fact that WWW::Mechanize does not execute JavaScript. In many cases, though, a solution is possible by figuring out what the JavaScript code is doing and simulating it in Perl. But in certain, more difficult cases that depend heavily on JavaScript, this is very hard, if not impossible, because essentially you cannot reach the actual HTML source code of interest.

However, a workaround that sometimes works is to download the target pages locally (after their JavaScript code has been executed) and then pass them to DEiXToBot for offline scraping.

Two remarkable tools for getting complex JavaScript-enabled pages for this purpose are:

- Selenium, an amazing web browser automation tool and

- spynner, a powerful web browsing module with Ajax support for Python

Please note that these two great tools also work fine on GNU/Linux, which is really important, especially for server use and scheduled, periodic execution of wrappers.

So, once a target page is stored locally on your disk, the DEiXToBot agent can easily get the page through the file:// scheme and extract bits of interest in the usual manner.
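As a small illustration of the file:// step, here is a sketch in Python using only the standard library. It is a stand-in, not DEiXToBot's own (Perl) API, and `local_page_uri` and `read_local_page` are hypothetical helper names. It shows how a saved page's path maps to a file:// URI and that the page can be fetched through that URI like any other.

```python
from pathlib import Path
from urllib.request import urlopen


def local_page_uri(path):
    """Return the file:// URI for a page saved on disk."""
    # resolve() makes the path absolute, which as_uri() requires
    return Path(path).resolve().as_uri()


def read_local_page(path):
    """Fetch a saved page through its file:// URI and return the raw bytes."""
    with urlopen(local_page_uri(path)) as response:
        return response.read()
```

The URI returned by `local_page_uri` is the kind of address you would hand to the scraping agent in place of the original http:// one.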