Saturday, September 17, 2011

DEiXToBot & Lack of JavaScript Support


Perhaps the biggest drawback of DEiXToBot (the Perl browser emulator object capable of executing patterns generated with GUI DEiXTo) is its lack of JavaScript support, which stems from the fact that WWW::Mechanize does not execute JavaScript. In many cases, though, a solution is possible by figuring out what the JavaScript code is doing and simulating it in Perl. But in certain, more difficult cases that depend heavily on JavaScript, this is very hard, if not impossible, because you essentially cannot reach the actual HTML source code of interest.

However, a workaround that sometimes works is to download the target pages locally (after their JavaScript code has executed) and then pass them to DEiXToBot for offline scraping.
Two remarkable tools for fetching complex JavaScript-enabled pages for this purpose are:
- Selenium, an amazing web browser automation tool and
- spynner, a powerful web browsing module with Ajax support for Python

Please note that these two great tools also work fine on GNU/Linux, which is really important, especially for server use and scheduled, periodic execution of wrappers.
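
As a rough illustration of the download step, here is a minimal sketch using Selenium's Python bindings. The function names (save_rendered_page, fetch_with_firefox) are hypothetical, and the sketch assumes that Firefox and the selenium package are installed. Note that page_source reflects the DOM at the moment it is read, so pages that load content asynchronously may need an explicit wait before saving.

```python
from pathlib import Path


def save_rendered_page(driver, url, out_path):
    """Load `url` in the given browser driver and write the rendered HTML to `out_path`.

    `driver` is expected to behave like a Selenium WebDriver, i.e. to offer
    a get() method and a page_source attribute.
    """
    driver.get(url)            # the real browser executes the page's JavaScript
    html = driver.page_source  # the DOM serialized as the browser currently sees it
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(html, encoding="utf-8")
    return out.resolve()


def fetch_with_firefox(url, out_path):
    """Hypothetical convenience wrapper: start Firefox, save the page, shut down."""
    # Imported here so that save_rendered_page has no hard dependency on selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return save_rendered_page(driver, url, out_path)
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling fetch_with_firefox("http://example.com/", "pages/example.html") would then leave a local copy of the rendered page that can be scraped offline.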

So, once a target page is stored locally on your disk, the DEiXToBot agent can easily fetch it through the file:// scheme and extract the bits of interest in the usual manner.
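
For completeness, here is a small sketch of how a saved page's path could be turned into a file:// URL and checked before handing it to the agent. It uses only the Python standard library; the helper names (local_file_url, read_via_file_scheme) are hypothetical and are not part of DEiXToBot.

```python
from pathlib import Path
from urllib.request import urlopen


def local_file_url(path):
    """Return an absolute file:// URL for a page saved on disk."""
    # as_uri() requires an absolute path, hence the resolve() call.
    return Path(path).resolve().as_uri()


def read_via_file_scheme(path):
    """Fetch the saved page through its file:// URL, as a quick sanity check."""
    with urlopen(local_file_url(path)) as response:
        return response.read().decode("utf-8")
```

The string returned by local_file_url is the kind of address that would be passed to the scraper in place of the original http:// URL.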