deixto.com/blog: August 2012

Saturday, August 18, 2012

PhantomJS & finding pizza using Yelp and DEiXTo!

Recently I stumbled upon PhantomJS, a headless WebKit browser which can serve a wide variety of purposes such as web browser automation, site scraping, website testing, SVG rendering and network monitoring. It's a very interesting tool and I am sure that it could successfully be used in combination with DEiXToBot which is our beloved powerful Mechanize scraper. For example, it could fetch a not-easy-to-reach (probably JavaScript-rich) target page (that WWW::Mechanize could not get due to its lack of JavaScript support) after completing some steps like clicking, selecting, checking, etc and then pass it to DEiXToBot to do the scraping job. This is particularly useful for complex scraping cases where in my humble opinion PhantomJS DOM manipulation support would just not be enough and DEiXTo extraction capabilities could come into play.

So, I was taking a look at the PhantomJS examples and I liked (among others) the one about finding pizza in Mountain View using Yelp (I really like pizza!). So, I thought it would be nice to port the example to DEiXToBot in order to demonstrate the latter's use and efficiency. Hence, I visually created a pretty simple and easy to build XML pattern with GUI DEiXTo for extracting the address field of each pizzeria returned (essentially equivalent to what PhantomJS does by getting the inner text of span.address items) and wrote a few lines of Perl code to execute the pattern on the target page and print the addresses extracted on the screen (either on a GNU/Linux terminal or a command prompt window on Windows).

The resulting script was simple like that:

use DEiXToBot;
my $agent = DEiXToBot->new();
$agent->get('http://www.yelp.com/search?find_desc=pizza&find_loc=94040&find_submit=Search');
die 'Unable to access network' unless $agent->success;
$agent->load_pattern('yelp_pizza.xml');
$agent->build_dom();
$agent->extract_content();
my @addresses;
for my $record (@{$agent->records}) {
push @addresses, $$record[0];
}
print join("\n",@addresses);

Just note that it scrapes only the first results page (just like in the PhantomJS example). We could easily parse through all the pages by following the "Next" page link but this is out of scope.

I would like to further look into PhantomJS and check the potential of using it (along with DEiXTo) as a pre-scraping step for hard JavaScript-enabled pages. In any case, PhantomJS is a handy tool that can be quite useful for a wide range of use cases. Generally speaking, web scraping can have countless applications and uses and there are many remarkable tools out there. One of the best we believe is DEiXTo, so check it out! DEiXTo has helped quite a few people get their web data extraction tasks done easily and free!