Saturday, February 2, 2013

Digital Preservation and ArchiveReady


Although our blog's main focus is scraping data from web information sources (especially via DEiXTo), we are also very interested in services and applications that can be built on top of agents and crawlers. Our favorite tools for programmatic web browsing are WWW::Mechanize and Selenium. The former is a handy Perl module (though it lacks JavaScript support), whereas the latter is a great browser automation tool that we have been using more and more lately in a variety of cases. With them we can simulate whatever a user can do in a browser window and automate the interaction with pages of interest.
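
To give a flavour of the former, here is a minimal WWW::Mechanize sketch (against a hypothetical example.com URL) that fetches a static page and lists its links; when a page relies on JavaScript, this is exactly where we switch to Selenium:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/');   # fetch a (static) page
print $mech->title(), "\n";              # print its <title>
for my $link ($mech->links()) {          # iterate over the links found on the page
    print $link->url_abs(), "\n";
}
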
Traversing a website is one of the most basic and common tasks for a developer of web robots. However, the methodologies used and the mechanisms deployed can vary a lot. So, we tried to think of a meaningful crawler-based scenario that blends several "tasty" ingredients into a nice story. Hopefully we succeeded, and this post rests on four major pillars that we would like to highlight and discuss further below:
  • Crawling (through Selenium)
  • Sitemaps
  • Archivability
  • Scraping (in the demo that follows we download reports from a target website)
An interesting topic we recently stumbled upon is digital preservation, which can be viewed as the series of policies and strategies necessary to ensure continued access to digital content over time, despite the challenges of media failure and technological change. In this context, we discovered ArchiveReady, a remarkable web application that checks whether a website is easily archivable. This means that it scans a page and checks whether it is suitable for web archiving projects (such as the Internet Archive and BlogForever) to access and preserve. However, you can only pass one web page at a time to its checker (not an entire website), and it may take a while to complete depending on the complexity and size of the page. Therefore, for those interested in testing multiple pages, we thought it would be useful to write a small script that parses the XML sitemap of a target site, checks each of the URLs it contains against the ArchiveReady service, and downloads the results as it goes.
Sitemaps, as you probably already know, are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML file that lists the URLs of a site along with some additional metadata about each URL, so that search engines can crawl the site more intelligently. Typically, sitemaps are auto-generated by CMS plugins rather than written by hand.
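
To give an idea of the format, a minimal sitemap following the sitemaps.org protocol looks something like this (the URL and metadata values are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2013-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Our script only cares about the loc elements, which hold the actual page URLs.
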
OK, enough with the talking. Let's get to work and write some code! The Perl modules we utilised for our purposes were WWW::Selenium and XML::LibXML. The steps we had to carry out were the following:
  • launch a Firefox instance 
  • read the sitemap document of a sample target website (we chose openarchives.gr)
  • pass each of its URLs to the ArchiveReady validation engine and finally
  • download the results locally in Evaluation and Report Language (EARL) format, since ArchiveReady offers this option
So, here is the code (just note that we wait till the validation page contains 8 "Checking complete" messages, one for each section, to determine whether the processing has finished):

use strict;
use warnings;
use WWW::Selenium;
use XML::LibXML;

# parse the sitemap and collect the page URLs it lists
my $parser = XML::LibXML->new();
my $dom = $parser->parse_file('http://openarchives.gr/sitemap.xml');
my @loc_elms = $dom->getElementsByTagName('loc');
my @urls;
for my $loc (@loc_elms) {
    push @urls, $loc->textContent;
}
# launch a Firefox instance (a Selenium server must be running on localhost:4444)
my $sel = WWW::Selenium->new( host        => "localhost",
                              port        => 4444,
                              browser     => "*firefox",
                              browser_url => "http://archiveready.com/"
                            );
$sel->start;
# check each of the pages contained in the sitemap
for my $u (@urls) {
    $sel->open("http://archiveready.com/check?url=$u");
    my $content = $sel->get_html_source();
    # wait until all 8 sections report "Checking complete"
    while ( (() = $content =~ /Checking complete/g) != 8 ) {
        sleep(1);
        $content = $sel->get_html_source();
    }
    # capture the identifier of the current validation test
    # (the & may be serialized as &amp; in the page source, so allow both)
    next unless $content =~ m#href="/download-results\?test_id=(\d+)&(?:amp;)?format=earl"#;
    my $id = $1;
    $sel->click('xpath=//a[contains(@href,"earl")]'); # click on the EARL link
    $sel->wait_for_page_to_load(5000);
    eval { $content = $sel->get_html_source(); };
    # write the EARL report to a file
    open my $fh, ">:utf8", "download_results_$id.xml"
        or die "Cannot write download_results_$id.xml: $!";
    print $fh $content;
    close $fh;
}
$sel->stop;
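
Keep in mind that WWW::Selenium is just a client; it talks to a Selenium Remote Control server, so the script assumes one is already listening on localhost at port 4444 (typically started beforehand with something like java -jar selenium-server-standalone-<version>.jar).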

We hope you found the above helpful and the scenario described interesting. We tried to take advantage of software agents/crawlers and use them creatively in combination with ArchiveReady, an innovative service that helps you strengthen your website's archivability. Finally, scraping and automated browsing have a remarkably wide range of uses and applications. Please check out DEiXTo, our feature-rich web data extraction tool, and don't hesitate to contact us! Maybe we can help you with your tedious and time-consuming web tasks and data needs!