
Tuesday, December 10, 2013

Web archiving and Heritrix

A topic that has gained increasing attention lately is web archiving. In an older post we started talking about it and cited a remarkable online tool named ArchiveReady that checks whether a web page is easily archivable. Perhaps the best-known web archiving project at the moment is the Internet Archive, a non-profit organization aiming to build a permanently and freely accessible Internet library. Its Wayback Machine, a digital archive of the World Wide Web, is really interesting: it enables users to "travel" across time and visit archived versions of web pages.


    As web scraping aficionados, we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is Heritrix, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates WARC (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
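
    For illustration, a single WARC response record looks roughly like the sketch below (the header values are made up for the example, not taken from an actual crawl):

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.example.org/
WARC-Date: 2013-12-10T12:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: application/http; msgtype=response
Content-Length: 1234

HTTP/1.1 200 OK
Content-Type: text/html

<html> ... the archived page payload ... </html>
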
    Heritrix offers a basic web-based user interface (admin console) for managing your crawls, as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases, but overall it left us with the impression of a somewhat dated tool.


    In our humble opinion (and someone please correct us if we are wrong), the two main drawbacks of Heritrix are: a) lack of distributed crawling support and b) lack of JavaScript/AJAX support. The first one means that if you would like to scan a really big source of data, for example the great Digital Public Library of America (DPLA) with more than 5 million items/pages, then Heritrix would take a lot of time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space was assigned to each of them, it still wouldn't be an optimal solution. From our point of view it would be much better and faster if several cooperating agents running on multiple servers could actually collaborate to complete the task. Therefore, scaling and time issues arise when the number of pages grows very large.
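
    To illustrate what we mean by splitting the URL space among cooperating agents, here is a minimal Perl sketch (the number of agents and the hash-based assignment are our own simplistic assumptions, not something Heritrix offers out of the box):

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $agents = 4;                 # hypothetical number of cooperating crawl agents
my @queues = map { [] } 1 .. $agents;

# assign each URL to an agent by hashing its host, so that all pages
# of the same site end up in the queue of the same agent
for my $url (@ARGV) {
    my ($host) = $url =~ m{^https?://([^/]+)}i;
    $host = $url unless defined $host;
    my $agent = hex(substr(md5_hex($host), 0, 8)) % $agents;
    push @{ $queues[$agent] }, $url;
}

for my $i (0 .. $#queues) {
    print "agent $i would crawl: @{ $queues[$i] }\n";
}
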


    The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides just basic browser functionality and does not include a fully-fledged web browser. Therefore, it is not able to efficiently archive pages that use JavaScript/AJAX to populate parts of the page, and consequently it cannot properly capture social media content.
    We think that both of these issues could be overcome using a cloud, Selenium-based architecture like Sauce Labs (although the cost of an Enterprise plan is a matter that should be considered). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, such as Firefox, Chrome or Safari. We have already covered Selenium in previous posts and it is absolutely a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest Web 2.0 developments. What's your opinion?
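
    To give a rough idea, driving a remote, cloud-hosted browser from Perl could look something like the snippet below (the host and URLs are placeholders; a service like Sauce Labs would additionally require credentials and its own endpoint):

use strict;
use warnings;
use WWW::Selenium;

# connect to a remote Selenium server instead of a local one;
# host and port below are placeholders for a cloud-hosted grid
my $sel = WWW::Selenium->new( host        => "ondemand.example.com",
                              port        => 4444,
                              browser     => "*firefox",
                              browser_url => "http://www.example.org/"
                            );
$sel->start;
$sel->open("http://www.example.org/some-ajax-page");
$sel->wait_for_page_to_load(10000);
my $rendered_html = $sel->get_html_source(); # the DOM after JavaScript has run
# ... store $rendered_html (or wrap it in a WARC record) in your archive ...
$sel->stop;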

Saturday, February 2, 2013

Digital Preservation and ArchiveReady


Although our blog's main focus is scraping data from web information sources (especially via DEiXTo), we are also very interested in services and applications that can be built on top of agents and crawlers. Our favorite tools for programmatic web browsing are WWW::Mechanize and Selenium. The first one is a handy Perl module (which lacks JavaScript support though), whereas the latter is a great browser automation tool that we have been using more and more lately in a variety of cases. Through them we can simulate whatever a user can do in a browser window and automate the interaction with pages of interest.
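
    For instance, fetching a page and following a link with WWW::Mechanize takes only a few lines (the URL and the link pattern below are just placeholders):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get("http://www.example.org/");            # fetch the start page
$mech->follow_link( text_regex => qr/archive/i ); # follow the first matching link
print $mech->title, "\n";                         # title of the page we landed on
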
   Traversing a website is one of the most basic and common tasks for a developer of web robots. However, the methodologies used and the mechanisms deployed can vary a lot. So, we tried to think of a meaningful crawler-based scenario that would blend various "tasty" ingredients into a nice story. Hopefully we succeeded, and our post rests on four major pillars that we would like to highlight and discuss below:
  • Crawling (through Selenium)
  • Sitemaps
  • Archivability
  • Scraping (in the demo that follows we download reports from a target website)
    An interesting topic we recently stumbled upon is digital preservation, which can be viewed as a series of policies and strategies necessary to ensure continued access to digital content over time, regardless of the challenges of media failure and technological change. In this context, we discovered ArchiveReady, a remarkable web application that checks whether a website is easily archivable. This means that it scans a page and checks whether it is suitable for web archiving projects (such as the Internet Archive and BlogForever) to access and preserve. However, you can only pass one web page at a time to its checker (not an entire website), and it might take some time to complete depending on the complexity and size of the page. Therefore, we thought it would be useful for anyone who wants to test multiple pages if we wrote a small script that parses the XML sitemap of a target site, checks each of the URLs contained in it against the ArchiveReady service, and downloads the results.
    Sitemaps, as you probably already know, are an easy way for webmasters to inform search engines about the pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML file that lists the URLs of a site along with some additional metadata about each URL, so that search engines can crawl the site more intelligently. Typically, sitemaps are auto-generated by CMS plugins.
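
    A minimal sitemap looks roughly like this (the URL and the metadata values are, of course, just an illustration):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.org/</loc>
    <lastmod>2013-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
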
   OK, enough with the talking. Let's get to work and write some code! The Perl modules we utilised for our purposes were WWW::Selenium and XML::LibXML. The steps we had to carry out were the following:
  • launch a Firefox instance 
  • read the sitemap document of a sample target website (we chose openarchives.gr)
  • pass each of its URLs to the ArchiveReady validation engine and finally
  • download locally the results returned in Evaluation and Report Language (EARL) format, since ArchiveReady offers this option
So, here is the code (just note that we wait till the validation page contains 8 "Checking complete" messages, one for each section, to determine whether the processing has finished):

use strict;
use warnings;
use WWW::Selenium;
use XML::LibXML;

# fetch and parse the sitemap of the target site (libxml2 can read it over HTTP)
my $parser = XML::LibXML->new();
my $dom = $parser->parse_file('http://openarchives.gr/sitemap.xml');
my @urls;
for my $loc ($dom->getElementsByTagName('loc')) {
    push @urls, $loc->textContent;
}
# launch a Firefox instance through the local Selenium server
my $sel = WWW::Selenium->new( host        => "localhost",
                              port        => 4444,
                              browser     => "*firefox",
                              browser_url => "http://archiveready.com/"
                            );
$sel->start;
# check each page contained in the sitemap against ArchiveReady
for my $u (@urls) {
    $sel->open("http://archiveready.com/check?url=$u");
    my $content = $sel->get_html_source();
    while ( (() = $content =~ /Checking complete/g) != 8 ) { # wait until all 8 sections are done
        sleep(1);
        $content = $sel->get_html_source();
    }
    # capture the identifier of the current validation test
    my ($id) = $content =~ m#href="/download-results\?test_id=(\d+)&format=earl"#;
    next unless defined $id;
    $sel->click('xpath=//a[contains(@href,"earl")]'); # click on the EARL link
    $sel->wait_for_page_to_load(5000);
    eval { $content = $sel->get_html_source(); };
    # write the EARL report to a file
    open my $fh, ">:utf8", "download_results_$id.xml"
        or die "Cannot open download_results_$id.xml: $!";
    print $fh $content;
    close $fh;
}
$sel->stop;

    We hope you found the above helpful and the scenario described interesting. We tried to take advantage of software agents/crawlers and use them creatively in combination with ArchiveReady, an innovative service that helps you strengthen your website's archivability. Finally, scraping and automated browsing have an extremely wide range of uses and applications. Please check out DEiXTo, our feature-rich web data extraction tool, and don't hesitate to contact us! Maybe we can help you with your tedious and time-consuming web tasks and data needs!