
Tuesday, December 10, 2013

Web archiving and Heritrix

A topic that has been gaining attention lately is web archiving. In an older post we started talking about it and cited a remarkable online tool named ArchiveReady that checks whether a web page is easily archivable. Perhaps the best-known web archiving project currently is the Internet Archive, a non-profit organization aiming to build a permanently and freely accessible Internet library. Their Wayback Machine, a digital archive of the World Wide Web, is really interesting: it enables users to "travel" back in time and visit archived versions of web pages.


    As web scraping aficionados, we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is Heritrix, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates WARC (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
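    To give a rough idea of the format, a single WARC record is a block of plain-text headers followed by the captured payload; a simplified, hand-written sample (not actual Heritrix output) might look like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.example.com/
WARC-Date: 2013-12-10T12:00:00Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: application/http; msgtype=response
Content-Length: 1234

[the raw HTTP response headers and body follow here]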
    Heritrix offers a basic web-based user interface (admin console) for managing your crawls, as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases, but overall it left us with a sense that it is starting to show its age.


    In our humble opinion (and someone please correct us if we are wrong) the two main drawbacks of Heritrix are: a) lack of distributed crawling support and b) lack of JavaScript/AJAX support. The first one means that if you would like to scan a really big source of data, for example the great Digital Public Library of America (DPLA) with more than 5 million items/pages, then Heritrix would take a lot of time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space was assigned to each of them, it still wouldn't be an optimal solution. From our point of view it would be much better and faster if several cooperating agents on multiple different servers could collaborate to complete the task. Therefore, scaling and time issues arise when the number of pages grows very large.


    The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides just basic browser functionality and does not include a fully-fledged web browser. Therefore, it's not able to efficiently archive pages that use JavaScript/AJAX to populate parts of the page and, consequently, it cannot properly capture social media content.
    We think that both of these issues could be overcome using a cloud, Selenium-based architecture like Sauce Labs (although the cost of an Enterprise plan is a matter that should be considered). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, like Firefox, Chrome or Safari. We have already covered Selenium in previous posts and it is absolutely a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest Web 2.0 developments. What's your opinion?
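    For the more technically inclined, here is a minimal sketch of the idea in Perl, using the Selenium::Remote::Driver CPAN module to drive a real browser through a remote Selenium server and grab the fully rendered HTML; the server address and target URL below are just placeholders:

use strict;
use warnings;
use Selenium::Remote::Driver;

# Connect to a remote Selenium server (your own grid or a cloud provider)
my $driver = Selenium::Remote::Driver->new(
    remote_server_addr => 'selenium.example.com',   # placeholder address
    port               => 4444,
    browser_name       => 'firefox',
);

# Let the real browser fetch and render the JavaScript-heavy page
$driver->get('http://www.example.com/ajax-heavy-page');

# The rendered DOM, ready to be archived or handed to a scraper
my $html = $driver->get_page_source();

$driver->quit();

    The rendered HTML could then be written out to an archive or passed along for further processing, and several such agents could run in parallel against different parts of the target URL space.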

Saturday, August 18, 2012

PhantomJS & finding pizza using Yelp and DEiXTo!

Recently I stumbled upon PhantomJS, a headless WebKit browser which can serve a wide variety of purposes such as web browser automation, site scraping, website testing, SVG rendering and network monitoring. It's a very interesting tool and I am sure that it could successfully be used in combination with DEiXToBot, our beloved, powerful Mechanize-based scraper. For example, it could fetch a not-easy-to-reach (probably JavaScript-rich) target page (that WWW::Mechanize could not get due to its lack of JavaScript support) after completing some steps like clicking, selecting, checking, etc., and then pass it to DEiXToBot to do the scraping job. This is particularly useful for complex scraping cases where, in my humble opinion, PhantomJS's DOM manipulation support would just not be enough and DEiXTo's extraction capabilities could come into play.
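Just to illustrate the idea, here is a rough sketch of such a pipeline in Perl. Note that render_page.js is a hypothetical PhantomJS script (not shown here) that would save the fully rendered page to a local HTML file, and I am assuming that DEiXToBot, being a Mechanize-based agent, can load that file through a file:// URL:

use strict;
use warnings;
use Cwd;
use DEiXToBot;

# Step 1: let PhantomJS fetch and render the JavaScript-rich page
# (render_page.js is a hypothetical helper script that writes page.content to a file)
system('phantomjs', 'render_page.js',
       'http://www.example.com/js-heavy-page', 'rendered.html') == 0
    or die 'PhantomJS failed to render the page';

# Step 2: hand the rendered HTML over to DEiXToBot for the actual extraction
my $agent = DEiXToBot->new();
$agent->get('file://' . getcwd() . '/rendered.html');
die 'Unable to load the rendered page' unless $agent->success;
$agent->load_pattern('some_pattern.xml');
$agent->build_dom();
$agent->extract_content();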
So, I was taking a look at the PhantomJS examples and I liked (among others) the one about finding pizza in Mountain View using Yelp (I really like pizza!). I thought it would be nice to port the example to DEiXToBot in order to demonstrate the latter's use and efficiency. Hence, I visually created a pretty simple and easy-to-build XML pattern with GUI DEiXTo for extracting the address field of each pizzeria returned (essentially equivalent to what PhantomJS does by getting the inner text of the span.address items) and wrote a few lines of Perl code to execute the pattern on the target page and print the extracted addresses on the screen (either on a GNU/Linux terminal or a command prompt window on Windows).
The resulting script is as simple as this:
use strict;
use warnings;
use DEiXToBot;

# Create a new DEiXToBot (Mechanize-based) agent
my $agent = DEiXToBot->new();

# Fetch the first page of Yelp results for pizza in the 94040 zip code
$agent->get('http://www.yelp.com/search?find_desc=pizza&find_loc=94040&find_submit=Search');
die 'Unable to access network' unless $agent->success;

# Load the extraction pattern built with GUI DEiXTo, build the DOM tree and apply the rule
$agent->load_pattern('yelp_pizza.xml');
$agent->build_dom();
$agent->extract_content();

# Each record holds the address of a pizzeria; collect and print them
my @addresses;
for my $record (@{$agent->records}) {
    push @addresses, $$record[0];
}
print join("\n", @addresses), "\n";

Just note that it scrapes only the first results page (just like the PhantomJS example does). We could easily iterate through all the result pages by following the "Next" page link, but this is out of the scope of this post; a rough sketch is given below anyway.
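The sketch assumes that DEiXToBot, being Mechanize-based, exposes WWW::Mechanize's find_link method and that records holds only the matches of the current extraction; your mileage may vary:

use strict;
use warnings;
use DEiXToBot;

my $agent = DEiXToBot->new();
$agent->get('http://www.yelp.com/search?find_desc=pizza&find_loc=94040&find_submit=Search');
die 'Unable to access network' unless $agent->success;

my @addresses;
while (1) {
    # Extract the addresses from the current results page
    $agent->load_pattern('yelp_pizza.xml');
    $agent->build_dom();
    $agent->extract_content();
    push @addresses, $$_[0] for @{$agent->records};

    # Move on to the next results page, if there is one
    my $next = $agent->find_link(text_regex => qr/Next/i);
    last unless $next;
    $agent->get($next->url());
}
print join("\n", @addresses), "\n";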

I would like to further look into PhantomJS and explore the potential of using it (along with DEiXTo) as a pre-scraping step for hard, JavaScript-heavy pages. In any case, PhantomJS is a handy tool that can be quite useful for a wide range of use cases. Generally speaking, web scraping can have countless applications and there are many remarkable tools out there. One of the best, we believe, is DEiXTo, so check it out! DEiXTo has helped quite a few people get their web data extraction tasks done easily and for free!

Monday, January 2, 2012

Cooperating DEiXTo agents

Basically, there are two broad categories of cooperating DEiXTo wrappers. In the first one, the wrappers are executed and applied on the same, single page so as to capture bits of interest that are scattered all over this particular target page. The second category comprises cases where the output of one wrapper serves as input for a second one. For the latter, the output of the first wrapper is typically a txt file containing the target URLs leading to pages with detailed information.
    The first category is not supported directly by the GUI tool. However, DEiXToBot (a Mechanize agent object capable of executing extraction rules previously built with the GUI tool) allows the combination of multiple extraction rules/patterns on the same page, and of their results, through Perl code. So, if you have come across a complex, data-rich page and you are fluent with Perl and DEiXToBot's interface, you can build the necessary tree patterns separately with the GUI tool and then write a highly efficient set of cooperating Perl robots aiming at capturing all the desired data; a rough sketch follows below. It is not easy though, since it requires programming skills and custom code.
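    As an illustration, suppose a data-rich page contains both titles and prices and that we have built two separate patterns for them with the GUI tool (titles.xml and prices.xml, hypothetical names). Assuming load_pattern can simply be called again on the same agent for each pattern (the exact DEiXToBot interface may differ), the combination could look something like this:

use strict;
use warnings;
use DEiXToBot;

my $agent = DEiXToBot->new();
$agent->get('http://www.example.com/data-rich-page');   # placeholder URL
die 'Unable to access network' unless $agent->success;

# Apply the first pattern on the page (e.g. one capturing titles)
$agent->load_pattern('titles.xml');
$agent->build_dom();
$agent->extract_content();
my @titles = map { $$_[0] } @{$agent->records};

# Apply the second pattern on the very same page (e.g. one capturing prices)
$agent->load_pattern('prices.xml');
$agent->build_dom();
$agent->extract_content();
my @prices = map { $$_[0] } @{$agent->records};

# Combine the two result sets however suits your task
print "$titles[$_]\t$prices[$_]\n" for 0 .. $#titles;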
    As far as the second type of collaboration is concerned, we have stumbled upon numerous cases where a first wrapper collects the detailed target URLs from listing pages and passes them to a second wrapper which in turn takes over and gathers all data of interest from the pages containing the full text/description. A typical case would be a blog or a news site or an e-shop, where a first agent could scrape the URLs of the detailed pages and a second one would visit each one of them extracting every single piece of desired information. If you wonder how you can set a DEiXTo wrapper to visit multiple target pages, this can be done either through a text file containing their addresses or via a list. Both ways can be specified in the Project Info tab of the DEiXTo GUI tool.
    Moreover, for the first wrapper, which is intended to scrape the URLs, you only have to create a pattern that locates the links towards the detailed pages. Usually this is easy and straightforward. You should just point at a representative link, use it as a record instance and set the A rule node as "checked" (right click on the A node and select "Match and Extract Content"). The resulting pattern will be a simple tree of rule nodes with the A node checked for extraction.
    Then, by executing the rule, you can extract the "href" attribute (essentially the URI) of each matching link and export the results to a txt file, say target_urls.txt, which subsequently will be fed to the next wrapper. Please note that if you provide just a standalone A rule node as the whole pattern, you will capture ALL the hyperlinks found on the page, but we guess you don't want that (we want only those leading to the detailed pages).
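    For the second wrapper, a minimal sketch in Perl could be along these lines, assuming target_urls.txt holds one URL per line and details_pattern.xml (a hypothetical name) is the pattern built for the detailed pages:

use strict;
use warnings;
use DEiXToBot;

# Read the URLs gathered by the first wrapper
open my $fh, '<', 'target_urls.txt' or die "Cannot open target_urls.txt: $!";
chomp(my @urls = <$fh>);
close $fh;

my $agent = DEiXToBot->new();
for my $url (@urls) {
    # Visit each detailed page and apply the pattern built for it
    $agent->get($url);
    next unless $agent->success;
    $agent->load_pattern('details_pattern.xml');
    $agent->build_dom();
    $agent->extract_content();

    # Print every extracted record as a tab-separated line
    for my $record (@{$agent->records}) {
        print join("\t", @$record), "\n";
    }
}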
    In conclusion, DEiXTo can power schemes of cooperative robots and achieve very high precision. Especially for more advanced cases, synergies of multiple wrappers are often needed. Their coordination, though, usually requires some careful thought and effort. Should you have any questions, please do not hesitate to contact us!

Wednesday, December 28, 2011

Robots.txt & access restrictions

A really serious matter that many people often ignore (deliberately or not) is the access and copyright restrictions that several website owners/administrators impose. A lot of websites want robots entirely out. The method they use to keep cooperating web robots away from certain site content is a robots.txt file that resides in their root directory and functions as a request that visiting bots ignore specified files or directories.
For example, the following two lines indicate that robots should not visit any page on the site:
User-agent: *
Disallow: /
    However, a large number of scraping agents violate the restrictions set by vendors and content providers. This is a very important issue and it raises significant legal concerns. Undoubtedly, there has been an ongoing, raging war between bots and websites with strict terms of use. The latter deploy various technical measures to stop robots (an excellent white paper about detecting and blocking site scraping attacks is here) and sometimes even take legal action and resort to the courts. There have been many cases over the last few years, with contradictory decisions, so the whole issue remains quite unclear. You can read more about it in the relevant section of the "Web scraping" Wikipedia article. Both sides have their arguments, so it's not at all an easy verdict.
    The DEiXTo command line executor by default respects the robots.txt file of potential target websites (through the use of the WWW::RobotRules Perl module). Nevertheless, you can override this configuration (at your own risk!) by setting the -nice parameter to 0. It is strongly recommended though that you comply with webmasters' requests and keep out of pages that have access restrictions.
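    For the curious, checking a URL against a site's robots.txt in Perl is straightforward with WWW::RobotRules; a minimal example (with a placeholder robot name and URLs) would be:

use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);

# The robot name under which the rules are evaluated (placeholder)
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Fetch and parse the site's robots.txt
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Check whether a specific page may be visited
my $page = 'http://www.example.com/some/page.html';
print $rules->allowed($page)
    ? "Allowed to fetch $page\n"
    : "Disallowed by robots.txt\n";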
    Generally speaking, data copyright is a HUGE issue, especially in today's Web 2.0 era, and has sparked endless discussions and spawned numerous articles, opinions, licenses, disputes and legitimacy issues.
    By the way, it is worth mentioning that currently there is a strong movement in favor of openness in data, standards and software; according to many, openness fosters innovation and promotes transparency and collaboration.
    Finally, we would like to suggest that everyone using web data extraction tools comply with the terms of use that websites set and think twice before deploying a scraper, especially if the data is going to be used for commercial purposes. A good practice is to contact the webmaster and ask for permission to access and use their content. Quite often the website might be interested in such a cooperation, mostly for marketing and advertising reasons. So, as soon as you get a "green light", start building your scraper with DEiXTo and we are here to help you!