Wednesday, December 28, 2011

Robots.txt & access restrictions

A really serious matter that many people ignore (deliberately or not) is the access and copyright restrictions that many website owners/administrators impose. A lot of websites want robots out entirely. The method they use to keep cooperating web robots away from certain site content is a robots.txt file, which resides in their root directory and functions as a request that visiting bots ignore specified files or directories.
For example, the following two lines indicate that robots should not visit any page on the site:
User-agent: *
Disallow: /
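More selective policies are also quite common; for instance (the directory name below is just an illustration), a site might keep all bots out of a single private area while leaving the rest of its pages open:
User-agent: *
Disallow: /private/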
    However, a large number of scraping agents violate the restrictions set by vendors and content providers. This is a very important issue and it raises significant legal concerns. Undoubtedly, there has been an ongoing war between bots and websites with strict terms of use. The latter deploy various technical measures to stop robots (an excellent white paper about detecting and blocking site scraping attacks is here) and sometimes even take legal action and resort to the courts. There have been many cases over the last few years, with contradictory decisions, so the whole issue is quite unclear. You can read more about it in the relevant section of the "Web scraping" Wikipedia article. Both sides have their arguments, so it's not at all an easy verdict.
    The DEiXTo command line executor by default respects the robots.txt file of potential target websites (through the use of the WWW::RobotRules Perl module). Nevertheless, you can override this configuration (at your own risk!) by setting the -nice parameter to 0. It is strongly recommended though that you comply with webmasters' requests and keep out of pages that have access restrictions.
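    For those curious about what this looks like under the hood, here is a rough sketch of how the WWW::RobotRules module can be used to decide whether a page may be visited (the agent name and the URLs below are just placeholders):
use WWW::RobotRules;
use LWP::Simple qw(get);

# Identify the crawler by a user agent name (placeholder)
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Fetch and parse the site's robots.txt (placeholder URL)
my $robots_url = 'http://example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Check a target page against the parsed rules before visiting it
if ($rules->allowed('http://example.com/some/page.html')) {
    print "robots.txt allows us to visit the page\n";
} else {
    print "robots.txt asks us to keep out\n";
}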
    Generally speaking, data copyright is a HUGE issue, especially in today's Web 2.0 era, and has sparked endless discussions and spawned numerous articles, opinions, licenses, disputes and legitimacy issues.
    By the way, it is worth mentioning that currently there is a strong movement in favor of openness in data, standards and software. And according to many, openness fosters innovation and promotes transparency and collaboration.
    Finally, we would like to suggest to everyone using web data extraction tools to comply with the terms of use that websites set and to think twice before deploying a scraper, especially if the data is going to be used for commercial purposes. A good practice is to contact the webmaster and ask for permission to access and use their content. Quite often the website might be interested in such a cooperation, mostly for marketing and advertising reasons. So, as soon as you get a "green light", start building your scraper with DEiXTo and we are here to help you!

Sunday, December 25, 2011

Downloading images with DEiXTo and wget

Many people often download pictures and photos from various websites of interest. Sometimes though, the number of images that someone wants to download from certain pages is large. So large that doing it manually is almost prohibitive. Therefore, an automation tool is often needed to save users time and repetitive effort. Of course, towards this goal, DEiXTo can help.
    Let's suppose that you want to get all images from a specific web page (with respect to terms of use). You can easily build a simple extraction rule by pointing at an image, using it as a record instance and setting the IMG rule node as "checked" (right click on the IMG node and select "Match and Extract Content"). The resulting pattern will be like this:
    Then, by executing the rule you can extract the "src" attribute (essentially the URI) of each image found on the page and export the results to a txt file, let's say image_urls.txt. Finally, you can use GNU Wget (a great free command line tool) in order to retrieve the files. You can download a Windows (win32) version of wget here. For example, on Windows you can then just open a DOS command prompt window, change the current working directory to the folder where wget is stored (via the 'cd' command) and enter:
wget.exe -i image_urls.txt 
where image_urls.txt is the file containing the URIs of images. And voilà! The wget utility will download all the images of the target page for you!
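    By the way, wget offers some handy options for such batch downloads. For example, the following variation (the folder name is arbitrary) saves the images into a separate directory and pauses for a couple of seconds between requests, so as to be gentle with the target server:
wget.exe -i image_urls.txt -P starwars_images --wait=2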
    What about getting images from multiple pages? You will have to explicitly provide the target URLs either through an input txt file or via a list. Both ways can be specified in the Project Info tab of the DEiXTo GUI tool.
    Thus, if you have the target URLs at hand or you can extract them with another wrapper (generating a txt file), then  you can just pass them as input to the new image wrapper and the latter will do the laborious work for you.
    In case all the above are a bit unclear, we have built a sample wrapper project file (imdb_starwars.wpf) that downloads all Star Wars (1977) thumbnail photos from the corresponding IMDb page. Please note that we set the agent to follow the Next page link so as to gather all thumbnails, since they are scattered across multiple pages. However, if you would like to get the large photos, you will have to add another scraping layer for extracting the links of the pages containing the full-size pictures.
    Anyway, in order to run the sample wrapper for the thumbnails, you should open the wpf (through the Open button in the Project Info tab) and then press the "Go!" button. Alternatively, you can use the command line executor at a DOS prompt instead:
deixto_executor.exe imdb_starwars.wpf
Finally, you will have to pass the image_urls.txt output file to wget in order to download all thumbnails and get the job done! May the Force be with you! :)

Thursday, December 22, 2011

Can DEiXTo power mobile apps? Yes, it can!

Web content scraped with DEiXTo can be presented in a wide variety of formats. However, the most common choice is probably XML since it facilitates heavy post-processing and further transformations so as to make the data suit your needs. A potentially interesting scenario would be to output bits of interest from a target website into an XML file and then transform it to HTML through XSLT (Extensible Stylesheet Language Transformations). This could be very practical and useful for creating in real time a customized, "shortened" version of a target web page specifically for mobile devices (e.g. Android and iOS devices).
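    As a rough sketch of how such a transformation could be done programmatically in Perl (the file names below are hypothetical), the XML::LibXSLT module can apply an XSLT stylesheet to the scraped XML output and produce the mobile-friendly markup:
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Load the scraped XML output and the XSLT stylesheet (hypothetical file names)
my $source    = XML::LibXML->load_xml( location => 'scraped_data.xml' );
my $style_doc = XML::LibXML->load_xml( location => 'mobile.xsl' );

# Compile the stylesheet and transform the XML into mobile-friendly HTML
my $xslt       = XML::LibXSLT->new();
my $stylesheet = $xslt->parse_stylesheet($style_doc);
my $results    = $stylesheet->transform($source);

print $stylesheet->output_as_bytes($results);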
    As you all know, smartphones and tablets have changed the computing world over the last few years. So, we thought it would be challenging and hopefully useful to build a web service capable of repurposing specified pages on the fly (through the use of a DEiXTo-based agent), keeping only the important/interesting stuff and returning it in a mobile-compatible fashion, suitable for small screens, by harnessing XML, XSLT and CSS. We did not fully implement the service, but we got a simple prototype ready to try our idea. And the results were quite encouraging!

    For the needs of our demo we used greektech-news, a technology news blog covering a plethora of interesting and fun topics around the IT industry. In the context of the demo, we supposed that we wanted to scrape the articles of the home page. So, we built a quick test scraper able to extract all the records found and generate an XML document with the data captured. With the use of an elegant XSLT stylesheet and a CSS file we achieved a nice, usable and easy to navigate structure, suitable for a smartphone screen (illustrated in the picture above). You can see live how the output XML file (containing 15 sample headlines) looks on an online iPhone simulator at the following address:
    The concept of the proposed web service is the following: suppose that you are an app developer or a website owner/administrator and that you need to display content inside your app or the mobile version of your site, either from a website under your control (meaning that you have access to its backend) or from another, "external" site (with respect to copyright and access restrictions). Often, though, it's not easy to retrieve data from the target website, or you simply don't know how to do it. Therefore, a service that could listen to requests for certain pages/URIs and return their important data in a suitable form could potentially be very useful. For example, ideally an HTTP request like http://deixto.com/webservice.pl?uri="http://example.com/.." would result in a good-looking XML chunk (formatted with XSLT and CSS) containing the data scraped from the original page (specified with the uri parameter).
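    To make the idea a bit more concrete, here is a minimal sketch of such a service as a Perl CGI script (the script name, the pattern file and the XML element names are just assumptions for illustration purposes, and a real implementation should also escape special characters and handle encoding properly):
use strict;
use warnings;
use CGI;
use DEiXToBot;

my $q   = CGI->new;
my $uri = $q->param('uri') or die "No uri parameter given\n";

# Fetch the requested page and run a pre-built extraction pattern on it
my $agent = DEiXToBot->new();
$agent->get($uri);
$agent->load_pattern('page_pattern.xml'); # hypothetical pattern built with the GUI tool
$agent->build_dom();
$agent->extract_content();

# Return the scraped records as an XML chunk referencing an XSLT stylesheet
print $q->header( -type => 'text/xml', -charset => 'utf-8' );
print qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print qq{<?xml-stylesheet type="text/xsl" href="mobile.xsl"?>\n}; # assumed stylesheet
print "<records>\n";
for my $record ( @{ $agent->records } ) {
    print "  <record>\n";
    print "    <field>$_</field>\n" for @{$record}; # no XML escaping here; sketch only
    print "  </record>\n";
}
print "</records>\n";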
    Finally, we would like to bring forward the fact that DEiXToBot contains best of breed Perl technology and allows extensive customization. Thus, it facilitates tailor-made solutions so as to make the data captured fully fit your project's needs. And towards this direction, deploying XSLT and XML-related technologies in general can really boost the utility and value of scraping and DEiXTo in particular!

Saturday, December 17, 2011

APIs vs Scraping - Cl@rity & Yperdiavgeia

Typically there are two main mechanisms to search and retrieve data from a website: either through an Application Programming Interface, commonly known as an API (if available), or via screen scraping. The first one is better, faster and more reliable. However, a search API is not always available, or, even if one exists, it may not fully cover your needs. In such cases, web robots, also called agents, are usually used in order to simulate a person searching the target website/online database through a web browser and to capture bits of interest by utilizing scraping techniques.

An API that has attracted some attention over the last few months in Greece is the Opendata API offered by the "Cl@rity" program ("Διαύγεια" in Greek). Since the 1st of October 2010, all Greek Ministries are obliged to upload their decisions and expenditure on the Internet, through the Cl@rity program. Cl@rity is one of the major transparency initiatives of the Ministry of Interior, Decentralization and e-Government. Each document uploaded is digitally signed and automatically given a unique transaction number by the system.
The Opendata API offers a variety of search parameters such as organization, type, tag (subject), ada (the unique number assigned), signer and date. However, a lot of parameters and functionality are still missing, such as full text search, as well as searching by certain criteria like the beneficiary's name, VAT registration number (ΑΦΜ in Greek), document title and other metadata fields.

A remarkable alternative for searching effectively through the documents of the Greek public organizations is yperdiavgeia.gr ("ΥπερΔιαύγεια" in Greek), a web-based platform built by Vangelis Banos, an expert in digital libraries and institutional repositories. Yperdiavgeia is a mirror site of Cl@rity that gets updated on a daily basis, and it provides a powerful and robust OpenSearch API which is far more usable and easy to harness. Its great advantage is that it facilitates full text searching. Currently, it lacks support for some parameters, but it seems that they are going to be added soon since it is under active development.
Even though both APIs mentioned above are really remarkable (especially for communicating and exchanging data with third party programs), there is still some room for utilizing scraping techniques and coming up with some "magic". In a previous post we described in detail an application we developed mostly for downloading a user-specified number of the latest PDF documents of a specific organization uploaded to Cl@rity. We believe that this little utility we created (offering both a GUI and a command line version) can be quite useful for many people working in the public sector and potentially save a lot of time and effort. For further information about it, please check out this post (although it is written in Greek).

So, in this short post we just wanted to point out that there are quite a lot of great APIs out there, provided mostly by large organizations (e.g., firms, governments, cultural institutions and digital libraries/collections) as well as the major players of the IT industry such as Google, Amazon, etc., offering amazing features and functionality. Nevertheless, scraping the native web interface of a target site can still be useful and sometimes yield a solution that overcomes difficulties and/or inefficiencies of APIs and produces an innovative outcome. Moreover, there are numerous websites that do not offer an API, so a scraper could be deployed in case data searching, gathering or exporting is needed. Therefore, the "battle" between APIs and scraping still rages on, and we are eager to see how things will evolve. Truth be told, we love them both!

Tuesday, December 13, 2011

DSpace & Institutional Repositories

Institutional Repositories (IRs) have emerged over the last few years and have become very popular in the academic library world. The system that has dominated the "market" is DSpace, an exciting, functionality-rich, open source software package that is installed at over 1,100 institutions around the globe. It offers an OAI-PMH web service for harvesting the metadata of the repository and getting its entire contents in Dublin Core format. However, OAI-PMH does not provide advanced search by certain criteria such as title, author, supervisor, etc. Even the REST API, which is under construction, does not facilitate searching by these metadata fields, at least to the best of our knowledge. Moreover, the default DSpace OpenSearch support still seems incomplete and a bit buggy. Therefore, a potential solution for searching a DSpace repository in real time could be submitting a query and scraping the results returned through its native web interface. This could probably be useful for building a federated search engine or perhaps for creating a mobile app (currently there is no mobile version of DSpace).
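For the sake of completeness, here is a rough sketch of what a plain OAI-PMH harvesting request looks like in Perl (the repository base URL below is a placeholder; the exact OAI endpoint path varies between installations):
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder base URL; check where your DSpace instance exposes its OAI-PMH endpoint
my $base = 'http://repository.example.org/oai/request';

my $ua       = LWP::UserAgent->new();
my $response = $ua->get("$base?verb=ListRecords&metadataPrefix=oai_dc");
die $response->status_line unless $response->is_success;

# Raw Dublin Core XML; parse it with XML::LibXML or a similar module
print $response->decoded_content;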

Bearing in mind the lack of an "advanced", multiple-criteria search API/mechanism in DSpace, we thought it would be interesting and perhaps useful to write a test scraper that could submit queries to a DSpace repository and fetch the search results through its website. So, we built a simple, DOM-based extraction rule (wrapper) with the DEiXTo GUI tool and then wrote a short DEiXToBot-based script that submits a sample query to Psepheda (the IR of the University of Macedonia) and scrapes the results returned. The following picture illustrates the first 10 results for a sample query by title.

To get a better idea of how a Perl, DEiXToBot-based script works, below you can find the code that scrapes the first 10 items containing "programming" in their title. The pattern used captures five metadata fields: detailed URL, title, date, authors and supervisor, and prints them on the screen. Of course, this script can easily be extended to submit user-specified queries as well as to navigate through all the result pages by following the Next page link ("επόμενη" is the inner text of this link in Greek); a rough sketch of the latter is given further below.

use strict;
use warnings;
use DEiXToBot;
use Encode;

# Create a new agent (essentially a browser emulator) and fetch the results page
my $agent = DEiXToBot->new();
$agent->get('http://dspace.lib.uom.gr/simple-search?query=((title:programming))');
# Load the extraction pattern previously built with the DEiXTo GUI tool
$agent->load_pattern('dspace_pattern.xml');
# Ignore em tags when building the DOM (so they do not interfere with the pattern)
$agent->ignore_tags( [ 'em' ] );
$agent->build_dom();
$agent->extract_content();
# Print the captured fields of each record extracted
for my $record (@{$agent->records}) {
    print encode_utf8(join("\n",@{$record})),"\n\n";
}

DEiXToBot is written in Perl, thus it is portable and can run on multiple operating systems, provided you have all the prerequisite Perl modules installed. You can download the lines of code given above along with the necessary pattern by clicking here. This short script serves as a good, simple example of utilizing the power and flexibility of DEiXToBot (a Mechanize agent object, essentially a browser emulator, which is able to execute patterns/extraction rules previously built with the GUI tool).
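As for the pagination extension mentioned earlier, a rough sketch could look like the following. It assumes that the DEiXToBot agent exposes WWW::Mechanize's find_link method (DEiXToBot wraps a Mechanize agent object), so adjust it to the actual API if it differs:
use strict;
use warnings;
use utf8; # so that the Greek link text below is treated as characters
use DEiXToBot;
use Encode;

my $agent = DEiXToBot->new();
$agent->get('http://dspace.lib.uom.gr/simple-search?query=((title:programming))');
while (1) {
    # Extract and print the records of the current result page
    $agent->load_pattern('dspace_pattern.xml');
    $agent->ignore_tags( [ 'em' ] );
    $agent->build_dom();
    $agent->extract_content();
    print encode_utf8(join("\n", @{$_})), "\n\n" for @{$agent->records};
    # Stop when there is no "next page" link ("επόμενη" in Greek)
    my $next = $agent->find_link( text_regex => qr/επόμενη/ );
    last unless $next;
    $agent->get( $next->url() );
}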

Generally, IRs have huge potential and in the next few years they are expected to play an increasingly important role in storing and preserving digital content, academic or not. By the way, a great federated search engine harvesting numerous Greek digital libraries and institutional repositories is openarchives.gr, which is mostly based upon OAI-PMH. It harnesses innovative technologies and has grown a lot since 2006, when it was first launched.

Last but not least, DEiXTo was used quite a long time ago by "Pantou", the federated search engine of the University of Macedonia, in order to scrape (in real time) multiple online resources simultaneously via their web interface/site. It is worth noting that a predecessor of the current DEiXToBot module was included, back in 2007, in the official dbWiz distribution, a remarkable, open source, federated search software package upon which Pantou was built.