deixto.com/blog: Institutional repositories

Friday, January 20, 2012

Open Archives & Digital Libraries

The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements, and its cornerstone is the Protocol for Metadata Harvesting (OAI-PMH), which allows data providers (repositories) to expose metadata about their content in a structured format. A client can then issue OAI-PMH requests over HTTP to harvest that metadata.
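To make the request side more concrete, here is a minimal sketch of how a harvester might assemble OAI-PMH request URLs. The verbs (Identify, ListRecords) and the arguments (metadataPrefix, from) come from the protocol itself; the base URL and the helper names are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Percent-encode every byte outside the RFC 3986 "unreserved" set.
# Assumes the value is a byte string (true for the ASCII values used here).
sub uri_escape_simple {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Build an OAI-PMH request URL from a base URL, a verb and optional arguments.
# Arguments are sorted by name so that the output is deterministic.
sub oai_request_url {
    my ($base_url, $verb, %args) = @_;
    my @pairs = ("verb=$verb");
    for my $name (sort keys %args) {
        push @pairs, $name . '=' . uri_escape_simple($args{$name});
    }
    return $base_url . '?' . join('&', @pairs);
}

# Hypothetical endpoint; every real repository publishes its own base URL.
my $base = 'http://repository.example.org/oai/request';

# Ask the repository to describe itself.
print oai_request_url($base, 'Identify'), "\n";

# Ask for all Dublin Core records added or changed since a given date.
print oai_request_url($base, 'ListRecords',
    metadataPrefix => 'oai_dc',
    from           => '2012-01-01',
), "\n";
```

Fetching the resulting URL with any HTTP client returns an XML response. Large result sets are delivered in chunks, and the harvester follows them by repeating the request with the resumptionToken supplied in each response.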

openarchives.gr is a great federated search engine harvesting 57 Greek digital libraries and institutional repositories (as of January 2012). It currently provides access to almost half a million(!) documents (mainly undergraduate theses, Master's theses and PhD dissertations) and its index is updated daily. It began operating back in 2006, designed and implemented by Vangelis Banos, and since May 2011 it has been hosted, managed and co-developed by the National Documentation Centre (EKT). What makes this amazing search tool even more remarkable is that it is built entirely on open source/free software.

A tricky point that needs some clarification is that when a user searches openarchives.gr, the query is not submitted in real time to the target sources. Instead, it is run locally on the openarchives.gr server, where full copies of the repositories/libraries are stored (and updated at regular intervals).
The majority of the sources searched by openarchives.gr are OAI-PMH compliant repositories (such as DSpace or EPrints), so their data are periodically retrieved via their OAI-PMH endpoints. However, it is worth mentioning that non-OAI-PMH digital libraries have also been included in its database. This was made possible by scraping their websites with DEiXTo and transforming their metadata into Dublin Core. In this way, more than 16,000 records from 6 significant online digital libraries (such as the Lyceum Club of Greek Women and the Music Library of Greece “Lilian Voudouri”) were inserted into openarchives.gr with the use of DEiXTo wrappers and custom Perl code.
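As a rough illustration of the transformation step, the sketch below serializes one scraped record as an oai_dc element. It is not the code actually used for openarchives.gr: the record contents are made up, the helper names are hypothetical, and only the two namespace URIs and the Dublin Core element names (title, creator, date, identifier) are standard.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Serialize a scraped record as an oai_dc element.
# The record maps each Dublin Core element name to a list of values,
# because every Dublin Core element is repeatable.
sub record_to_oai_dc {
    my ($record) = @_;
    my @lines = (
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
        . ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    );
    for my $element (sort keys %{$record}) {
        for my $value (@{ $record->{$element} }) {
            push @lines, "  <dc:$element>" . xml_escape($value) . "</dc:$element>";
        }
    }
    push @lines, '</oai_dc:dc>';
    return join("\n", @lines) . "\n";
}

# Made-up record, standing in for the fields a wrapper might extract.
my %scraped = (
    title      => ['Folk songs & dances'],
    creator    => ['Example Author'],
    date       => ['1935'],
    identifier => ['http://library.example.org/item/42'],
);

print record_to_oai_dc(\%scraped);
```

Keeping every field as a list means a record with several authors simply produces several dc:creator elements, which is how Dublin Core expects repeated values to be expressed.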

Finally, digital collections have flourished over the last few years and enjoy growing popularity. However, most of them do NOT provide their contents via OAI-PMH or in another appropriate metadata format. In fact, many of them (especially legacy systems) do NOT even offer an API or an SRW/U interface. Consequently, we believe there is plenty of room for DEiXTo to help cultural and educational organizations (e.g., museums, archives, libraries and multimedia collections) export, present and distribute their digitized items and rich content to the outside world, in an efficient and structured way, by scraping and repurposing their data.

Tuesday, December 13, 2011

DSpace & Institutional Repositories

Institutional Repositories (IRs) have emerged over the last few years and have become very popular in the academic library world. The system that has dominated the "market" is DSpace, an exciting, functionality-rich, open source software package that is installed at over 1,100 institutions around the globe. It offers an OAI-PMH web service for harvesting the metadata of the repository and getting its entire contents in Dublin Core format. However, OAI-PMH is a harvesting protocol and does not support searching by criteria such as title, author, supervisor, etc. Even the REST API, which is under construction, does not facilitate searching on these metadata fields, at least to the best of our knowledge. Moreover, the default DSpace OpenSearch support still seems incomplete and a bit buggy. Therefore, a potential solution for searching a DSpace repository in real time could be to submit a query and scrape the results returned through its native web interface. This could be useful for building a federated search engine or perhaps for creating a mobile app (currently there is no mobile version for DSpace).

Bearing in mind that DSpace lacks an "advanced", multiple-criteria search API/mechanism, we thought it would be interesting and perhaps useful to write a test scraper that could submit queries to a DSpace repository and fetch the search results through its website. So, we built a simple, DOM-based extraction rule (wrapper) with the DEiXTo GUI tool and then wrote a short DEiXToBot-based script that submits a sample query to Psepheda (the IR of the University of Macedonia) and scrapes the results returned. The following picture illustrates the first 10 results for a sample query by title.

To get a better idea of how a Perl, DEiXToBot-based script works, below you can find the code that scrapes the first 10 items containing "programming" in their title. The pattern used captures five metadata fields (detail page URL, title, date, authors and supervisor) and the script prints them on the screen. Of course, this script can easily be extended to submit user-specified queries as well as to navigate through all the result pages by following the Next page link ("επόμενη" is the inner text of this link in Greek).

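use strict;
use warnings;
use DEiXToBot;
use Encode;

# Create the agent and fetch the first page of results for the sample query
my $agent = DEiXToBot->new();
$agent->get('http://dspace.lib.uom.gr/simple-search?query=((title:programming))');

# Load the extraction rule (wrapper) built with the DEiXTo GUI tool
$agent->load_pattern('dspace_pattern.xml');
$agent->ignore_tags( [ 'em' ] );

# Build the DOM tree of the page and run the pattern against it
$agent->build_dom();
$agent->extract_content();

# Print the fields of each extracted record, one per line
for my $record (@{$agent->records}) {
    print encode_utf8(join("\n", @{$record})), "\n\n";
}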

DEiXToBot is written in Perl, so it is portable and can run on multiple operating systems, provided you have all the prerequisite Perl modules installed. The code given above, along with the necessary pattern, is available for download. This short script serves as a good, simple example of harnessing the power and flexibility of DEiXToBot (a Mechanize agent object, essentially a browser emulator, which is able to execute patterns/extraction rules previously built with the GUI tool).
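One possible first step towards user-specified queries is a small helper that builds the search URL from a field and a term given on the command line. The sketch below only mirrors the ((field:term)) shape of the sample URL used in the script; the helper names are hypothetical, and whether the repository accepts fields other than title in this syntax is an assumption.

```perl
use strict;
use warnings;

# Percent-encode a query value byte by byte, leaving alone the unreserved
# characters plus the parentheses and colon that the sample URL uses as-is.
sub escape_query {
    my ($value) = @_;
    utf8::encode($value) if utf8::is_utf8($value);
    $value =~ s/([^A-Za-z0-9\-._~():])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Build a simple-search URL of the same shape as the one in the script above.
# The term is inserted verbatim, so parentheses or colons inside it would
# change the meaning of the query.
sub dspace_search_url {
    my ($base_url, $field, $term) = @_;
    return $base_url . '/simple-search?query=' . escape_query("((${field}:${term}))");
}

# Field and term come from the command line, with the sample query as default.
my ($field, $term) = @ARGV;
$field = 'title'       unless defined $field;
$term  = 'programming' unless defined $term;

print dspace_search_url('http://dspace.lib.uom.gr', $field, $term), "\n";
```

The returned string can be passed to $agent->get() in place of the hard-coded URL, leaving the rest of the script unchanged.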

Generally, IRs have huge potential, and in the next few years they are expected to play an increasingly important role in storing and preserving digital content, academic or not. By the way, a great federated search engine harvesting numerous Greek digital libraries and institutional repositories is openarchives.gr, which is mostly based on OAI-PMH. It harnesses innovative technologies and has grown a lot since it was first launched in 2006.

Last but not least, DEiXTo was used quite some time ago by "Pantou", the federated search engine of the University of Macedonia, to scrape multiple online resources simultaneously (in real time) via their web interfaces/sites. It is worth noting that a predecessor of the current DEiXToBot module was included, back in 2007, in the official distribution of dbWiz, a remarkable open source federated search software package upon which Pantou was built.