Saturday, January 28, 2012

Federated searching & dbWiz

Nowadays, most university and college students, professors and researchers are increasingly seeking information and finding answers on the open Web. Google has become the dominant search tool for almost everyone, and its enormous popularity is easy to explain: it has a simple, effective interface and returns fast, accurate results.


    However, libraries, in their effort to win some patrons back, have tried to offer a decent search alternative by developing a new model: federated search engines. Federated searching (also known as metasearch or cross searching) allows users to search multiple web resources and subscription-based bibliographic databases simultaneously from a single interface. To achieve that, parallel processes are executed in real time, each retrieving results from a separate source. Then, the returned results are grouped together and presented to the user in a unified way.
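Just to illustrate the fan-out idea (this is not dbWiz code), here is a minimal Perl sketch that queries a few hypothetical sources in parallel with Parallel::ForkManager and merges whatever comes back; the source names and the search_source() stub are made up for the example:

use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical list of target sources; each would have its own search routine.
my @sources = ('PubMed', 'Scopus', 'LibraryCatalogue');

my $pm = Parallel::ForkManager->new(scalar @sources);
my @merged;

# Collect each child's results as it finishes.
$pm->run_on_finish(sub {
    my (undef, undef, undef, undef, undef, $results) = @_;
    push @merged, @{ $results || [] };
});

for my $source (@sources) {
    $pm->start and next;                                  # fork one child per source
    my @hits = search_source($source, 'web scraping');    # stub: query a single source
    $pm->finish(0, \@hits);                               # hand the results back to the parent
}
$pm->wait_all_children;

# Present the merged result list in a unified way (e.g. sorted by title).
print "$_->{source}: $_->{title}\n" for sort { $a->{title} cmp $b->{title} } @merged;

sub search_source {
    my ($source, $query) = @_;
    # In a real engine this would call an API, Z39.50 or a scraper for $source.
    return ( { source => $source, title => "Sample result from $source for '$query'" } );
}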
    Broadly speaking, there are two mechanisms for pulling data from the target sources: through an Application Programming Interface (API), or by scraping the native web interface/site of each database. The first method is undoubtedly better, but very often a search API is not available. In such cases, web robots (or agents) come into play and capture the information of interest, typically by simulating a human browsing through the target webpages. In academia especially, there are numerous online bibliographic databases. Some of them offer Z39.50 or API access, but a large number still do not provide any protocol-based search functionality. For those, scraping techniques have to be deployed (unless the vendor disallows bots).
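To give a flavour of the scraping approach, the rough Perl sketch below uses WWW::Mechanize to submit a search form and pull result titles off the page, much as a human would do by hand; the URL, form field and markup pattern are invented, and a real plugin would need site-specific (and sturdier) extraction rules:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( agent => 'FederatedSearchBot/0.1' );

# Hypothetical target database; a real wrapper would use the actual search page.
$mech->get('http://bibliodb.example.org/search');

# Fill in and submit the search form, as a human user would.
$mech->submit_form(
    form_number => 1,
    fields      => { query => 'digital libraries' },
);

# Naive extraction of result titles; real plugins need site-specific rules.
my @titles = $mech->content =~ m{<h3 class="result-title">\s*(.*?)\s*</h3>}gs;

print "$_\n" for @titles;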
    When I started my programming adventure with Perl back in 2006, in the context of my former full-time job at the Library of the University of Macedonia (Thessaloniki, Greece), I had the chance (and luck) to run across dbWiz, a remarkable open source federated search tool developed by the Simon Fraser University (SFU) Library in Canada. I was fascinated by dbWiz's internal design and implementation, and this is how I met and fell in love with Perl.
    dbWiz offered a friendly and usable admin interface that allowed you to create search categories and select, from a global list of resources, which databases would be active and searchable. If you needed to add a new resource though, you had to write your own plugin (Perl knowledge and programming skills were required). Some of the dbWiz search plugins were based on Z39.50, whereas others (the majority) relied on regular expressions and WWW::Mechanize (a handy Perl module that acts as a programmable web browser).
    The federated search engine I developed while working at the University of Macedonia (2006-2008) was named "Pantou" and became a valuable everyday tool for the students and professors of the University. The results of this work were presented at the 16th Panhellenic Academic Libraries Conference (Piraeus, 1-3 October 2007). Unfortunately, its maintenance stopped at the end of 2010 due to the economic crisis and severe cuts in funding, and a few months later some of its plugins started falling apart.
    Overall, delving into dbWiz taught me a lot about web development, Perl programming and GNU/Linux administration. I loved it! Meanwhile, in an effort to improve the relatively hard and tedious procedure of creating new dbWiz plugins, I put into practice an early version of GUI DEiXTo (which I was developing as my MSc thesis at the Aristotle University of Thessaloniki during the same period). The result was a new Perl module that allowed the execution of W3C DOM-based XML patterns (built with GUI DEiXTo) inside dbWiz and eliminated, to a large extent, the need for heavy use of regular expressions. That module, the first predecessor of today's DEiXToBot package, was included in the official dbWiz distribution after I contacted the dbWiz development team in 2007. Unfortunately, SFU Library ended the support and development of dbWiz in 2010.
    Looking back, I can now say with quite a bit of certainty that DEiXTo (more than ever before) can power federated search tools and help them extend their reach to previously inaccessible resources. As far as the search engine war is concerned, Google seems to be triumphing, but nobody can say for sure what will happen in the years to come. Time will tell...

Friday, January 20, 2012

Open Archives & Digital Libraries


The Open Archives Initiative (OAI) develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements, and its cornerstone is the Protocol for Metadata Harvesting (OAI-PMH), which allows data providers/repositories to expose their metadata in a structured format. A client can then make OAI-PMH service requests over HTTP to harvest that metadata.
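To show how lightweight the protocol is, here is a minimal Perl sketch issuing a ListRecords request over plain HTTP; the repository base URL is hypothetical, and a real harvester would also handle resumption tokens, errors and incremental (from/until) harvesting:

use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Hypothetical OAI-PMH base URL of a data provider (e.g. a DSpace or EPrints repository).
my $base_url = 'http://repository.example.edu/oai/request';

my $uri = URI->new($base_url);
$uri->query_form(
    verb           => 'ListRecords',
    metadataPrefix => 'oai_dc',       # ask for simple Dublin Core records
);

my $ua       = LWP::UserAgent->new;
my $response = $ua->get($uri);
die 'OAI-PMH request failed: ', $response->status_line unless $response->is_success;

# The response body is an XML document containing <record> elements
# (plus a resumptionToken when the result set is split into chunks).
print $response->decoded_content;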
    openarchives.gr is a great search engine that harvests 57 Greek digital libraries and institutional repositories (as of January 2012). It currently provides access to almost half a million(!) documents (mainly undergraduate theses and Master's/PhD dissertations) and its index is updated on a daily basis. It began operating back in 2006, designed and implemented by Vangelis Banos, and since May 2011 it has been hosted, managed and co-developed by the National Documentation Centre (EKT). What makes this amazing search tool even more remarkable is that it is built entirely on open source/free software.
    A point that needs some clarification is that when a user searches openarchives.gr, the query is not submitted in real time to the target sources. Instead, the search is performed locally on the openarchives.gr server, where copies of the repositories'/libraries' metadata are stored (and updated at regular intervals).
    The majority of the sources searched by openarchives.gr are OAI-PMH compliant repositories (such as DSpace or EPrints), so their data are periodically retrieved via their OAI-PMH endpoints. However, it is worth mentioning that non-OAI-PMH digital libraries have also been included in its database. This was made possible by scraping their websites with DEiXTo and transforming their metadata into Dublin Core. In this way, more than 16,000 records from 6 significant online digital libraries (such as the Lyceum Club of Greek Women and the Music Library of Greece “Lilian Voudouri”) were inserted into openarchives.gr with the use of DEiXTo wrappers and custom Perl code.
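As a rough sketch of that last transformation step (not the actual openarchives.gr code), a scraped record can be serialized as a simple Dublin Core fragment with XML::Writer; the record values and output file name below are invented for the example:

use strict;
use warnings;
use XML::Writer;
use IO::File;

# A record as it might come out of a DEiXTo wrapper (hard-coded for the example).
my %record = (
    title   => 'Greek folk costumes of Macedonia',
    creator => 'Lyceum Club of Greek Women',
    date    => '1955',
    url     => 'http://digital.example.gr/item/123',
);

my $out    = IO::File->new('> record_dc.xml') or die "Cannot open output file: $!";
my $writer = XML::Writer->new(OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 2);

$writer->xmlDecl('UTF-8');
$writer->startTag('oai_dc:dc',
    'xmlns:oai_dc' => 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'xmlns:dc'     => 'http://purl.org/dc/elements/1.1/',
);
$writer->dataElement('dc:title',      $record{title});
$writer->dataElement('dc:creator',    $record{creator});
$writer->dataElement('dc:date',       $record{date});
$writer->dataElement('dc:identifier', $record{url});
$writer->endTag('oai_dc:dc');
$writer->end;
$out->close;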
    Finally, digital collections have flourished over the last few years and enjoy growing popularity. However, most of them do NOT expose their content via OAI-PMH or another appropriate metadata format. In fact, many of them (especially legacy systems) do NOT even offer an API or an SRW/U interface. Consequently, we believe there is much room for DEiXTo to help cultural and educational organizations (e.g., museums, archives, libraries and multimedia collections) export, present and distribute their digitized items and rich content to the outside world, in an efficient and structured way, by scraping and repurposing their data.

Tuesday, January 17, 2012

Netnography & Scraping

Netnography, or digital ethnography, is (or should be) the proper translation of ethnographic methods to online environments such as bulletin boards and social sites. It is more or less what ethnographers do in physical places like squares, pubs and clubs: observe what people say and do, and try to participate as much as possible in order to better understand the actions and discourses involved. Ethnography can answer many of the what, when, who and how questions that define everyday problems. However, netnography differs from ethnography in many ways, especially in how it is conducted.
    Forums, wikis and the blogosphere are good online equivalents of public squares and pubs. There are no physical identities, but online ones; no faces, but avatars; no gender, age or other reliable information about physical identities, but there are voices discussing and arguing about topics of common interest.
    The more popular a forum is, the more difficult it gets to follow it netnographically. A netnographer has to use a Computer Assisted Qualitative Data Analysis (CAQDA) tool (such as RQDA) on certain parts of the texts collected during the research. In a forum use case, these texts would be posts and threads. If the researcher had to browse the forum and manually copy and paste its content, a huge amount of effort would be required. However, this obstacle can be overcome by scraping the forum with a web data extraction tool such as DEiXTo.
    A scraped forum is a jewel: perfectly ordered textual data for each thread, ready for further analysis. This is where DEiXTo comes into play and can boost the research process significantly. To our knowledge, Dr Juan Luis Chulilla Cano, CEO of Online and Offline Ltd., has been successfully using scraping techniques to capture the threads of popular Spanish forums (and their metadata) and transform them into a structured format suitable for post-processing. Typically, such sites have a common presentation style for their threads and offer rich metadata. Thus, they are potential goldmines upon which various methodologies can be tested and applied in order to discover knowledge and trends and draw useful conclusions.
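As a purely hypothetical illustration (the forum URL and markup classes below are invented), a few lines of Perl with WWW::Mechanize can turn a thread page into tab-delimited text that a CAQDA tool can then import:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://forum.example.es/viewtopic.php?t=12345');   # hypothetical thread page

open my $out, '>:encoding(UTF-8)', 'thread_12345.tsv' or die $!;
print {$out} join("\t", 'author', 'post'), "\n";

# Pull author/post pairs out of the thread markup (the classes are invented here;
# every forum engine needs its own pattern, which is exactly what a DEiXTo wrapper encodes).
my $html = $mech->content;
while ( $html =~ m{<div class="post-author">(.*?)</div>\s*<div class="post-body">(.*?)</div>}gs ) {
    my ($author, $post) = ($1, $2);
    $post =~ s/<[^>]+>//g;     # strip leftover HTML tags
    $post =~ s/\s+/ /g;        # collapse whitespace so each post fits on one row
    print {$out} join("\t", $author, $post), "\n";
}
close $out;

A second pass over the remaining threads of a board could be scripted in exactly the same way.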
    Finally, netnography and anthropology seem to have been gaining momentum over the last few years. They are really interesting as well as challenging fields, and scraping could evolve into an important ally. It is worth mentioning that quite a few IT vendors and firms employ ethnographers for R&D and for testing new products. Therefore, there is a lot of potential in using computer-aided techniques in the context of netnography. So, if you come from the social sciences and creating wrappers/extraction rules is not second nature to you, why don't you drop us an email? Perhaps we could help you gather a few tons of usable data with DEiXTo! Unless, of course, terms of use or copyright restrictions forbid it...

Friday, January 13, 2012

Geo-location data, Yahoo! PlaceFinder & Google Maps API

Location-aware applications have enjoyed huge success over the last few years, and geographic data have been used extensively in a wide variety of ways. Meanwhile, there are numerous places of interest out there, such as shopping malls, airports, restaurants, museums and transit stations, and for most of them the addresses are publicly available on the Web. Therefore, you could use DEiXTo (or a web data extraction tool of your choice) to scrape the desired location information for any points of interest and then post-process it to produce geographic data for further use.
    Yahoo! PlaceFinder is a great web service that supports worldwide geocoding of street addresses and place names. It allows developers to convert addresses and places into geographic coordinates (and vice versa). Thus, you can send it an HTTP request with a street address and get the latitude and longitude back! It's amazing how well it works. Of course, the more complete and detailed the address, the more precise the returned coordinates.
    In the context of this post, we thought it would be nice, mostly for demonstration purposes, to build a map of Thessaloniki museums using the Google Maps API and geo-location data generated with Yahoo! PlaceFinder. The source of data for our demo was Odysseus, the WWW server of the Hellenic Ministry of Culture that provides a full list of Greek museums, monuments and archaeological sites.
    So, we searched for museums located in the city of Thessaloniki (the second-largest city in Greece and the capital of the region of Central Macedonia) and extracted, through DEiXTo, the street addresses of the ten results returned. In the picture below you can see a sample screenshot of the "INFORMATION" section of the detailed Odysseus webpage for the Folk Art and Ethnological Museum of Macedonia and Thrace (from which the address of this specific museum was scraped):
    After capturing the name and location of each museum and exporting them to a simple tab-delimited text file, we wrote a Perl script harnessing the Geo::Coder::PlaceFinder CPAN module to automatically find their geographic coordinates and create an XML output file containing all the necessary information (through XML::Writer). Part of this XML document is displayed in the screenshot below.
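For readers who would like to recreate this step, here is a rough sketch of what such a geocoding script might look like; the file names, XML element names and placeholder app ID are illustrative, and our actual script differs in its details (it is worth double-checking the exact keys of the hash Geo::Coder::PlaceFinder returns against its documentation):

use strict;
use warnings;
use Geo::Coder::PlaceFinder;
use XML::Writer;
use IO::File;

# 'YOUR_APP_ID' is a placeholder; PlaceFinder requires a (free) Yahoo! application ID.
my $geocoder = Geo::Coder::PlaceFinder->new( appid => 'YOUR_APP_ID' );

my $out    = IO::File->new('> museums.xml') or die "Cannot open output file: $!";
my $writer = XML::Writer->new(OUTPUT => $out, DATA_MODE => 1, DATA_INDENT => 2);
$writer->xmlDecl('UTF-8');
$writer->startTag('museums');

open my $in, '<:encoding(UTF-8)', 'museums.tsv' or die $!;   # one "name <TAB> address" per line
while (my $line = <$in>) {
    chomp $line;
    my ($name, $address) = split /\t/, $line;
    my $place = $geocoder->geocode( location => $address ) or next;

    $writer->startTag('museum');
    $writer->dataElement('name',    $name);
    $writer->dataElement('address', $address);
    $writer->dataElement('lat',     $place->{latitude});
    $writer->dataElement('lng',     $place->{longitude});
    $writer->endTag('museum');
}
close $in;

$writer->endTag('museums');
$writer->end;
$out->close;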
    Once we had all the metadata we needed in this XML file, we utilized the Google Maps JavaScript API v3 and created a map (centered on Thessaloniki) displaying all the city's museums! To accomplish that, we followed the helpful guidelines given in this very informative post about Google Maps markers and wrote a short script that parsed the XML contents (via XML::LibXML) and produced a web page with the desired Google Map object embedded (including a marker for each museum). The end result, shown below, was pretty satisfying (after some extra manual effort, to be absolutely honest).
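For the curious, here is a stripped-down sketch of what that page-generating script might look like; the element names match the hypothetical museums.xml sketch above, and the map options follow basic Google Maps JavaScript API v3 usage rather than our exact page:

use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( location => 'museums.xml' );

# Turn each <museum> element into a JavaScript marker definition.
my $markers_js = '';
for my $museum ( $doc->findnodes('/museums/museum') ) {
    my ($name, $lat, $lng) = map { $museum->findvalue($_) } qw(name lat lng);
    $name =~ s/'/\\'/g;    # naive escaping for the inline JavaScript string
    $markers_js .= "new google.maps.Marker({ position: new google.maps.LatLng($lat, $lng), map: map, title: '$name' });\n";
}

open my $out, '>:encoding(UTF-8)', 'museums_map.html' or die $!;
print {$out} <<"HTML";
<!DOCTYPE html>
<html>
<head>
  <script src="http://maps.googleapis.com/maps/api/js?sensor=false"></script>
  <script>
    function init() {
      // Center the map roughly on Thessaloniki.
      var map = new google.maps.Map(document.getElementById('map'),
        { zoom: 13, center: new google.maps.LatLng(40.6401, 22.9444), mapTypeId: google.maps.MapTypeId.ROADMAP });
      $markers_js
    }
  </script>
</head>
<body onload="init()">
  <div id="map" style="width: 800px; height: 600px;"></div>
</body>
</html>
HTML
close $out;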
    This is kind of cool, isn't it? Of course, the same procedure could be applied on a larger scale (e.g. to create a map of Greece with ALL available museums and/or monuments) or extended to other points of interest (whatever you can imagine, from schools and educational institutions to cinemas, supermarkets, shops or bank ATMs). In conclusion, we think that combining DEiXTo with other powerful tools and technologies can sometimes yield an innovative and hopefully useful outcome. Once you have the raw web data at your disposal (captured with DEiXTo), your imagination (and perhaps copyright restrictions) is the only limit!

Wednesday, January 4, 2012

DEiXTo powers Michelin Maps and Guides!

One of the biggest success stories of DEiXTo is that it was used a few months ago by the Maps and Guides UK division of Michelin to build a France gazetteer web application. If you are going on holiday to France, you will probably need hotel and restaurant guides, maps, atlases and tourist guides relevant to where you are staying or the places you will visit. The free online Michelin database can help you find out which ones are right for you.
    DEiXTo's contribution to this useful service was scraping geo-location data, as well as other metadata fields, from Wikipedia for 36,000+ French communes. In France the smallest administrative division is the commune, and Wikipedia happened to have all of the relevant information freely available!
    The starting target page contained a list of 95 (or so) departments, each of which contains a large number of communes. Each department's detail page in turn lists all of its communes and their corresponding hyperlinks/URLs. A sample department page looks like this. And last, at the level below, we have the actual pages of interest with all the details needed about each commune. You can see a sample commune Wikipedia page by clicking here, and a screenshot of it in the picture below. Meanwhile, this "scenario" also serves as a good example of collaborating wrappers, where the output of one wrapper (a txt file with URLs) gets passed as input to a second one.
    It should be noted that there were slight variations in the layout and structure of the target pages. However, the algorithm DEiXTo uses is quite efficient and robust and can usually deal with such cases. To be more specific, the scraper that was deployed extracted the following metadata from each commune page: region, department, arrondissement, canton and, importantly, the latitude and longitude.
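Although the actual work was done with DEiXTo wrappers, the coordinate-grabbing part of the idea can be illustrated with a few lines of plain Perl; Wikipedia's markup changes over time, so treat the pattern below (based on the Geo microformat span that the coordinates template emits) as indicative only, and the sample commune page is just an example:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( agent => 'CommuneScraper/0.1 (demo)' );
$mech->get('http://en.wikipedia.org/wiki/Abbeville');   # a sample commune page

my $html = $mech->content;

# Wikipedia's coordinates template emits a machine-readable span such as
# <span class="geo">50.1058; 1.8358</span>, handy for latitude/longitude.
my ($lat, $lon) = $html =~ m{<span class="geo">\s*([-\d.]+);\s*([-\d.]+)\s*</span>};

print "latitude: $lat, longitude: $lon\n" if defined $lat;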
    The precision and recall that DEiXTo achieved on these commune pages were amazing (very close to 100%), and as a result the database was enriched with the large volume of information captured. We are really happy that Michelin was able to successfully utilize DEiXTo and create a free and useful online service. So, if you are planning a trip to France, you know where to find an informative online map/guide! :)

Monday, January 2, 2012

Cooperating DEiXTo agents

Basically, there are two broad categories of cooperating DEiXTo wrappers. In the first, the wrappers are executed against the same, single page so as to capture bits of interest that are scattered all over that particular target page. The second category comprises cases where the output of one wrapper serves as input for a second one; typically, the output of the first wrapper is a txt file containing the target URLs leading to pages with detailed information.
    The first category is not supported directly by the GUI tool. However, DEiXToBot (a Mechanize agent object capable of executing extraction rules previously built with the GUI tool) allows multiple extraction rules/patterns to be applied to the same page and their results combined through Perl code. So, if you have come across a complex, data-rich page and you are fluent with Perl and DEiXToBot's interface, you can build the necessary tree patterns separately with the GUI tool and then write a highly efficient set of cooperating Perl robots to capture all the desired data. It is not easy though, since it requires programming skills and custom code.
    As far as the second type of collaboration is concerned, we have come across numerous cases where a first wrapper collects the detailed target URLs from listing pages and passes them to a second wrapper, which in turn takes over and gathers all the data of interest from the pages containing the full text/description. A typical case would be a blog, a news site or an e-shop, where a first agent scrapes the URLs of the detailed pages and a second one visits each of them, extracting every piece of desired information. If you are wondering how to set a DEiXTo wrapper to visit multiple target pages, this can be done either through a text file containing their addresses or via a list; both ways can be specified in the Project Info tab of the DEiXTo GUI tool.
    For the first wrapper, which is intended to scrape the URLs, you only have to create a pattern that locates the links to the detailed pages. Usually this is easy and straightforward: you just point at a representative link, use it as a record instance and set the A rule node as "checked" (right click on the A node and select "Match and Extract Content"). The resulting pattern will look something like this:
    Then, by executing the rule, you can extract the "href" attribute (essentially the URI) of each matching link and export the results to a txt file, say target_urls.txt, which is subsequently fed to the next wrapper. Please note that if you provide just the A rule node as a pattern, you will capture ALL the hyperlinks found on the page, which is probably not what you want (we want only those leading to the detailed pages).
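The same first step can also be expressed in a few lines of plain Perl, shown purely for illustration (with the GUI tool no coding is needed); the listing URL and link pattern below are hypothetical:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://news.example.com/archive');    # hypothetical listing page

open my $out, '>', 'target_urls.txt' or die $!;

# Keep only the links that lead to the detailed pages (site-specific pattern).
for my $link ( $mech->find_all_links( url_regex => qr{/article/\d+} ) ) {
    print {$out} $link->url_abs, "\n";
}
close $out;

The resulting target_urls.txt can then be specified in the Project Info tab and handed over to the second wrapper.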
    In conclusion, DEiXTo can power schemes of cooperating robots and achieve very high precision. Especially for more advanced cases, synergies of multiple wrappers are always needed; their coordination, though, usually takes some careful thought and effort. Should you have any questions, please do not hesitate to contact us!