Sunday, February 19, 2012

Linked Data & DEiXTo

As explained in a previous post, DEiXTo can scrape the content of digital libraries, archives and multimedia collections lacking an API and enable the transformation of their metadata (through post-processing and custom Perl code) to Dublin Core and subsequently into OAI-PMH or another suitable form, e.g. Europeana Semantic Elements (ESE).
    Meanwhile, the Web has become a dynamic collaboration platform that allows everyone to meet, read and, more importantly, write. Thus, it is steadily approaching the vision of Tim Berners-Lee (the inventor of the World Wide Web): the Linked Data Web, a place where related data are linked and information is represented in a more structured and easily machine-processable way.
    Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. Its key technologies are URIs (a generic method for identifying resources on the Internet), the Hypertext Transfer Protocol (HTTP) and RDF (a data model and a general method for conceptually describing things in the real world). It is an exciting area and is expected to make great progress in the next few years. A video that does a nice job of explaining what Linked Open Data is all about can be found here: http://vimeo.com/36752317
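    To make this a bit more concrete, here is a minimal, purely illustrative RDF example in Turtle notation: two resources identified by HTTP URIs, described with Dublin Core properties and linked to a third resource elsewhere on the Web. All the URIs and values below are made up for the sake of the example.

```turtle
@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# A digitized item in a (fictional) library repository
<http://example.org/library/item/42>
    dc:title   "Manuscript of 1843" ;
    dc:creator <http://example.org/library/person/7> ;
    # Link the local record to a description of the same thing elsewhere
    owl:sameAs <http://dbpedia.org/resource/Example_Manuscript> .
```

    Because the item and its creator are identified by HTTP URIs, anyone (human or machine) can dereference them and follow the links to related data.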
    Over the last decade, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has become the de facto standard for metadata exchange in digital libraries, and it is playing an increasingly important role. However, it has two major drawbacks: it does not make its resources accessible via dereferenceable URIs, and it provides only restricted means of selective access to metadata. Therefore, there is a strong need for efficient tools that allow metadata repositories to expose their content according to the Linked Data guidelines. This would make digitized items and media objects accessible via HTTP URIs and queryable via the SPARQL protocol.
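    For instance, once a repository's metadata are exposed this way, a consumer could ask a very selective question with a short SPARQL query; the endpoint and the exact vocabulary are hypothetical here, but the shape of the query is typical:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?item ?title
WHERE {
  ?item dc:title ?title ;
        dc:date  "1843" .
}
LIMIT 10
```

    Compare this with plain OAI-PMH, where you would typically harvest whole record sets and do the filtering yourself on the client side.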
    Dr Haslhofer has carried out significant research and development work in this direction. Among other things, he developed the OAI2LOD Server, based on the D2R Server implementation, and wrote the ESE2EDM converter, a collection of Ruby scripts that convert XML-based ESE source files into the RDF-based Europeana Data Model (EDM). These remarkable tools could prove very useful for making large volumes of information Linked-Data ready, with all the advantages this brings.
    Linked Open Data can change the computer world as we know it. So, there is a lot of potential in combining DEiXTo with Linked Data technologies. Their blend could eventually produce an innovative and useful outcome. Many already believe that Linked Data is the next big thing. Time will tell. Meanwhile, DEiXTo can definitely help you generate structured data in a variety of formats from unstructured HTML pages, whether your ultimate goal is Linked Data or not.

Saturday, February 11, 2012

DEiXTo components clarified

From the emails and feedback received, it seems that many people get a bit confused about the utility and functionality of the DEiXTo GUI tool compared to the Perl command line executor (CLE). DEiXToBot is even more confusing for quite a few users. So, let's clarify things.
    The GUI tool is freeware (available at no cost, but without any source code, at least yet) and allows you to visually build and execute extraction rules for web pages of interest with point-and-click convenience. It offers an embedded web browser and a friendly graphical interface that highlights an element/record instance as the mouse moves over it. The GUI tool is a Windows-only application that harnesses Internet Explorer's HTML parser and rendering engine. It is worth noting that it supports simple cooperative extraction scenarios as well as periodic, scheduled execution through batch files and the Windows Task Scheduler. Perhaps its main drawback is that it can execute just one pattern per page, although in many cases (maybe the majority) a single extraction rule is enough to get the job done.
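    As a rough sketch of the scheduling scenario: a small batch file runs the wrapper, and the batch file is then registered with the Windows Task Scheduler. The paths, file names and the exact way a wrapper is launched below are placeholders; check your own DEiXTo installation for the actual invocation.

```bat
rem run_wrapper.bat -- hypothetical example; adjust the paths and the
rem launch command to however your DEiXTo installation expects them
"C:\DEiXTo\deixto.exe" "C:\DEiXTo\projects\news.wpf"

rem Register the batch file to run every night at 3am (run once, from a prompt):
rem schtasks /Create /TN "DEiXTo nightly scrape" /TR "C:\DEiXTo\run_wrapper.bat" /SC DAILY /ST 03:00
```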
    On the other hand, the command line executor, or CLE for short, is implemented in Perl and freely distributed under the GNU General Public License v3, so its source code is included. Its purpose is to execute wrapper project files (.wpf) that have previously been created with the GUI tool. It runs in a command prompt window on Windows or in a Linux/Mac terminal. Besides the source code, we have built two standalone executables so that you can run CLE on either a Windows or a GNU/Linux machine without having Perl or any prerequisite modules installed. CLE is faster, offers more output formats and has some additional features, such as an efficient post-processing mechanism and database support. However, it shares the same shortcoming as the GUI tool: it supports just one pattern per page. Finally, it relies on DEiXToBot, a "homemade" package that facilitates the execution of wrappers generated with GUI DEiXTo.
    DEiXToBot is the third and probably the most powerful and well-crafted software component of the DEiXTo scraping suite, and it is available under the GPL v3 license. It is a Perl module based on WWW::Mechanize::Sleepy, a handy web browser Perl object, and several other CPAN modules. It allows extensive customization and tailor-made solutions, since it facilitates the combination of multiple extraction rules/patterns as well as the post-processing of their results through custom code. Therefore, it can deal with complex cases and cover more advanced web scraping needs, but it requires programming skills to use.
    The bottom line is that DEiXToBot is the essence of our long experience. The GUI tool might be more suitable for most everyday users (due to its visual convenience), but when things get difficult or the situation requires a more advanced solution (e.g. scheduled or on-demand execution and coordination of multiple wrappers on a GNU/Linux server), a customized DEiXToBot-based script is the way to go. You can use the GUI tool first to create the necessary patterns and then deploy a Perl script that uses them to extract structured data from the pages of the target website. So, if you are familiar with Perl, you should not find it very hard to write your first DEiXTo-based spider/crawler!
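    To give a flavour of what such a script looks like, here is a minimal, hypothetical Perl sketch. It does not use the real DEiXToBot API; instead, core modules stand in for it (HTTP::Tiny fetches the page, and a regular expression plays the role of an extraction pattern), so treat the function names, the HTML structure and the URL as placeholders only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Stand-in for a DEiXTo extraction pattern: pull the text of every
# <h2 class="title"> element out of a page. A real DEiXToBot script
# would apply a pattern built with the GUI tool instead.
sub extract_titles {
    my ($html) = @_;
    my @titles = $html =~ m{<h2 class="title">\s*(.*?)\s*</h2>}gs;
    return @titles;
}

# Fetch a page and run the "pattern" on it (the URL is a placeholder).
sub scrape_page {
    my ($url) = @_;
    my $response = HTTP::Tiny->new->get($url);
    die "Failed to fetch $url\n" unless $response->{success};
    return extract_titles($response->{content});
}

# Offline demonstration on a canned HTML snippet:
my $sample = '<h2 class="title">First item</h2><h2 class="title">Second item</h2>';
print "$_\n" for extract_titles($sample);
```

    In a real deployment you would replace the regular expression with your GUI-built patterns and loop scrape_page() over the list of target URLs, which is essentially what a simple DEiXTo-based crawler does.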