Wednesday, December 28, 2011

Robots.txt & access restrictions

A serious matter that many people often ignore (deliberately or not) is the access and copyright restrictions that website owners/administrators impose. A lot of websites want robots out entirely. The method they use to keep cooperating web robots away from certain site content is a robots.txt file, which resides in their root directory and functions as a request that visiting bots ignore specified files or directories.
For example, the following two lines indicate that robots should not visit any pages on the site:
User-agent: *
Disallow: /
However, a large number of scraping agents violate the restrictions set by vendors and content providers. This is a very important issue and raises significant legal concerns. Undoubtedly, there is an ongoing war between bots and websites with strict terms of use. The latter deploy various technical measures to stop robots (an excellent white paper about detecting and blocking site scraping attacks is here) and sometimes even take legal action and resort to the courts. There have been many cases over the last few years, with contradictory decisions, so the whole issue remains quite unclear. You can read more about it in the relevant section of the "Web scraping" Wikipedia article. Both sides have their arguments, so it is not at all an easy verdict.
    The DEiXTo command line executor by default respects the robots.txt file of potential target websites (through the use of the WWW::RobotRules Perl module). Nevertheless, you can override this configuration (at your own risk!) by setting the -nice parameter to 0. It is strongly recommended though that you comply with webmasters' requests and keep out of pages that have access restrictions.
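If you are writing your own Perl scraping scripts, here is a minimal sketch (not part of DEiXTo itself; the URLs and the agent name are placeholders) of how the WWW::RobotRules module can be used to check whether a page may be fetched:

use strict;
use warnings;
use LWP::Simple;
use WWW::RobotRules;

# Placeholder target page and its robots.txt location
my $page_url   = 'http://example.com/some/page.html';
my $robots_url = 'http://example.com/robots.txt';

# Parse the site's robots.txt for our (hypothetical) user agent name
my $rules = WWW::RobotRules->new('MyScraper/1.0');
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Fetch the page only if the rules allow it
if ($rules->allowed($page_url)) {
    print "Allowed to fetch $page_url\n";
} else {
    print "robots.txt disallows $page_url - keeping out\n";
}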
    Generally speaking, data copyright is a HUGE issue, especially in today's Web 2.0 era, and has sparked endless discussions and spawned numerous articles, opinions, licenses, disputes and legitimacy issues.
    By the way, it is worth mentioning that currently there is a strong movement in favor of openness in data, standards and software. And according to many, openness fosters innovation and promotes transparency and collaboration.
Finally, we would like to suggest that everyone using web data extraction tools comply with the terms of use that websites set and think twice before deploying a scraper, especially if the data is going to be used for commercial purposes. A good practice is to contact the webmaster and ask for permission to access and use their content. Quite often the website itself might be interested in such a cooperation, mostly for marketing and advertising reasons. So, as soon as you get a "green light", start building your scraper with DEiXTo and we are here to help you!

Sunday, December 25, 2011

Downloading images with DEiXTo and wget

Many people often download pictures and photos from various websites of interest. Sometimes, though, the number of images that someone wants to download from certain pages is so large that doing it manually is almost prohibitive. Therefore, an automation tool is often needed to save users time and repetitive effort. Of course, DEiXTo can help towards this goal.
    Let's suppose that you want to get all images from a specific web page (with respect to terms of use). You can easily build a simple extraction rule by pointing at an image, using it as a record instance and setting the IMG rule node as "checked" (right click on the IMG node and select "Match and Extract Content"). The resulting pattern will be like this:
Then, by executing the rule you can extract the "src" attribute (essentially the URI) of each image found on the page and export the results to a txt file, say image_urls.txt. Finally, you can use GNU Wget (a great free command line tool) to retrieve the files. You can download a Windows (win32) version of wget here. For example, on Windows you can then just open a DOS command prompt window, change the current working directory to the folder where wget is stored (via the 'cd' command) and enter:
wget.exe -i image_urls.txt 
where image_urls.txt is the file containing the URIs of images. And voilà! The wget utility will download all the images of the target page for you!
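If you would rather stay within Perl instead of using wget, a rough equivalent (assuming the image_urls.txt file produced above, with one URI per line) could look like this:

use strict;
use warnings;
use LWP::Simple;
use File::Basename;

# Read the image URIs exported by the DEiXTo wrapper (one per line)
open my $fh, '<', 'image_urls.txt' or die "Cannot open image_urls.txt: $!";
while (my $url = <$fh>) {
    chomp $url;
    next unless $url;
    my $file = basename($url);        # save under the image's own file name
    my $status = getstore($url, $file);
    print is_success($status) ? "Saved $file\n" : "Failed ($status): $url\n";
    sleep 1;                          # be polite to the target server
}
close $fh;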
    What about getting images from multiple pages? You will have to explicitly provide the target URLs either through an input txt file or via a list. Both ways can be specified in the Project Info tab of the DEiXTo GUI tool.
Thus, if you have the target URLs at hand, or you can extract them with another wrapper (generating a txt file), then you can just pass them as input to the new image wrapper and the latter will do the laborious work for you.
In case all the above are a bit unclear, we have built a sample wrapper project file (imdb_starwars.wpf) that downloads all Star Wars (1977) thumbnail photos from the corresponding IMDb page. Please note that we set the agent to follow the Next page link so as to gather all thumbnails, since they are scattered across multiple pages. However, if you would like to get the full-size photos, you will have to add another scraping layer for extracting the links of the pages containing them.
Anyway, in order to run the sample wrapper for the thumbnails, you should open the wpf (through the Open button in the Project Info tab) and then press the "Go!" button. Alternatively, you can use the command line executor at a DOS prompt:
deixto_executor.exe imdb_starwars.wpf
Finally, you will have to pass the image_urls.txt output file to wget in order to download all thumbnails and get the job done! May the Force be with you! :)

Thursday, December 22, 2011

Can DEiXTo power mobile apps? Yes, it can!

Web content scraped with DEiXTo can be presented in a wide variety of formats. However, the most common choice is probably XML since it facilitates heavy post-processing and further transformations so as to make the data suit your needs. A potentially interesting scenario would be to output bits of interest from a target website into an XML file and then transform it to HTML through XSLT (Extensible Stylesheet Language Transformations). This could be very practical and useful for creating in real time a customized, "shortened" version of a target web page specifically for mobile devices (e.g. Android and iOS devices).
As you all know, smartphones and tablets have changed the computing world over the last few years. So, we thought it would be challenging, and hopefully useful, to build a web service capable of repurposing specified pages on the fly (through the use of a DEiXTo-based agent), keeping only the important/interesting stuff and returning it in a mobile-compatible fashion, suitable for small screens, by harnessing XML, XSLT and CSS. We did not fully implement the service, but we built a simple prototype to try out our idea. And the results were quite encouraging!

For the needs of our demo we used greektech-news, a technology news blog covering a plethora of interesting and fun topics around the IT industry. In the context of the demo, we supposed that we wanted to scrape the articles of the home page. So, we built a quick test scraper able to extract all the records found and generate an XML document with the captured data. With the use of an elegant XSLT stylesheet and a CSS we achieved a nice, usable and easy to navigate structure, suitable for a smartphone screen (illustrated in the picture above). You can see live how the output XML file (containing 15 sample headlines) looks on an online iPhone simulator at the following address:
The concept of the proposed web service is the following: suppose that you are an app developer or a website owner/administrator and that you need to display content inside your app or the mobile version of your site, either from a website under your control (meaning that you have access to its backend) or from another, "external" site (respecting copyright and access restrictions). Often, though, it is not easy to retrieve data from the target website, or you simply do not know how to do it. Therefore, a service that could listen to requests for certain pages/URIs and return their important data in a suitable form could potentially be very useful. For example, ideally an HTTP request like http://deixto.com/webservice.pl?uri="http://example.com/.." would result in a good-looking XML chunk (formatted with XSLT and CSS) containing the data scraped from the original page (specified with the uri parameter).
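We have not released such a service, but as a rough sketch of the server-side transformation step, the following Perl fragment (the file names are hypothetical) shows how the XML produced by a DEiXTo wrapper could be turned into mobile-friendly HTML with XML::LibXSLT:

use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Hypothetical file names: the XML output of the wrapper and the
# XSLT stylesheet that turns it into a small-screen-friendly page
my $xml_file  = 'headlines.xml';
my $xslt_file = 'mobile.xsl';

my $source     = XML::LibXML->load_xml(location => $xml_file);
my $style_doc  = XML::LibXML->load_xml(location => $xslt_file, no_cdata => 1);
my $xslt       = XML::LibXSLT->new();
my $stylesheet = $xslt->parse_stylesheet($style_doc);

# Apply the transformation and emit the resulting markup
my $result = $stylesheet->transform($source);
print $stylesheet->output_as_bytes($result);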
Finally, we would like to bring forward the fact that DEiXToBot contains best-of-breed Perl technology and allows extensive customization. Thus, it facilitates tailor-made solutions so as to make the captured data fully fit your project's needs. And in this direction, deploying XSLT and XML-related technologies in general can really boost the utility and value of scraping, and of DEiXTo in particular!

Saturday, December 17, 2011

APIs vs Scraping - Cl@rity & Yperdiavgeia

Typically there are two main mechanisms to search and retrieve data from a website: either through an Application Programming Interface commonly known as an API (if available) or via screen scraping. The first one is better, faster and more reliable. However, there is not always a search API available or perhaps even if there exists one, it may not fully cover your needs. In such cases, web robots, also called agents, are usually used in order to simulate a person searching the target website/ online database through a web browser and capture bits of interest by utilizing scraping techniques.

An API that has attracted some attention over the last few months in Greece is the Opendata API offered by the "Cl@rity" program ("Διαύγεια" in Greek). Since the 1st of October 2010, all Greek Ministries have been obliged to upload their decisions and expenditures to the Internet through the Cl@rity program. Cl@rity is one of the major transparency initiatives of the Ministry of Interior, Decentralization and e-Government. Each document uploaded is digitally signed and automatically assigned a unique transaction number by the system.
The Opendata API offers a variety of search parameters, such as organization, type, tag (subject), ada (the unique number assigned), signer and date. However, a lot of parameters and functionality are still missing, such as full text search, as well as searching by certain criteria like the beneficiary's name, VAT registration number (ΑΦΜ in Greek), document title and other metadata fields.
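Just to give a feeling of what calling such an API looks like from Perl, here is a minimal sketch; note that the endpoint URL and the exact parameter names/values below are placeholders, so please consult the official Opendata API documentation for the real ones:

use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Placeholder endpoint and parameters - check the official documentation
my $uri = URI->new('http://opendata.diavgeia.gov.gr/search');
$uri->query_form(
    org  => 'SOME_ORGANIZATION_ID',   # organization (placeholder value)
    type => 'SOME_DECISION_TYPE',     # decision type (placeholder value)
);

my $ua = LWP::UserAgent->new( agent => 'MyClient/1.0' );
my $response = $ua->get($uri);
if ($response->is_success) {
    print $response->decoded_content; # the API's response document
} else {
    die 'Request failed: ' . $response->status_line;
}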

A remarkable alternative for searching effectively through the documents of Greek public organizations is yperdiavgeia.gr ("ΥπερΔιαύγεια" in Greek), a web-based platform built by Vangelis Banos, an expert in digital libraries and institutional repositories. Yperdiavgeia is a mirror site of Cl@rity that is updated on a daily basis, and it provides a powerful and robust OpenSearch API which is far more usable and easy to harness. Its great advantage is that it facilitates full text search. Currently it lacks support for some parameters, but it seems these will be added soon, since the platform is under active development.
Even though both APIs mentioned above are really remarkable (especially for communicating and exchanging data with third-party programs), there is still some room for utilizing scraping techniques and coming up with some "magic". In a previous post we described in detail an application we developed for downloading a user-specified number of the latest PDF documents that a specific organization has uploaded to Cl@rity. We believe that this little utility (offering both a GUI and a command line version) can be quite useful for many people working in the public sector and can potentially save a lot of time and effort. For further information about it, please check out this post (although it is written in Greek).

So, in this short post we just wanted to point out that there are quite a lot of great APIs out there, provided mostly by large organizations (e.g. firms, governments, cultural institutions and digital libraries/collections) as well as the major players of the IT industry such as Google, Amazon, etc., offering amazing features and functionality. Nevertheless, scraping the native web interface of a target site can still be useful and sometimes yields a solution that overcomes difficulties and/or inefficiencies of APIs and produces an innovative outcome. Moreover, there are numerous websites that do not offer an API at all, so a scraper could be deployed whenever data searching, gathering or exporting is needed. Therefore, the "battle" between APIs and scraping still rages, and we are eager to see how things will evolve. Truth be told, we love them both!

Tuesday, December 13, 2011

DSpace & Institutional Repositories

Institutional Repositories (IRs) have emerged over the last few years and have become very popular in the academic library world. The system that has dominated the "market" is DSpace, an exciting, functionality-rich, open source software package installed at over 1,100 institutions around the globe. It offers an OAI-PMH web service for harvesting the repository's metadata and getting its entire contents in Dublin Core format. However, OAI-PMH does not provide advanced search by certain criteria such as title, author or supervisor. Even the REST API, which is under construction, does not facilitate searching by these metadata fields, at least to the best of our knowledge. Moreover, the default DSpace OpenSearch support still seems incomplete and a bit buggy. Therefore, a potential solution for searching a DSpace repository in real time could be to submit a query and scrape the results returned through its native web interface. This could be useful for building a federated search engine or perhaps for creating a mobile app (currently there is no mobile version of DSpace).
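To illustrate the kind of access OAI-PMH does provide, here is a minimal harvesting sketch; the base URL is a placeholder (the /oai/request path is only the typical DSpace default), while the verb and metadataPrefix parameters are standard OAI-PMH:

use strict;
use warnings;
use LWP::UserAgent;

# Placeholder repository; /oai/request is merely the usual DSpace path
my $base = 'http://dspace.example.org/oai/request';

# Standard OAI-PMH request: list all records in Dublin Core format
my $url = "$base?verb=ListRecords&metadataPrefix=oai_dc";

my $ua = LWP::UserAgent->new( agent => 'OAIHarvester/0.1' );
my $response = $ua->get($url);
die 'OAI-PMH request failed: ' . $response->status_line unless $response->is_success;

# The response is an XML document with the harvested metadata; note that
# the protocol offers bulk harvesting, not field-specific searching
print $response->decoded_content;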

Given the lack of an "advanced", multiple-criteria search API/mechanism in DSpace, we thought it would be interesting and perhaps useful to write a test scraper that could submit queries to a DSpace repository and fetch the search results through its website. So, we built a simple, DOM-based extraction rule (wrapper) with the DEiXTo GUI tool and then wrote a short DEiXToBot-based script that submits a sample query to Psepheda (the IR of the University of Macedonia) and scrapes the results returned. The following picture illustrates the first 10 results for a sample query by title.

To get a better idea of how a Perl, DEiXToBot-based script works, below you can find the code that scrapes the first 10 items containing "programming" in their title. The pattern used captures five metadata fields: detailed URL, title, date, authors and supervisor, and prints them on the screen. Of course, this script can easily be extended to submit user-specified queries as well as to navigate through all the result pages by following the Next page link ("επόμενη" is the inner text of this link in Greek), as sketched further below.

use strict;
use warnings;
use DEiXToBot;
use Encode;

# Create a DEiXToBot agent (essentially a WWW::Mechanize-based browser emulator)
my $agent = DEiXToBot->new();
# Fetch the result page of a sample title query on the Psepheda repository
$agent->get('http://dspace.lib.uom.gr/simple-search?query=((title:programming))');
# Load the extraction pattern previously built with the DEiXTo GUI tool
$agent->load_pattern('dspace_pattern.xml');
# Ignore em tags so that term highlighting inside results does not break the pattern
$agent->ignore_tags( [ 'em' ] );
# Build the DOM tree of the page and apply the extraction rule
$agent->build_dom();
$agent->extract_content();
# Print the five captured fields of each extracted record, UTF-8 encoded
for my $record (@{$agent->records}) {
    print encode_utf8(join("\n",@{$record})),"\n\n";
}

DEiXToBot is written in Perl, so it is portable and can run on multiple operating systems, provided you have all the prerequisite Perl modules installed. You can download the lines of code given above along with the necessary pattern by clicking here. This short script serves as a good, simple example of harnessing the power and flexibility of DEiXToBot (a Mechanize agent object, essentially a browser emulator, which is able to execute patterns/extraction rules previously built with the GUI tool).
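As a rough illustration of the pagination extension mentioned above, the sketch below assumes that DEiXToBot exposes WWW::Mechanize's follow_link method (since it is described as a Mechanize agent) and that build_dom() and extract_content() can simply be re-run after each navigation step:

use strict;
use warnings;
use utf8;
use DEiXToBot;
use Encode;

my $agent = DEiXToBot->new();
$agent->get('http://dspace.lib.uom.gr/simple-search?query=((title:programming))');
$agent->load_pattern('dspace_pattern.xml');
$agent->ignore_tags( [ 'em' ] );

while (1) {
    # Parse the current result page and apply the extraction rule
    $agent->build_dom();
    $agent->extract_content();
    print encode_utf8(join("\n", @{$_})), "\n\n" for @{$agent->records};

    # Follow the Next page link ("επόμενη"); stop when there is none
    my $next = $agent->follow_link( text_regex => qr/επόμενη/ );
    last unless $next;
    sleep 2;    # be gentle with the repository's server
}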

Generally, IRs have huge potential, and in the next few years they are expected to play an increasingly important role in storing and preserving digital content, academic or not. By the way, a great federated search engine harvesting numerous Greek digital libraries and institutional repositories is openarchives.gr, which is mostly based upon OAI-PMH. It harnesses innovative technologies and has grown a lot since 2006, when it was first launched.

Last but not least, DEiXTo was used quite a while ago by "Pantou", the federated search engine of the University of Macedonia, in order to scrape (in real time) multiple online resources simultaneously via their web interface/site. It is worth noting that a predecessor of the current DEiXToBot module was included, back in 2007, in the official dbWiz distribution, a remarkable, open source, federated search software package upon which Pantou was built.

Monday, November 28, 2011

myVisitPlanner & DEiXTo

We are very happy to announce that DEiXTo is going to power myVisitPlanner, an exciting project funded by the Greek Ministry of Education and Lifelong Learning under the national action "COOPERATION 2011". myVisitPlanner is coordinated by the University of Macedonia and aims at creating a personalized system for planning cultural itineraries. The consortium participants include the Athena Research and Innovation Center, GNOMON Informatics SA, the Ethnological Museum of Thrace and the West Macedonia Development Company (ANKO) SA.

DEiXTo-based wrappers are going to be deployed in order to automatically retrieve regional cultural itineraries and points of interest from various, heterogeneous target websites. For a pre-defined set of webpages, DEiXTo scrapers will run periodically and monitor them for new content and upcoming cultural events. However, besides the traditional DOM-based tree patterns, new "smart" and innovative techniques, such as text mining and NLP algorithms, will be used so as to also cope with sites of unknown structure and layout.

myVisitPlanner officially started a few months ago and its duration is 36 months. We are really glad that we are participating in this challenging and exciting project and we hope that DEiXTo will help myVisitPlanner towards implementing its ambitious goals.

Tuesday, November 8, 2011

TEL-MAP & DEiXTo

DEiXTo was recently used successfully in the context of the TEL-MAP FP7 project in order to scrape the metadata of several European Technology Enhanced Learning (TEL) projects from the CORDIS website as well as from the European Commission website.

TEL-MAP is a Coordination and Support Action funded by the European Commission under the Technology-Enhanced Learning programme. It is coordinated by Brunel University (London) and focuses on exploratory/roadmapping activities for fundamentally new forms of learning, supporting their adoption via awareness building and knowledge management on the results of EU RTD projects in TEL and socio-economic evaluations in education.
Currently, the TEL-MAP team is building a new portal called "Learning Frontiers". This portal aspires to become a widely recognized, single-point-of-access source of information for all European TEL. Among other things, it offers a Projects space that contains a lot of detailed information about numerous EU-funded TEL projects and their participants (essentially the data scraped from the target web pages). It is really worth noting that, by using the geo-location info of each participant/organization, Learning Frontiers provides an interactive map of TEL in Europe!
We are really happy to have helped the Learning Frontiers portal a little towards the implementation of its ambitious vision to increase EU-wide and global dissemination, adoption and impact of EU TEL. We wish them every success!

Monday, October 31, 2011

DEiXTo & Athos Memory

In the context of our collaboration with the award-winning Veria Central Public Library, a really remarkable Greek online digital collection, Athos Memory, has been scraped with DEiXTo in order to be added to the European Library. Athos Memory has been a giant effort of the monastic community of the Holy Mountain to preserve and disseminate the unique religious tradition of the Eastern Orthodox Church on this peninsula of Chalcidice. Numerous people have worked tirelessly for years to make this endeavour possible. We would really like to congratulate and thank them for their great efforts and for providing open access to this magnificent collection.

The metadata of 27,223 photographs, documents and digitized manuscripts from Athos, the Sacred Mountain of Christianity, have been transformed into the Europeana Semantic Elements (ESE) format so that they could then be inserted into the Hellenic Aggregator's database.

To give you a better idea of the transformation process, check out the picture below. It's a screenshot of a typical item of Athos Memory archives.

Now, after extracting and repurposing its metadata based on Dublin Core, this record takes the following form, suitable for export:
Finally, it should be noted that this was the fourth digital library included in Europeana with the help of DEiXTo. And we are eager to add more online resources and help Europeana further enrich its huge cultural and scientific collection!

Saturday, October 8, 2011

DEiXTo & Veria Central Public Library

One of DEiXTo's most important success stories is our collaboration with the Veria Central Public Library in the context of the EuropeanaLocal project. Veria Central Public Library is a really remarkable library that embraces technology and constitutes a successful model for libraries in Greece and around the world. That is why it received a $1 million international award from the Bill & Melinda Gates Foundation in 2010.

DEiXTo powers the Hellenic Aggregator for Europeana, created by the Veria Central Public Library. DEiXToBot-based Perl scripts have enabled the metadata extraction of the Music Library of Greece "Lilian Voudouri", the Greek Educational TV and the Corgialenios Digital Library in a format suitable for further processing. Once extracted, their rich content was repurposed through customized Perl code and transformed into the Europeana Semantic Elements (ESE) format so that it could then be inserted into the aggregator's database.

This is the reason why DEiXTo was cited at the Symposium "Europeana in Greece" that took place on 19 October 2010 in Athens, Greece, as well as at the 19th Hellenic Academic Libraries Conference (3-5 November 2010, Athens).

Hopefully, more digital libraries/archives will use DEiXTo in the next few months in order to export their metadata to the great Europeana collection (more than 15 million items from 1,500 institutions!). And we are more than glad to help Europeana enrich its content even more!

Saturday, September 17, 2011

DEiXToBot & Lack of JavaScript Support


Perhaps the major drawback of DEiXToBot (the Perl browser emulator object capable of executing patterns generated with the GUI version of DEiXTo) is its lack of JavaScript support, which derives from the fact that WWW::Mechanize does not execute JavaScript. In many cases, though, a solution is possible by figuring out what the JavaScript code is doing and simulating it with Perl. But in certain, more difficult cases that depend heavily on JavaScript, this is very hard, if not impossible, because essentially you cannot reach the actual HTML source code of interest.

However, a workaround that sometimes works is to download the target pages of interest locally (after their JavaScript code has executed) and then pass them to DEiXToBot for offline scraping.
Two remarkable tools for fetching complex, JavaScript-enabled pages for this purpose are:
- Selenium, an amazing web browser automation tool and
- spynner, a powerful web browsing module with Ajax support for Python

Please note that these two great tools also work fine on GNU/Linux, which is really important, especially for server use and scheduled, periodic execution of wrappers.

So, once a target page is stored locally on your disk, the DEiXToBot agent can easily load it through the file:// scheme and extract the bits of interest in the usual manner.
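As a small sketch of this offline workflow (the file path and pattern name are just examples), the agent is pointed at the locally saved page and the rest of the extraction proceeds as usual:

use strict;
use warnings;
use DEiXToBot;

# The page was previously saved to disk (e.g. by Selenium or spynner)
# after its JavaScript had run; the path below is just an example
my $agent = DEiXToBot->new();
$agent->get('file:///home/user/saved_pages/target.html');

# From here on, extraction works exactly as with a live page
$agent->load_pattern('pattern.xml');
$agent->build_dom();
$agent->extract_content();
print join("\n", @{$_}), "\n\n" for @{$agent->records};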

Sunday, June 26, 2011

Downloading PDF files from the Diavgeia website

For almost 9 months now, in the context of the "Diavgeia" (Cl@rity) program, all government bodies, organizations of the narrow and broader public sector, and Independent Authorities have been required to post all of their decisions and expenditures on the Internet, specifically on the Diavgeia website (http://et.diavgeia.gov.gr).

So, on a daily basis, many public sector employees have been tasked with uploading to Diavgeia a large volume of payment orders and decisions in PDF format. Moreover, usually after these have been posted and assigned an ΑΔΑ (Internet Posting Number), the responsible employee has to download them manually, clicking one by one on the corresponding "Λήψη Αρχείου" (Download File) links, and print them, mainly for bureaucratic reasons. This process, however, is time-consuming, especially when the number of files is large. I happen to know this first-hand, being an employee of the International Hellenic University.

The following picture shows some typical records on the Diavgeia site for a particular organization. On the right you can see the hyperlinks for downloading the payment orders.

So, we thought of building a small application that locates the URLs of the PDF files on a Diavgeia results page, e.g. the X most recent files someone uploaded the previous day, and then downloads them in bulk. Such a program could probably help quite a few public sector employees save time and effort.

The application is available both as a command line version (for Windows and Linux) and as a graphical (GUI) version for Windows (for the latter, see the end of this post).

At a Windows command prompt (DOS prompt), run diavgeia-downloader.exe (or diavgeia-downloader-linux in a terminal if you are on Linux), passing the target URL as a parameter and setting the limit parameter of the page URL to the desired number of PDF files. For example, consider the page:
which contains the 50 most recent expenditures/decisions of the Research Committee (ELKE) of the University of Macedonia. To download these files locally (into the folder where the executable resides), it suffices to issue the following command:
The application supports two additional optional parameters (an illustrative invocation follows the list below): [-dir folder] [-sleep N]
  • where folder is the name of the folder in which the files will be saved, and
  • N is the number of seconds to pause between downloads, so as not to put too much load on the Diavgeia server.
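For instance, an illustrative invocation (the URL is just a placeholder for the Diavgeia results page you are interested in) could look like this:
diavgeia-downloader.exe "http://et.diavgeia.gov.gr/...&limit=50" -dir pdfs -sleep 5
which would save the PDF files into a folder named pdfs, pausing 5 seconds between downloads.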
You can download both the source code (in Perl) and the executables (for Windows and Linux respectively)! The program is licensed under the GNU General Public License version 3. For Windows there is also the graphical interface (GUI), which makes all of the above, rather complicated for many, self-evident. Below you can see a screenshot of the GUI tool. Using it is very simple: the user just has to copy-paste the URL of the desired page from the Diavgeia site and press Go!

We hope this application proves useful. Comments and suggestions are welcome!

Wednesday, June 22, 2011

deixto.com/blog

Are you looking for a web content extraction tool to scrape data from websites of interest? DEiXTo can probably help you!

DEiXTo is an ongoing effort that started back in 2007. It is freely available to download and, to our knowledge, it is being used by many individual users as well as several organisations and companies all over the world. Indicatively, DEiXTo's site received more than 5,700 visits from 103 different countries during the last 12 months.

Today, we launch the DEiXTo blog, aiming a) to keep you apprised of the wealth of applications and the utility of this exciting tool, and b) to bring forward interesting topics around web scraping.

We really hope that DEiXTo proves useful for you and that this blog helps you in that direction.