deixto.com/blog: Διαύγεια

Tuesday, April 16, 2013

Visualizing Cl@rity document categories in a pie chart

The "Cl@rity" program of the Hellenic Republic offers a wealth of data about the decisions and expenditure of all Greek ministries and their organizations. It has been operating for more than two years now and is a great source of public data waiting for all of us to explore. However, it has faced many technical problems over the last year because of the large number of documents uploaded daily and the heavy data management cost. Unfortunately, its frontend and search functionality are down most of the time. Thankfully, a private initiative, UltraCl@rity, has emerged in the meantime to offer a great alternative for searching the digitally signed public PDF documents and their metadata, filling the gap left by the Greek government.

As you probably already know, we focus on web scraping and on making use of the extracted information. One of the best ways to exploit data you have gathered with DEiXTo (or another web data extraction tool) is to present it in a comprehensive chart. Hence, we thought it would be interesting to collect the subject categories of the documents published on Cl@rity by a big educational institution like the Athens University of Economics and Business (AUEB) and create a handy pie chart.

The page http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes provides a convenient categorization of AUEB's decisions. With a simple pattern (extraction rule) created in GUI DEiXTo, we captured the categories and the number of documents in each. It was then straightforward to programmatically transform the output data (as of April 16, 2013) into an interactive Google pie chart of the most popular categories using the amazing Google Chart Tools. So, here it is:

[Interactive Google pie chart: most popular AUEB document categories on Cl@rity]
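For readers who want to try the transformation step themselves, here is a minimal sketch in Python. It is not the code we used for the chart; it only assumes that the scraper output has already been loaded into a dictionary of category names and document counts. The function names, the `top_n` cut-off, the "Other" slice and the sample numbers are all hypothetical. The result is the header-plus-rows array that Google Charts' `arrayToDataTable` accepts.

```python
import json


def top_categories(counts, top_n=8, other_label="Other"):
    """Keep the top_n largest categories and fold the rest into one slice.

    Folding the tail into an "Other" slice is an assumption made here to
    keep the pie readable; it is not something the original chart is known
    to have done.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    rows = [[name, number] for name, number in head]
    remainder = sum(number for _, number in tail)
    if remainder:
        rows.append([other_label, remainder])
    return rows


def to_chart_rows(counts, top_n=8):
    """Build the header + rows array for google.visualization.arrayToDataTable."""
    return [["Category", "Documents"]] + top_categories(counts, top_n)


if __name__ == "__main__":
    # Placeholder data, NOT real Cl@rity figures.
    sample = {"Category A": 120, "Category B": 75, "Category C": 40, "Category D": 5}
    print(json.dumps(to_chart_rows(sample, top_n=3), ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into the JavaScript that draws the pie chart, so the scraping side and the charting side stay independent of each other.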

By the way, publicspending.gr and greekspending.com are truly remarkable research efforts that visualize public expenditure data from the Cl@rity project in user-friendly diagrams and charts. Of course, the DEiXTo-based scenario described above is just a simple scraping example. What we would like to point out is that this kind of data transformation could have innovative practical applications and facilitate useful web-based services. In conclusion, Cl@rity (or "Διαύγεια", as it is known in Greek) can be a goldmine: it can spark new innovations and allow citizens, and developers in particular, to dig into open data creatively and in favor of transparency in public life.

Saturday, December 17, 2011

APIs vs Scraping - Cl@rity & Yperdiavgeia

Typically there are two main mechanisms for searching and retrieving data from a website: an Application Programming Interface, commonly known as an API (if available), or screen scraping. The first is better, faster and more reliable. However, a search API is not always available, and even when one exists it may not fully cover your needs. In such cases, web robots, also called agents, are usually used to simulate a person searching the target website or online database through a web browser and to capture the bits of interest with scraping techniques.

An API that has attracted some attention over the last few months in Greece is the Opendata API offered by the "Cl@rity" program ("Διαύγεια" in Greek). Since October 1, 2010, all Greek ministries have been obliged to upload their decisions and expenditure to the Internet through the Cl@rity program. Cl@rity is one of the major transparency initiatives of the Ministry of Interior, Decentralization and e-Government. Each uploaded document is digitally signed and automatically assigned a unique transaction number by the system.

The Opendata API offers a variety of search parameters such as organization, type, tag (subject), ada (the unique number assigned), signer and date. However, a lot of parameters and functionality are still missing, such as full text search and searching by criteria like the beneficiary's name, VAT registration number (ΑΦΜ in Greek), document title and other metadata fields.

A remarkable alternative for searching effectively through the documents of the Greek public organizations is yperdiavgeia.gr ("ΥπερΔιαύγεια" in Greek), a web-based platform built by Vangelis Banos, an expert in digital libraries and institutional repositories. Yperdiavgeia is a mirror site of Cl@rity that is updated daily, and it provides a powerful and robust OpenSearch API that is far more usable and easier to harness. Its great advantage is that it supports full text searching. Currently it lacks support for some parameters, but these seem likely to be added soon since the platform is under active development.

Even though both APIs mentioned above are really remarkable (especially for communicating and exchanging data with third party programs), there is still some room for scraping techniques and for coming up with some "magic". In a previous post we described in detail an application we developed, mostly for downloading a user-specified number of the latest PDF documents that a specific organization has uploaded to Cl@rity. We believe that this little utility (available both with a GUI and as a command line version) can be quite useful for many people working in the public sector and can potentially save a lot of time and effort. For further information about it, please check out this post (although it is written in Greek).

So, in this short post we just wanted to point out that there are quite a lot of great APIs out there, provided mostly by large organizations (e.g., firms, governments, cultural institutions and digital libraries or collections) as well as by major players of the IT industry such as Google and Amazon, offering amazing features and functionality. Nevertheless, scraping the native web interface of a target site can still be useful: it can sometimes overcome the difficulties or inefficiencies of an API and yield an innovative outcome. Moreover, numerous websites do not offer an API at all, so a scraper may be the only option when data needs to be searched, gathered or exported. Therefore, the "battle" between APIs and scraping still rages... and we are eager to see how things will evolve. Truth be told, we love them both!

Sunday, June 26, 2011

Downloading PDF files from the Cl@rity (Διαύγεια) website

For almost 9 months now, as part of the Cl@rity ("Διαύγεια") program, all government bodies, the entities of the core and wider public sector, and the independent authorities have been required to publish all of their decisions and expenditure on the Internet, specifically on the Cl@rity website (http://et.diavgeia.gov.gr).

So, on a daily basis, many public sector employees are tasked with uploading a large volume of payment orders and decisions in PDF format to Cl@rity. What is more, once the documents have been posted and assigned an ADA (Αριθμός Διαδικτυακής Ανάρτησης, the Internet Posting Number), the employee in charge usually has to download them manually, clicking one "Λήψη Αρχείου" (Download File) link after another, and print them, mostly for bureaucratic reasons. This process is time-consuming, especially when the number of files is large. I happen to know this first hand, as an employee of the International Hellenic University.

The image below shows some typical records on the Cl@rity site for an organization. On the right you can see the hyperlinks (links) for downloading the payment orders.

So we thought of building a small application that locates the addresses (URLs) of the PDF files on a Cl@rity results page, e.g. the last X files that someone uploaded the previous day, and then downloads them in bulk. Such a program could probably help quite a few civil servants save time and effort.
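To illustrate the first half of that idea, here is a rough sketch in Python of how PDF links might be collected from a results page. It is not the actual implementation (the real tool is written in Perl), and it rests on an assumption about the markup: that each download link is an ordinary `<a>` element whose `href` path ends in `.pdf`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that point to PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a download link is recognisable by a .pdf path.
        if href and href.split("?")[0].lower().endswith(".pdf"):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # skip repeated links
                self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the PDF URLs found in html, in page order, without duplicates."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

If the site served its files through links that do not end in `.pdf`, the test inside `handle_starttag` would have to be replaced, for example by matching the link text instead.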

The application is available both as a command line version (for Windows and Linux) and with a graphical user interface (GUI) for Windows (for the latter, see the end of the post).

At a Windows command prompt (DOS prompt), run diavgeia-downloader.exe (or diavgeia-downloader-linux in a terminal if you are on Linux), passing the target URL as a parameter and setting the limit parameter in the page URL to the desired number of PDF files. For example, take the page:

http://et.diavgeia.gov.gr/f/pamak/find/unit:4652/from:0/limit:50

which contains the latest 50 expenditures and decisions of the ELKE (Special Account for Research Funds) of the University of Macedonia. To download these files locally (into the folder where the executable resides), it is enough to give the following command:

diavgeia-downloader.exe -url http://et.diavgeia.gov.gr/f/pamak/find/unit:4652/from:0/limit:50
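If you build such target URLs often, a small helper can save some typing. The sketch below is written in Python and is only inferred from the single example URL above: it assumes the path always has the shape organization code, unit, from and limit, and that `from` is a starting offset. Both function names are hypothetical and are not part of the downloader.

```python
import re

# Base path taken from the example URL in this post.
BASE = "http://et.diavgeia.gov.gr/f"


def build_results_url(org, unit, limit, start=0):
    """Assemble a results-page URL shaped like the example above."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"{BASE}/{org}/find/unit:{unit}/from:{start}/limit:{limit}"


def with_limit(url, limit):
    """Return url with its limit:N segment set to the desired number of files."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    new_url, replaced = re.subn(r"/limit:\d+", f"/limit:{limit}", url, count=1)
    if not replaced:
        raise ValueError("the URL has no limit segment to change")
    return new_url


if __name__ == "__main__":
    print(build_results_url("pamak", 4652, 50))
```

The string printed by the example is the same page address that is passed to the -url parameter above.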

The application supports two additional optional parameters: [-dir folder] [-sleep N]

where folder is the name of the folder in which the files will be saved, and
N is the number of seconds to pause between download requests, so that the Cl@rity server is not put under much load.
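The effect of these two options can be sketched in a few lines of Python. This is an illustration of the idea only, not a port of the Perl source: the function names, the default pause of one second and the fallback file name are assumptions. The `fetch` and `sleep` arguments exist so that the loop can be tried without touching the network.

```python
import os
import time
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_bytes(url):
    """Download one URL and return its raw content."""
    with urlopen(url) as response:
        return response.read()


def download_all(urls, folder=".", pause_seconds=1, fetch=fetch_bytes, sleep=time.sleep):
    """Save every URL into folder, pausing between consecutive requests.

    folder plays the role of -dir and pause_seconds the role of -sleep.
    """
    os.makedirs(folder, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # be gentle with the server
        # Name the file after the last path segment of the URL.
        name = os.path.basename(urlsplit(url).path) or f"document-{index + 1}.pdf"
        path = os.path.join(folder, name)
        with open(path, "wb") as handle:
            handle.write(fetch(url))
        saved.append(path)
    return saved
```

Pausing only between requests, rather than after each one, means a single-file download finishes immediately while longer batches are still spread out over time.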

You can download both the source code (in Perl) and the executables (for Windows and Linux respectively)! The program is licensed under the GNU General Public License version 3. For Windows there is also the graphical user interface (GUI), which makes straightforward everything described above that many people may find rather complicated. Below you can see a screenshot of the GUI tool. It is very simple to use: the user just needs to copy and paste the address of the desired page from the Cl@rity site and press Go!

We hope this application proves useful. Comments and suggestions are welcome!