deixto.com/blog: April 2013

Tuesday, April 16, 2013

Visualizing Clarity document categories in a pie chart

The "Cl@rity" program of the Hellenic Republic offers a wealth of data about the decisions and expenditure of all Greek ministries and their organizations. It has been operating for more than two years now and is a great source of public data waiting for all of us to explore. However, it has faced a lot of technical problems over the last year because of the large number of documents uploaded daily and the heavy data management cost. Unfortunately, its frontend and search functionality are not working most of the time. Thankfully, a private initiative, UltraCl@rity, has come up in the meantime to offer a great alternative for searching the digitally signed public PDF documents and their metadata, filling the gap left by the Greek government.

As you probably already know, we focus on web scraping and the utilization of the extracted information. One of the best ways to exploit the data you have gathered with DEiXTo (or another web data extraction tool) is to present it in a comprehensive chart. Hence, we thought it might be interesting to collect the subject categories of the documents published on Cl@rity by a big educational institution like the Athens University of Economics and Business (AUEB) and create a handy pie chart.

This page http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes provides a convenient categorization of AUEB's decisions. With a simple pattern (extraction rule) created with GUI DEiXTo, we captured the categories and the number of documents in each. Then, it was quite easy and straightforward to programmatically transform the output data (as of 16 April 2013) into an interactive Google pie chart of the most popular categories using the amazing Google Chart Tools. So, here it is:
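For anyone wanting to reproduce the transformation step, here is a minimal sketch in Python. It assumes the extracted data is available as tab-delimited lines of the form "category, tab, count"; that layout, the function names and the placeholder category labels are illustrative assumptions, not the script used for the original chart. The output is a set of rows that can be pasted into a Google Charts `arrayToDataTable([...])` call.

```python
import json


def parse_counts(tsv_text):
    """Parse 'category<TAB>count' lines into a dict (assumed input layout)."""
    counts = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, count = line.rsplit("\t", 1)
        counts[label.strip()] = int(count)
    return counts


def top_categories(counts, n=5):
    """Return the n categories with the most documents, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def to_data_table_rows(pairs, header=("Category", "Documents")):
    """Render (label, count) pairs as JavaScript array rows, one per line."""
    rows = [list(header)] + [[label, count] for label, count in pairs]
    # ensure_ascii=False keeps Greek category names readable in the output
    return ",\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


if __name__ == "__main__":
    # Placeholder labels and counts, not real Cl@rity figures.
    sample = "Category A\t12\nCategory B\t40\nCategory C\t7\n"
    print(to_data_table_rows(top_categories(parse_counts(sample), n=2)))
```

Limiting the output to the top `n` categories mirrors the "most popular categories" idea; whether to also add an "Other" slice for the remainder is a design choice left open here.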

By the way, publicspending.gr and greekspending.com are truly remarkable research efforts aiming at visualizing public expenditure data from the Cl@rity project in user-friendly diagrams and charts. Of course, the DEiXTo-based scenario described above is just a simple scraping example. What we would like to point out is that this kind of data transformation could have some innovative practical applications and facilitate useful web-based services. In conclusion, Cl@rity (or "Διαύγεια" as it is known in Greek) can be a goldmine, spark new innovations and allow citizens, and developers in particular, to dig into open data in a creative fashion and in favor of the transparency of public life.

Saturday, April 13, 2013

Fuel price monitoring & data visualization

Recently, we stumbled upon a very useful public, web-based service, the Greek Fuel Prices Observatory ("Παρατηρητήριο Τιμών Υγρών Καυσίμων" in Greek). Its main objective is to allow consumers to find out the prices of liquid fuels per type and geographic region. Having a wealth of fuel-related information at their disposal, one could build some innovative services (e.g. taking advantage of the geo-location of gas stations), find interesting stats or create meaningful charts.

One of the site's most interesting pages is the one that lists the min, max and mean prices over the last 3 months: http://www.fuelprices.gr/price_stats_ng.view?time=1&prodclass=1

However, the hyperlink at the bottom right corner of the page (with the text "Γραφήματος"), which is supposed to display a comprehensive graph, returns an HTTP Status 500 error message instead (at least as of 13 April 2013). So, we could not resist scraping the data from the table with DEiXTo and presenting it nicely with a Google line chart after some post-processing. We used a regular expression to isolate the date, reversed the order of the records found (so that the list is sorted chronologically, oldest first), replaced commas in prices with dots (as the decimal mark) and wrote a short script to produce the necessary lines for the arrayToDataTable method call of the Google Visualization API. After that, it was pretty straightforward to create the following:
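The post-processing steps described above can be sketched roughly as follows in Python. The date pattern (dd/mm/yyyy), the four-column record layout and the sample prices are assumptions made for illustration; the real table may differ, and this is not the original script.

```python
import json
import re

# Assumed date format: day/month/year somewhere inside the scraped cell text.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def clean_record(raw_date, raw_min, raw_max, raw_mean):
    """Isolate the date and convert comma-decimal prices to floats."""
    match = DATE_RE.search(raw_date)
    if match is None:
        raise ValueError("no date found in %r" % raw_date)
    prices = [float(p.strip().replace(",", ".")) for p in (raw_min, raw_max, raw_mean)]
    return [match.group(0)] + prices


def to_chart_rows(records):
    """Reverse newest-first records and render rows for arrayToDataTable."""
    header = ["Date", "Min", "Max", "Mean"]
    cleaned = [clean_record(*record) for record in reversed(records)]
    return ",\n".join(json.dumps(row) for row in [header] + cleaned)


if __name__ == "__main__":
    # Made-up values in the assumed newest-first order, not real prices.
    scraped = [
        ("Fri 12/04/2013", "1,650", "1,799", "1,702"),
        ("Thu 11/04/2013", "1,655", "1,801", "1,705"),
    ]
    print(to_chart_rows(scraped))
```

The printed rows are meant to be pasted between the brackets of `google.visualization.arrayToDataTable([...])` in the page that draws the line chart.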

Generally, there are various remarkable data visualization tools out there (one of the best is Google Charts, of course) but we would not like to elaborate further on this now. Nevertheless, we would like to emphasize that once you have rich and useful web data in hand, you can exploit it in a wide variety of ways and come up with smart methods to analyze, use and present it. Your imagination is the only limit (along with the copyright restrictions).

Friday, April 12, 2013

Data migration through browser automation

As we have already mentioned quite a few times, browser automation can have a lot of practical applications ranging from testing of web applications to web-based administration tasks and web scraping. The latter (scraping) is our field of expertise and Selenium is our tool of choice when it comes to automated browser interaction and dealing with complex, JavaScript-rich pages.

A very interesting scenario (among others) of combining our beloved web scraping tool, DEiXTo, with Selenium could be data migration. Imagine for example that you have an osCommerce online store and you would like to migrate it to a Joomla VirtueMart e-commerce system. Wouldn't it be great if you could scrape the product details from the old online catalogue through DEiXTo and then automate the data entry labor via Selenium? Once we have the data at hand in a suitable format, e.g. comma- or tab-delimited text or XML, we could then write a script that would repeatedly visit the online data entry form (in the administration environment of the new e-shop), fill in the necessary fields and submit it (once for each product) so as to automatically insert all the products into the new website.
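As a rough sketch of such a data entry loop, the Python code below reads tab-delimited product data and fills in a form once per product. Everything specific in it is hypothetical: the column names, the form URL, and the element ids would have to match the real administration form, and logging in to the admin area is left out. The `driver` argument is expected to behave like a Selenium WebDriver (`get`, `find_element`), so the same functions can be exercised with a stand-in object before pointing them at a live site.

```python
import csv
import io


def read_products(tsv_text):
    """Read tab-delimited product data (first line = column names)."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def enter_product(driver, form_url, product, field_ids, submit_id):
    """Open the data entry form, fill each mapped field and submit it."""
    driver.get(form_url)
    for column, element_id in field_ids.items():
        # "id" is the locator strategy string behind Selenium's By.ID.
        field = driver.find_element("id", element_id)
        field.clear()
        field.send_keys(product[column])
    driver.find_element("id", submit_id).click()


def migrate(driver, form_url, products, field_ids, submit_id):
    """Submit the form once per product; return how many were entered."""
    for product in products:
        enter_product(driver, form_url, product, field_ids, submit_id)
    return len(products)
```

With Selenium installed, a real browser session could be passed in, for example `driver = webdriver.Firefox()` after `from selenium import webdriver`, together with a mapping such as `{"name": "product_name", "price": "product_price"}` from scraped columns to the (hypothetical) form element ids.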

This way you can save a lot of time and effort and avoid messing with complex data migration tools (which are very useful in many cases). Important: of course we don't claim that migrating databases through scraping and automated data entry is the best solution. However, it's a nice and quick alternative approach for several cases, especially relatively simple ones. The big advantage is that you don't even need to know the underlying schemas of the two systems under consideration. The only condition is to have access to the administrator interface of the new system.

By the way, below you can see a screenshot from Altova MapForce, maybe the best (but not free) data mapping, conversion and integration software tool out there.

Generally speaking, the uses and applications of web data extraction are numerous. You can check out some of them here. Perhaps you are about to think of the next one, and we would be glad to help you with the technicalities!