deixto.com/blog: May 2013

Saturday, May 4, 2013

Creating a complete list of FLOSS Weekly podcast episodes

It was not until recently that I discovered podcasts and started subscribing to them. I wish I had done so earlier, but lack of time (mostly) kept me away, although we should always try to find time to learn and explore new things and technologies. So I was very excited when I ran across FLOSS Weekly, a popular podcast about Free/Libre Open Source Software (FLOSS) from the TWiT Network. Currently, the lead host is Randal Schwartz, a renowned Perl hacker and programming consultant. As a Perl developer myself, I need hardly say that I greatly admire and respect him. FLOSS Weekly debuted back in April 2006 and, as of 4 May 2013, it features 250 episodes! That's a lot of great stuff to explore.

Inevitably, if you don't have the time to listen to them all and have to choose only some, you need to browse through all the listing pages (each containing 7 episodes) to find those that interest you most. As I write this post, one would have to visit 36 pages (by repeatedly clicking on the NEXT page link) to get a complete picture of all the subjects discussed. Consequently, it's not easy to quickly locate the episodes you find most interesting and compile a To-Listen (or To-Watch, if you prefer video) list. I am not 100% sure that no such thing is available on the twit.tv website, but I was not able to find a full episode list in a single place or on a single page. Therefore, I thought that a spreadsheet (or, even better, a JSON document) containing the basic info for each episode (title, date, link and description) would come in handy.

Hence, I used my beloved home-made scraping tool, DEiXTo, to extract the episode metadata so that one can have a convenient, compact view of all available topics and decide more easily which ones to choose. It was really simple to build a wrapper for this task, and in a few minutes I had the data at hand (in a tab-delimited text file). It was then straightforward to import it into an Excel spreadsheet (you can download it here). Moreover, with a few lines of Perl code the scraped data was transformed into a JSON file (with all the advantages this brings), suitable for further use.
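For readers curious about the tab-delimited-to-JSON step, here is a minimal sketch of the idea. It is written in Python purely for illustration and is not the Perl code used for this post. The column order (title, date, link, description) is an assumption based on the fields listed above, and the sample row is made up.

```python
import csv
import io
import json

# Assumed column order, based on the fields named in the post.
FIELDS = ["title", "date", "link", "description"]


def tsv_to_json(tsv_text: str) -> str:
    """Convert tab-delimited episode rows into a JSON array of objects."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    episodes = [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if any(cell.strip() for cell in row)  # skip blank lines
    ]
    return json.dumps(episodes, indent=2, ensure_ascii=False)


# Made-up sample row, for illustration only.
sample = (
    "Example Episode\t2013-05-01\thttp://example.com/ep1\t"
    "A placeholder description.\n"
)
print(tsv_to_json(sample))
```

Each row becomes one JSON object keyed by field name, which makes the list easy to filter or search by title or date afterwards.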
Check FLOSS Weekly out! You might find several great episodes that could enlighten you and bring amazing tools and technologies to your attention. As a free software supporter, I highly recommend it (even though I discovered it a few years late; hopefully it's never too late).

Friday, May 3, 2013

Scraping the members of the Greek Parliament

The Hellenic Parliament is the supreme democratic institution that represents Greek citizens through an elected body of Members of Parliament (MPs). It is a legislature of 300 members, elected for a four-year term, that submits bills and amendments. Its website, www.hellenicparliament.gr, has a lot of interesting data that could be useful to ordinary citizens, to professionals such as journalists and lawyers, to the media and to businesses.

Inspired by existing scrapers for many of the world's parliaments, like those on ScraperWiki (an amazing web-based scraping platform), we decided to write a simple yet efficient DEiXToBot-based script. It gathers information (such as the full name, constituency and contact details) from the CV pages of Greek MPs and, after some post-processing (e.g. deducing the party to which an MP belongs from the logo in the party column), exports it to a tab-delimited text file that can easily be imported into an ODF spreadsheet or a database. The script uses a tree pattern previously built with the GUI DEiXTo tool to identify the data of interest, and visits all 30 target pages (each containing ten records) by means of the pageNo URL parameter. It should also be noted that we used Selenium, our favorite browser automation tool, for this purpose. The results of running the script can be found in this .ods file. If you would like to take a look at the Perl code that got the job done, you can download it here.
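The two mechanical parts of that workflow, paging through the listing with pageNo and deducing the party from a logo, can be sketched roughly as follows. This is a Python illustration only, not the DEiXToBot/Perl script itself. The base URL, the assumption that pages are numbered from 1, the logo file names and the party names are all hypothetical placeholders; only the pageNo parameter name and the 300 MPs at ten per page come from the description above.

```python
import math
import posixpath
from urllib.parse import urlencode, urlsplit

TOTAL_MPS = 300  # from the post
PER_PAGE = 10    # from the post


def page_urls(base_url, total=TOTAL_MPS, per_page=PER_PAGE):
    """Build one listing URL per page using the pageNo parameter.

    Assumes pages are numbered starting at 1 (not confirmed by the post).
    """
    pages = math.ceil(total / per_page)
    return [f"{base_url}?{urlencode({'pageNo': n})}" for n in range(1, pages + 1)]


# Hypothetical logo file names mapped to placeholder party names.
# The real values would come from inspecting the party column of the site.
LOGO_TO_PARTY = {
    "party_a.png": "Party A",
    "party_b.png": "Party B",
}


def party_from_logo(logo_src):
    """Deduce the party name from the file name of the logo image."""
    filename = posixpath.basename(urlsplit(logo_src).path).lower()
    return LOGO_TO_PARTY.get(filename, "Unknown")


def to_tsv_row(full_name, constituency, logo_src):
    """Format one MP record as a tab-delimited line."""
    return "\t".join([full_name, constituency, party_from_logo(logo_src)])


# Hypothetical base URL, used here only to show the shape of the result.
urls = page_urls("http://www.example.org/mps")
print(len(urls), urls[0], urls[-1])
print(to_tsv_row("Jane Doe", "Athens A", "/img/party_b.png"))
```

Keeping the logo-to-party mapping in a plain dictionary means an unrecognised logo falls back to "Unknown" instead of stopping the run, so a new logo on the site is easy to spot in the output.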

Open data (data that is free for use, reuse and redistribution) is a goldmine that can stimulate innovative ways to discover knowledge and analyze the rich data sets available on the World Wide Web. Scraping is an invaluable tool that can help in this direction and serve transparency and openness. There is currently a wide variety of remarkable web data extraction tools, quite a few of them free. Perhaps you would like to give DEiXTo a try and start building your own web robots to get the data you need and transform it into a suitable format for further use.
In conclusion, scraping has numerous uses and applications, and there is a good chance you could come up with an interesting and creative use case tailored to your requirements. So, if you need any help with DEiXTo or have any questions, please do not hesitate to contact us!