Tuesday, January 21, 2014

Celery task/job queue

Queues are very common in computer science and in real-world programs. A queue is a FIFO data structure: new elements are typically added to the rear, and the first items inserted are the first ones to be removed/served. A nice, thorough collection of queueing systems can be found on queues.io. Task/job queues in particular are used by numerous systems and services, and they apply to a wide range of applications. They can reduce the complexity of system design and implementation, boost scalability and generally offer many advantages. Some of the most popular ones are Celery, RQ (Redis Queue) and Gearman.
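The FIFO behaviour described above can be sketched in a few lines of Python (a toy illustration, not tied to any particular queueing system):

```python
from collections import deque

# A minimal FIFO queue: enqueue at the rear, dequeue from the front.
queue = deque()
for job in ["job-1", "job-2", "job-3"]:
    queue.append(job)           # new elements go to the rear

first_served = queue.popleft()  # the first item inserted is served first
print(first_served)             # job-1
```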

    The one we recently stumbled upon and immediately took advantage of was Celery, which is written in Python and based on distributed message passing. It focuses on real-time operation but supports scheduling as well. The units of work to be performed, called tasks, are executed concurrently on one or more worker servers using multiprocessing. Thus, concurrent workers run in the background waiting for new jobs, and when a task arrives (and its turn comes) a worker processes it. Some of Celery's uses include handling long-running jobs, asynchronous task processing, offloading heavy tasks, job routing, task trees, etc.

    Now let's see a use case where Celery untied our hands. As you might already know, for quite some time we have been developing Python, Selenium-based scripts for web scraping and browser automation. Occasionally, while executing a script, we came across data records/cases that had to be dealt with separately by another process/script. For instance, in the context of the recent e-procurement project, when scraping through DEiXToBot the detail page of a payment (published on the Greek e-procurement platform), you could find a reference to a relevant contract which you would also like to download and scrape. Additionally, this contract could link to a tender notice, and the latter might be a corrected version of an existing tender or connect in turn with another decision/document.
    Thus, we thought it would be handy and more convenient if we could add the unique codes/identifiers of these extra documents to a queue system and let a background worker get the job done asynchronously. It should be noted that the lack of persistent links on the e-procurement website made it harder to download a detail page programmatically at a later stage, since you could access it only by performing a new search with its ID and automating a series of steps with Selenium, depending on the type of the target document.

    So, it was not long before we installed Celery on our Linux server and started experimenting with it. We were amazed by its simplicity and efficiency. We quickly wrote a script that fitted the bill for the e-procurement scenario described in the previous paragraph. The code we wrote provided an elegant and practical solution to the problem at hand and was something like this (note the recursion!):

from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task  # registering the function as a Celery task enables .delay()
def download(id):
    # ... selenium_stuff ...
    if reference_found:
        download.delay(new_id)  # delay() sends a task message to the queue
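Stripped of Celery and Selenium, the recursive pattern above boils down to draining a FIFO queue while enqueuing any newly discovered document IDs. A plain-Python sketch, with a hypothetical reference map standing in for the actual scraping step:

```python
from collections import deque

# Hypothetical reference graph: scraping a document may reveal more IDs
# (payment -> contract -> tender notice, as in the e-procurement case).
references = {"PAY-1": ["CON-7"], "CON-7": ["TEN-3"], "TEN-3": []}

def drain(start_id):
    """Process start_id and every document it transitively references."""
    queue, processed = deque([start_id]), []
    while queue:
        doc_id = queue.popleft()   # a "worker" picks up the next task
        processed.append(doc_id)   # stand-in for the Selenium download
        for new_id in references.get(doc_id, []):
            queue.append(new_id)   # analogous to download.delay(new_id)
    return processed

print(drain("PAY-1"))  # ['PAY-1', 'CON-7', 'TEN-3']
```

With Celery, the queue and the worker loop live outside the script: the broker holds the messages and background workers drain them, so the scraper only ever calls delay().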

    In conclusion, we are happy to have found Celery; it's really promising, and we thought it would be nice to share the news with you. We look forward to using Celery further for our heavy scraping needs and are glad to have added it to our arsenal.

Friday, January 17, 2014

About web proxies

Rightly or wrongly, there are times when one would like to conceal one's IP address, especially while scraping a target website. Perhaps the most popular way to do that is by using web proxy servers. A proxy server is a computer system or an application that acts as an intermediary for requests from clients seeking resources from other servers. Thus, web proxies allow users to mask their true IP and surf anonymously online. Personally, though, we are mostly interested in their use for web data extraction and automated systems in general. So, we searched Google to locate notable proxy service providers, but surprisingly the majority of the results were dubious websites of low trustworthiness and low Google PageRank scores. However, a few stood out from the crowd. We will name two: a) HideMyAss (or HMA for short) and b) Proxify.

    HMA provides (among others) a large real-time database of free working public proxies. These proxies are open to everyone and vary in speed and anonymity level. Nevertheless, free shared proxies have certain disadvantages mostly in terms of security and privacy. They are third-party proxies and HMA cannot vouch for their reliability. On the other hand, HMA offers a powerful Pro VPN service which encrypts your entire internet activity and unlike a web proxy it automatically works with all applications on your computer (whereas web proxies typically work with web browsers like Firefox or Chrome and utilities like cURL or GNU Wget). However, the company's policy and Pro VPN's terms of use are not robots-friendly, so using Pro VPN for web scraping might cause an abuse warning and result in the suspension of the account.

    The second high-quality proxy service that we found was Proxify. They offer three packages: Basic, Pro and SwitchProxy. The latter is very fast and intended for web crawling and automated systems of any scale. Since we are mostly interested in web scraping, SwitchProxy is the tool that suits us best. It provides a rich set of features and gives access to 1296 "satellites" in 279 cities in 74 countries worldwide. They also offer an automatic IP change mechanism that runs either after each request (assigning a random IP address each time) or once every 10 minutes (scheduled rotation). Therefore, it seems a great option for scraping purposes, maybe the best out there. However, it's quite expensive, with plans starting at a minimum cost of $100 per month. Additionally, Proxify provides some nice code examples showing how one could integrate SwitchProxy with one's program/web robot. As far as WWW::Mechanize and Selenium are concerned (our two favorite web browsing tools), it is easy and straightforward to combine them with SwitchProxy.
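To give a flavour of how a scraper routes its traffic through such a service (the proxy endpoint and credentials below are made-up placeholders, and the exact setup for WWW::Mechanize or Selenium differs), here is a minimal sketch using only Python's standard library:

```python
import urllib.request

# Hypothetical proxy endpoint; a real account (e.g. SwitchProxy) supplies
# its own host, port and credentials.
proxies = {
    "http": "http://user:secret@proxy.example.com:8080",
    "https": "http://user:secret@proxy.example.com:8080",
}

handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)
# opener.open("http://example.com/") would now go through the proxy,
# so the target site sees the proxy's IP address instead of ours.
```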

    Finally, we would like to bring forward once again the access restrictions and terms of use that many websites impose. Before launching a scraper make sure you check their robots.txt file as well as their copyright notice. For further information about this topic we wrote a relevant post some time ago, perhaps you would like to check it out too.

Thursday, January 2, 2014

Visualizing e-procurement tenders with a bubble chart

A few weeks ago we started gathering data from the Greek e-procurement platform through DEiXTo, aiming to build an RSS feed with the latest tender notices and to provide a method for automatically retrieving fresh data from the Central Electronic Registry of Public Contracts (CERPC or “Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων” in Greek). For further information you can read this post. Only a few days later, we were happy to find out that the first power user consuming the feed had popped up: yperdiavgeia.gr, a popular search engine indexing all public documents uploaded to the Clarity website.

    So now that we have a good deal of data at hand and systematically ingest public procurement info every single day, we are trying to think of innovative ways to utilise it creatively. They say a picture is worth a thousand words. Therefore, one of the first ideas that occurred to us (inspired by greekspending.com) was to visualize the feed with some beautiful graphics. After a little experimentation with the great D3.js library and some puttering around with the JSON Perl module, we managed to come up with a handy bubble chart which you may check out here: http://deixto.gr/eprocurement/visualize

    Let's note a few things in order to better comprehend the chart:
  • the bigger the budget, the bigger the bubble
  • if you click on a bubble then you will be redirected to the full text PDF document
  • on mouseover a tooltip appears with some basic data fields
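Although our feed-to-chart pipeline is actually written in Perl (with the JSON module), the data transformation behind the bullets above is simple. A hedged Python equivalent, with made-up field names and sample records, produces the kind of hierarchy that D3's pack (bubble) layout conventionally expects: a root node whose children each carry a numeric value that determines bubble size.

```python
import json

# Made-up sample tenders; the real records come from the daily CERPC feed.
tenders = [
    {"title": "Road maintenance", "budget": 250000, "pdf": "http://example.com/a.pdf"},
    {"title": "IT equipment",     "budget": 80000,  "pdf": "http://example.com/b.pdf"},
]

# The pack layout sizes each bubble by "value", hence budget -> value;
# the URL is kept so a click can redirect to the full-text PDF.
chart_data = {"children": [
    {"name": t["title"], "value": t["budget"], "url": t["pdf"]}
    for t in tenders
]}

print(json.dumps(chart_data, indent=2))
```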
    The good news is that this chart will be produced automatically on a daily basis along with the RSS feed. So, one could easily browse through the tenders published on CERPC over the last few days and locate the high-budget ones. Finally, as open data supporters we are very glad to see transparency initiatives like Clarity or CERPC, and we warmly encourage people and organisations to take advantage of open public data and use it for a good purpose. Any suggestions or comments about further use of the e-procurement data would be very welcome!