Tuesday, January 21, 2014

Celery task/ job queue

Queues are very common in computer science and in real-world programs. A queue is a FIFO (first in, first out) data structure: new elements are added at the rear, and the items inserted first are the first to be removed and served. A nice, thorough collection of queueing systems can be found on queues.io. Task/job queues in particular are used by numerous systems and services and apply to a wide range of applications. They can reduce the complexity of system design and implementation, boost scalability and bring many other advantages. Some of the most popular ones are Celery, RQ (Redis Queue) and Gearman.

    The one we recently stumbled upon and immediately took advantage of was Celery, which is written in Python and is based on distributed message passing. It focuses on real-time operation but supports scheduling as well. The units of work to be performed, called tasks, are executed concurrently on one or more worker servers using multiprocessing. Thus, concurrent workers run in the background waiting for new jobs, and when a task arrives (and its turn comes) a worker processes it. Some of Celery's uses are handling long-running jobs, asynchronous task processing, offloading heavy tasks, job routing, task trees, etc.
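
    To make the idea of background workers more concrete, here is a minimal sketch of the pattern using only Python's standard library. It is not Celery and it uses threads rather than multiprocessing or a message broker; the function name run_workers and its parameters are our own illustrative choices.

```python
import queue
import threading


def run_workers(jobs, handler, num_workers=2):
    """Process every job with a small pool of background worker threads."""
    tasks = queue.Queue()              # thread-safe FIFO queue
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()          # blocks until a job arrives
            if job is None:            # sentinel: no more work for this worker
                break
            outcome = handler(job)
            with lock:                 # protect the shared results list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)                # one sentinel per worker
    for t in threads:
        t.join()
    return results


# Completion order depends on scheduling, so we sort before printing.
print(sorted(run_workers([1, 2, 3, 4], lambda n: n * n)))
```

    A real task queue adds what this sketch lacks: a broker that survives restarts, workers on several machines, retries and scheduling.
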

    Now let's see a use case where Celery untied our hands. As you might already know, for quite some time we have been developing Python, Selenium-based scripts for web scraping and browser automation. Occasionally, while executing a script, we came across data records or cases that had to be dealt with separately by another process or script. For instance, in the context of the recent e-procurement project, when scraping through DEiXToBot the detail page of a payment (published on the Greek e-procurement platform) you could find a reference to a relevant contract which you would also like to download and scrape. Additionally, this contract could also link to a tender notice, and the latter may be a corrected version of an existing tender or connect in turn with another decision or document.
    Thus, we thought it would be handy and more convenient if we could add the unique codes/identifiers of these extra documents to a queue system and let a background worker get the job done asynchronously. It should be noted that the lack of persistent links on the e-procurement website made it harder to download a detail page programmatically at a later stage, since you could access it only after performing a new search with its ID and automating a series of steps with Selenium depending on the type of the target document.

    So, it was not long before we installed Celery on our Linux server and started experimenting with it. We were amazed by its simplicity and efficiency. We quickly wrote a script that fitted the bill for the e-procurement scenario described above. The code provided an elegant and practical solution to the problem at hand and was something like this (note the recursion!):

from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://')

@app.task  # register download as a Celery task so that .delay() is available
def download(id):
    ... selenium_stuff ...
    if reference_found:
        download.delay(new_id)  # delay sends a task message
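
    Since documents refer to one another, a task that keeps queueing new IDs could, we assume, end up revisiting the same document if two of them ever link back to each other. The sketch below illustrates one possible guard, a set of already-seen IDs, with a plain in-memory queue standing in for Celery. The REFERENCES graph and the crawl function are hypothetical and only serve the illustration.

```python
from collections import deque

# Hypothetical reference graph: document ID -> IDs of the documents it cites.
REFERENCES = {
    "payment-1": ["contract-7"],
    "contract-7": ["tender-3"],
    "tender-3": ["tender-2", "contract-7"],   # links back, forming a cycle
    "tender-2": [],
}


def crawl(start_id, references):
    """Return document IDs in the order they would be downloaded."""
    pending = deque([start_id])
    seen = {start_id}
    order = []
    while pending:
        doc_id = pending.popleft()             # FIFO: oldest job first
        order.append(doc_id)                   # stand-in for the Selenium download
        for ref in references.get(doc_id, []):
            if ref not in seen:                # skip documents already queued
                seen.add(ref)
                pending.append(ref)
    return order


print(crawl("payment-1", REFERENCES))
```

    With several Celery workers the seen set would have to live somewhere shared, for example in a database, rather than in the memory of a single process.
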

    In conclusion, we are happy to have found Celery. It's really promising and we thought it would be nice to share this news with you. We are looking forward to using Celery further for our heavy scraping needs and we are glad that we added it to our arsenal.

Friday, January 17, 2014

About web proxies

Rightly or wrongly, there are times when one would like to conceal one's IP address, especially while scraping a target website. Perhaps the most popular way to do that is by using web proxy servers. A proxy server is a computer system or an application that acts as an intermediary for requests from clients seeking resources from other servers. Thus, web proxies allow users to mask their true IP and enable them to surf anonymously online. Personally, though, we are mostly interested in their use for web data extraction and automated systems in general. So, we did some searching on Google to locate notable proxy service providers, but surprisingly the majority of the results were dubious websites of low trustworthiness and low Google PageRank scores. However, there were a few that stood out from the crowd. We will name two: a) HideMyAss (or HMA for short) and b) Proxify.

    HMA provides (among others) a large real-time database of free working public proxies. These proxies are open to everyone and vary in speed and anonymity level. Nevertheless, free shared proxies have certain disadvantages mostly in terms of security and privacy. They are third-party proxies and HMA cannot vouch for their reliability. On the other hand, HMA offers a powerful Pro VPN service which encrypts your entire internet activity and unlike a web proxy it automatically works with all applications on your computer (whereas web proxies typically work with web browsers like Firefox or Chrome and utilities like cURL or GNU Wget). However, the company's policy and Pro VPN's terms of use are not robots-friendly, so using Pro VPN for web scraping might cause an abuse warning and result in the suspension of the account.

    The second high-quality proxy service that we found was Proxify. They offer 3 packages: Basic, Pro and SwitchProxy. The latter is very fast and is intended for web crawling and automated systems of any scale. Since we are mostly interested in web scraping, SwitchProxy is the tool that suits us the most. It provides a rich set of features and gives access to 1296 "satellites" in 279 cities in 74 countries worldwide. They also offer an auto IP change mechanism that runs either after each request (assigning a random IP address each time) or once every 10 minutes (scheduled rotation). Therefore, it seems a great option for scraping purposes, maybe the best out there. However, it's quite expensive, with plans starting at $100 per month. Additionally, Proxify provides some nice code examples showing how one could integrate SwitchProxy with a program or web robot. As far as WWW::Mechanize and Selenium are concerned (these two are our favorite web browsing tools), it is easy and straightforward to combine them with SwitchProxy.
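
    For readers who have never routed a script through a proxy, here is a generic sketch using Python's standard urllib. It is not Proxify's or SwitchProxy's API; the helper name make_proxy_opener and the proxy address are placeholders that you would replace with the details your provider gives you.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical endpoint, used here only for illustration.
opener = make_proxy_opener("http://proxy.example.com:8080")

# A call such as opener.open("http://example.com/") would now be sent
# via the proxy instead of directly from your own IP address.
```
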

    Finally, we would like to stress once again the access restrictions and terms of use that many websites impose. Before launching a scraper, make sure you check their robots.txt file as well as their copyright notice. We wrote a relevant post on this topic some time ago; perhaps you would like to check it out too.
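
    Checking robots.txt can be automated. The following sketch uses Python's standard urllib.robotparser on an in-memory sample, so it runs without network access; the helper name allowed, the sample rules and the user agent string are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except paths under /private/.
SAMPLE = """\
User-agent: *
Disallow: /private/
"""

print(allowed(SAMPLE, "my-scraper", "http://example.com/index.html"))
print(allowed(SAMPLE, "my-scraper", "http://example.com/private/data"))
```

    In a real scraper you would point the parser at the site's own robots.txt with set_url() and read() instead of passing the text in by hand.
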

Thursday, January 2, 2014

Visualizing e-procurement tenders with a bubble chart

A few weeks ago we started gathering data from the Greek e-procurement platform through DEiXTo, aiming to build an RSS feed with the latest tender notices and to provide a method to automatically retrieve fresh data from the Central Electronic Registry of Public Contracts (CERPC or “Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων” in Greek). For further information you can read this post. Only a few days later, we were happy to find out that the first power user consuming the feed had popped up: yperdiavgeia.gr, a popular search engine indexing all public documents uploaded to the Clarity website.

    So now that we have a good deal of data at hand and systematically ingest public procurement info every single day, we are trying to think of innovative ways to use it creatively. They say a picture is worth a thousand words. Therefore, one of the first ideas that occurred to us (inspired by greekspending.com) was to visualize the feed with some beautiful graphics. After a little experimentation with the great D3.js library and puttering around with the JSON Perl module, we managed to come up with a handy bubble chart which you may check out here: http://deixto.gr/eprocurement/visualize
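
    The data preparation behind such a chart boils down to turning tender records into JSON with one numeric value per bubble. We did this with Perl's JSON module; the sketch below shows a comparable step in Python. The field names (title, budget, pdf_url), the sample records and the root/children layout are assumptions for illustration, not the actual format of our feed.

```python
import json


def to_bubble_data(tenders):
    """Wrap tender records in a root node whose children carry a numeric value."""
    children = [
        {"name": t["title"], "value": float(t["budget"]), "url": t["pdf_url"]}
        for t in tenders
        if t.get("budget")             # skip records without a budget
    ]
    return {"name": "tenders", "children": children}


# Made-up sample records.
sample = [
    {"title": "Road maintenance", "budget": "120000",
     "pdf_url": "http://example.com/1.pdf"},
    {"title": "Office supplies", "budget": "",
     "pdf_url": "http://example.com/2.pdf"},
]

print(json.dumps(to_bubble_data(sample), ensure_ascii=False, indent=2))
```
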

    Let's note a few things in order to better comprehend the chart.
  • the bigger the budget, the bigger the bubble
  • if you click on a bubble then you will be redirected to the full text PDF document
  • on mouseover a tooltip appears with some basic data fields
    The good news is that this chart will be produced automatically on a daily basis along with the RSS feed.  So, one could easily browse through the tenders published on CERPC over the last few days and locate the high-budget ones. Finally, as open data supporters we are very glad to see transparency initiatives like Clarity or CERPC and we warmly encourage people and organisations to take advantage of open public data and use it for a good purpose. Any suggestions or comments about further use of the e-procurement data would be very welcome!

Tuesday, December 10, 2013

Web archiving and Heritrix

A topic that has gained increasing attention lately is web archiving. In an older post we started talking about it and we cited a remarkable online tool named ArchiveReady that checks whether a web page is easily archivable. Perhaps the most well-known web archiving project is currently the Internet Archive which is a non-profit organization aiming to build a permanently and freely accessible Internet library. Their Wayback Machine, a digital archive of the World Wide Web, is really interesting. It enables users to "travel" across time and visit archived versions of web pages.

    As web scraping aficionados we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is Heritrix, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates WARC (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
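
    To give a feel for that layout, here is a deliberately simplified sketch that builds two WARC-like records and concatenates them. It is not a valid WARC writer: it omits fields a real file needs (such as the record ID and date), and the helper name warc_like_record is our own.

```python
def warc_like_record(record_type, target_uri, payload):
    """Return one record: text headers, a blank line, the data block, two CRLFs."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",   # size of the data block in bytes
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


# Records are simply appended one after another into one long byte string.
archive = b"".join([
    warc_like_record("resource", "http://example.com/a", b"first page"),
    warc_like_record("resource", "http://example.com/b", b"second page"),
])

print(archive.decode("utf-8"))
```
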
    Heritrix offers a basic web based user interface (admin console) to manage your crawls as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases but overall it left us with a sense of obsolescence.

    In our humble opinion (and someone please correct us if we are wrong) the two main drawbacks Heritrix has are: a) lack of distributed crawling support and b) lack of JavaScript/AJAX support. The first one means that if you would like to scan a really big source of data, for example the great Digital Public Library of America (DPLA) with more than 5 million items/pages, then Heritrix would take a lot of time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space was assigned to each of them, it still wouldn't be an optimal solution. From our point of view it would be much better and faster if several cooperating agents on multiple different servers could actually collaborate to complete the task. Therefore, scaling and time issues arise when the number of pages grows very large.
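
    Splitting the URL space among independent crawlers can be as simple as hashing each URL's host name. The sketch below shows one such static partitioning scheme; it is our own illustration, not a Heritrix feature, and the function name assign_crawler is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index so that one host always goes to one crawler."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers


urls = [
    "http://example.com/a",
    "http://example.com/b?page=2",
    "http://example.org/",
]
for url in urls:
    print(assign_crawler(url, 4), url)
```

    A static split like this cannot rebalance when one partition turns out to be much larger than the others, which is one reason cooperating agents that share a common queue look more attractive.
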

    The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides just basic browser functionality and does not include a fully-fledged web browser. Therefore, it cannot efficiently archive pages that use JavaScript/AJAX to populate parts of the page, and so it cannot properly capture social media content.
    We think that both of these issues could be overcome using a cloud, Selenium-based architecture like Sauce Labs (although the cost of an Enterprise plan is a matter that should be considered). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, like Firefox, Chrome or Safari. We have already covered Selenium in previous posts and it is absolutely a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest Web 2.0 developments. What's your opinion?

Monday, September 16, 2013

DEiXTo at BCI 2013

We are pleased to inform you that our short paper titled “DEiXTo: A web data extraction suite” has been accepted for presentation at the 6th Balkan Conference in Informatics (BCI 2013) to be held in Thessaloniki on September 19-21, 2013. The main goal of the BCI series of conferences is to provide a forum for discussions and dissemination of research accomplishments and to promote interaction and collaboration among scientists from the Balkan countries.

So, if you would like to cite DEiXTo in your thesis, project or scientific work, please use the following reference:
F. Kokkoras, K. Ntonas, N. Bassiliades. “DEiXTo: A web data extraction suite”, In proc. of the 6th Balkan Conference in Informatics (BCI-2013), September 19-21, 2013, Thessaloniki, Greece

Monday, September 9, 2013

Using XPath for web scraping

Over the last few years we have worked quite a bit on aggregators that periodically gather information from multiple online sources. We usually write our own custom code and mostly use DOM-based extraction patterns (built with our home-made DEiXTo GUI tool), but we also use other technologies and useful tools, when possible, in order to get the job done and make our scraping tasks easier. One of them is XPath, a query language defined by the W3C for selecting nodes from an XML document. Note that an HTML page (even a malformed one) can be represented as a DOM tree and thus treated like an XML document. XPath is quite effective, especially for relatively simple scraping cases.

    Suppose for instance that we would like to retrieve the content of an article/post/story on a specific website or blog. Of course this scenario could be extended to several posts from many different sources and grow large in scale. Typically the body of a post resides in a DIV (or some other type of) HTML element with a particular attribute value (the same holds for the post title as well). Therefore, the text content of a post is usually included in something like the following HTML segment (especially if you consider that numerous blogs and websites live on platforms like Blogger or WordPress.com and share a similar layout):

<div class="post-body entry-content" ...>the content we want</div>

    DEiXTo tree rules are more suitable and efficient when there are multiple structure-rich record occurrences on a page. So, if you need just a specific div element it's better to stick with an XPath expression. It's pretty simple and it works. Then you could do some post-processing on the scraped data and pass it on to other techniques, e.g. using regular expressions on the inner text to identify dates, places or other pieces of interest, or parsing the outer HTML code of the selected element with a specialised tool looking for interesting stuff. So, instead of creating a rule with DEiXTo for the case described above we could just use an XPath selector: //div[@class="post-body entry-content"] to select the proper element and access its contents. Note the leading double slash, which matches the div at any depth in the document; a single slash would only match a div sitting at the very root.
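
    As a small illustration of the XPath-plus-regex idea, the sketch below uses Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath and requires well-formed markup (real-world HTML would need a more forgiving parser). The sample page and the date pattern are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed page for illustration.
PAGE = """<html><body>
<div class="sidebar">links</div>
<div class="post-body entry-content">Concert on 21/09/2013 in Thessaloniki</div>
</body></html>"""

root = ET.fromstring(PAGE)

# ElementTree expects the relative form ".//" when searching from an element.
# The predicate matches the class attribute as an exact string.
post = root.find('.//div[@class="post-body entry-content"]')
text = "".join(post.itertext()).strip()

# Post-processing step: look for dates of the form dd/mm/yyyy in the text.
dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)

print(text)
print(dates)
```
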

    We actually used this simple but effective technique repeatedly for myVisitPlanner, a project funded by the Greek Ministry of Education aiming at creating a personalised system for planning cultural itineraries. The main content of event pages (related to music, theatre, festivals, exhibitions, etc.) is systematically extracted from a wide variety of local websites (most lacking RSS and APIs) in order to automatically monitor and aggregate event information. We could show you some code to demonstrate how to scrape a site with XPath, but instead we would like to cite an amazing blog dedicated to web scraping which gives a nice code example of using XPath in screen scraping: extract-web-data.com. It provides a lot of information about web data extraction techniques and covers plenty of relevant tools. It's a nice, thorough and well-written read.

    Anyway, if you need some web data just for personal use or because your boss asked for it, why not consider using DEiXTo or one of the other remarkable software tools out there? The use case scenarios are limitless and we are sure you could come up with a useful and interesting one.

Saturday, May 4, 2013

Creating a complete list of FLOSS Weekly podcast episodes

It was not until recently that I discovered and started subscribing to podcasts. I wish I had done so earlier, but the lack of available time (mostly) kept me away from them, although we should always try to find time to learn and explore new things and technologies. So, I was very excited when I ran across FLOSS Weekly, a popular Free Libre Open Source Software (FLOSS) themed podcast from the TWiT Network. Currently, the lead host is Randal Schwartz, a renowned Perl hacker and programming consultant. As a Perl developer myself, needless to say I greatly admire and respect him. FLOSS Weekly debuted back in April 2006 and as of 4th of May 2013 it features 250 episodes! That's a lot of episodes and lots of great stuff to explore.

    Inevitably, if you don't have the time to listen to them all and had to choose only some of them, you would need to browse through all the listing pages (each containing 7 episodes) in order to find those that would interest you most. As I am writing this post one would have to visit 36 pages (by repeatedly clicking on the NEXT page link) to get a complete picture of all subjects discussed. Consequently, it's not that easy to quickly locate the ones that you find more interesting and compile a To-Listen (or To-Watch if you prefer video) list. I am not 100% sure that there is no such thing available on the twit.tv website, but I was not able to find a full episode list in a single place. Therefore, I thought that a spreadsheet (or even better a JSON document) containing the basic info for each episode (title, date, link and description) would come in handy.

    Hence, I used my beloved home-made scraping tool, DEiXTo, to extract the episode metadata so that one can have a convenient, compact view of all available topics and decide more easily which ones to choose. It was really simple to build a wrapper for this task and in a few minutes I had the data at hand (in a tab-delimited text file). Then it was straightforward to import it into an Excel spreadsheet (you can download it here). Moreover, with a few lines of Perl code the scraped data was transformed into a JSON file (with all the advantages this brings) suitable for further use.
    Check FLOSS Weekly out! You might find several great episodes that could enlighten you and bring amazing tools and technologies to your attention. As a free software supporter, I highly recommend it (despite the fact that I discovered it a few years late; hopefully it's never too late).
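
    For those who prefer Python to Perl, the tab-delimited to JSON step could be sketched as follows. The column order (title, date, link, description) is an assumption about the layout of the text file, and the sample row is made up rather than a real episode.

```python
import csv
import io
import json

# Assumed column order of the tab-delimited file.
FIELDS = ["title", "date", "link", "description"]


def tsv_to_json(tsv_text):
    """Convert tab-delimited episode rows into a JSON array of objects."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,        # treat quote characters as plain text
    )
    return json.dumps(list(reader), ensure_ascii=False, indent=2)


# Made-up sample row.
sample = "Episode A\t2013-05-01\thttp://example.com/a\tAbout tool A\n"
print(tsv_to_json(sample))
```
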