Tuesday, December 10, 2013

Web archiving and Heritrix

A topic that has gained increasing attention lately is web archiving. In an older post we started talking about it and we cited a remarkable online tool named ArchiveReady that checks whether a web page is easily archivable. Perhaps the most well-known web archiving project is currently the Internet Archive which is a non-profit organization aiming to build a permanently and freely accessible Internet library. Their Wayback Machine, a digital archive of the World Wide Web, is really interesting. It enables users to "travel" across time and visit archived versions of web pages.


    As web scraping aficionados we are mostly interested in their crawling toolset. The web crawler used by the Internet Archive is Heritrix, a free, powerful Java crawler released under the Apache License. The latest version is 3.1.1 and it was made available back in May 2012. Heritrix creates copies of websites and generates WARC (Web ARChive) files. The WARC format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file.
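    Roughly speaking, a WARC response record starts with a small block of plain-text headers followed by the captured HTTP message; the record below is purely illustrative:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/
WARC-Date: 2013-12-10T12:00:00Z
WARC-Record-ID: <urn:uuid:...>
Content-Type: application/http; msgtype=response
Content-Length: ...

HTTP/1.1 200 OK
... (the captured HTTP headers and page body follow)
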
    Heritrix offers a basic web based user interface (admin console) to manage your crawls, as well as a command line tool that can optionally be used to initiate archiving jobs. We played with it a bit and found it handy for quite a few cases, but overall it left us with the sense that it is showing its age.


    In our humble opinion (and someone please correct us if we are wrong) the two main drawbacks of Heritrix are: a) lack of distributed crawling support and b) lack of JavaScript/AJAX support. The first one means that if you would like to scan a really big source of data, for example the great Digital Public Library of America (DPLA) with more than 5 million items/pages, then Heritrix would take a lot of time since it runs locally on a single machine. Even if multiple Heritrix crawlers were combined and a subset of the target URL space was assigned to each of them, it still wouldn't be an optimal solution. From our point of view it would be much better and faster if several cooperating agents running on multiple servers could split up the task and complete it together. In short, scaling and time issues arise when the number of pages grows very large.


    The second disadvantage, on the other hand, is related to the trend of modern websites towards heavy use of JavaScript and AJAX calls. Heritrix provides just basic browser functionality and does not include a fully-fledged web browser. Therefore, it's not able to efficiently archive pages that use JavaScript/AJAX to populate parts of the page, and consequently it cannot properly capture social media content.
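    To illustrate the difference, a real browser driven by Selenium returns the DOM after the scripts have run, which is exactly what a plain HTTP fetch misses. A minimal Perl sketch (assuming a local Selenium server with Firefox; the URLs are placeholders) could look like this:

use WWW::Selenium;

my $sel = WWW::Selenium->new( host        => 'localhost',
                              port        => 4444,
                              browser     => '*firefox',
                              browser_url => 'http://example.com/' );
$sel->start;
$sel->open('http://example.com/some-ajax-driven-page');   # placeholder URL
sleep 5;                                  # crude: give the AJAX calls a moment to finish
my $rendered = $sel->get_html_source();   # the DOM after the scripts have run
open my $fh, '>:utf8', 'snapshot.html' or die $!;
print $fh $rendered;                      # keep the rendered snapshot for the archive
close $fh;
$sel->stop;
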
    We think that both of these issues could be overcome using a cloud, Selenium-based architecture like Sauce Labs (although the cost of an Enterprise plan is a matter that should be considered). This choice would allow you a) to run your crawls in the cloud in parallel and b) to use a real web browser with full JavaScript support, like Firefox, Chrome or Safari. We have already covered Selenium in previous posts and it is absolutely a great browser automation tool. In conclusion, we recommend Selenium and a different, cloud-based approach for implementing large-scale web archiving projects. Heritrix is quite good and has proved a valuable ally, but we think that other, state-of-the-art technologies are nowadays more suitable for the job, especially with the latest Web 2.0 developments. What's your opinion?

Monday, September 16, 2013

DEiXTo at BCI 2013

We are pleased to inform you that our short paper titled “DEiXTo: A web data extraction suite” has been accepted for presentation at the 6th Balkan Conference in Informatics (BCI 2013), to be held in Thessaloniki on September 19-21, 2013. The main goal of the BCI series of conferences is to provide a forum for discussions and dissemination of research accomplishments and to promote interaction and collaboration among scientists from the Balkan countries.


So, if you would like to cite DEiXTo in your thesis, project or scientific work, please use the following reference:
F. Kokkoras, K. Ntonas, N. Bassiliades. “DEiXTo: A web data extraction suite”, in Proc. of the 6th Balkan Conference in Informatics (BCI-2013), September 19-21, 2013, Thessaloniki, Greece.

Monday, September 9, 2013

Using XPath for web scraping

Over the last few years we have worked quite a bit on aggregators that periodically gather information from multiple online sources. We usually write our own custom code and mostly use DOM-based extraction patterns (built with our home-made DEiXTo GUI tool), but we also use other technologies and useful tools, when possible, in order to get the job done and make our scraping tasks easier. One of them is XPath, a query language defined by the W3C for selecting nodes from an XML document. Note that an HTML page (even a malformed one) can be represented as a DOM tree and therefore treated as an XML document. XPath is quite effective, especially for relatively simple scraping cases.

    Suppose for instance that we would like to retrieve the content of an article/post/story on a specific website or blog. Of course this scenario could be extended to several posts from many different sources and go large in scale. Typically the body of a post resides in a DIV (or some other type of) HTML element with a particular attribute value (the same holds for the post title). Therefore, the text content of a post is usually included in something like the following HTML segment (especially if you consider that numerous blogs and websites live on platforms like Blogger or WordPress.com and share a similar layout):

<div class="post-body entry-content" ...>the content we want</div>

    DEiXTo tree rules are more suitable and efficient when there are multiple structure-rich record occurrences on a page. So, if you need just a specific div element it's better to stick with an XPath expression. It's pretty simple and it works. Then you could do some post-processing on the scraped data and further utilise it with other techniques, e.g. using regular expressions on the inner text to identify dates, places or other pieces of interest, or parsing the outer HTML code of the selected element with a specialised tool looking for interesting stuff. So, instead of creating a rule with DEiXTo for the case described above, we could just use an XPath selector: //div[@class="post-body entry-content"] to select the proper element and access its contents.
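    For a quick taste, here is a minimal Perl sketch of that idea; the module choice (HTML::TreeBuilder::XPath) is ours and the input handling is illustrative:

use HTML::TreeBuilder::XPath;

my $html = do { local $/; <> };   # the page source, e.g. fetched earlier or piped in
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($post) = $tree->findnodes('//div[@class="post-body entry-content"]');
print $post->as_text, "\n" if $post;   # the plain text content of the post body
$tree->delete;                         # free the parse tree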

    We actually used this simple but effective technique repeatedly for myVisitPlanner, a project funded by the Greek Ministry of Education aiming at creating a personalised system for planning cultural itineraries. The main content of event pages (related to music, theatre, festivals, exhibitions, etc.) is systematically extracted from a wide variety of local websites (most lacking RSS feeds and APIs) in order to automatically monitor and aggregate event information. For a more complete code example of using XPath in screen scraping, we would also like to point you to an amazing blog dedicated to web scraping: extract-web-data.com. It provides a lot of information about web data extraction techniques and covers plenty of relevant tools. It's a nice, thorough and well-written read.


    Anyway, if you need some web data just for personal use, or because your boss asked you for it, why not consider using DEiXTo or one of the other remarkable software tools out there? The use case scenarios are limitless and we are sure you could come up with a useful and interesting one.

Saturday, May 4, 2013

Creating a complete list of FLOSS Weekly podcast episodes

It was not until recently that I discovered and started subscribing to podcasts. I wish I had done so earlier, but the lack of available time (mostly) kept me away from them, although we should always try to find time to learn and explore new things and technologies. So, I was very excited when I ran across FLOSS Weekly, a popular Free/Libre and Open Source Software (FLOSS) themed podcast from the TWiT Network. Currently, the lead host is Randal Schwartz, a renowned Perl hacker and programming consultant. As a Perl developer myself, needless to say I greatly admire and respect him. FLOSS Weekly debuted back in April 2006 and as of the 4th of May 2013 it features 250 episodes! That's a lot of episodes and lots of great stuff to explore.


    Inevitably, if you don't have the time to listen to them all and have to choose only some of them, you need to browse through all the listing pages (each containing 7 episodes) in order to find those that interest you most. As I am writing this post, one would have to visit 36 pages (by repeatedly clicking on the NEXT page link) to get a complete picture of all subjects discussed. Consequently, it's not that easy to quickly locate the ones you find most interesting and compile a To-Listen (or To-Watch if you prefer video) list. I am not 100% sure that there is no such thing available on the twit.tv website, but I was not able to find a full episode list in a single place/page. Therefore, I thought that a spreadsheet (or even better a JSON document) containing the basic info for each episode (title, date, link and description) would come in handy.


    Hence, I utilised my beloved home-made scraping tool, DEiXTo, in order to extract the episode metadata so that one can have a convenient, compact view of all available topics and decide more easily which ones to pick. It was really simple to build a wrapper for this task and in a few minutes I had the data at hand (in a tab delimited text file). Then it was straightforward to import it into an Excel spreadsheet (you can download it here). Moreover, with a few lines of Perl code the scraped data was transformed into a JSON file (with all the advantages this brings) suitable for further use.
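    For what it's worth, the JSON conversion needs nothing more than something along these lines (the file names and column order below are assumptions):

use JSON;

open my $in, '<:utf8', 'floss_episodes.txt' or die $!;    # the DEiXTo output (name assumed)
my @episodes;
while (<$in>) {
    chomp;
    my ($title, $date, $link, $description) = split /\t/;  # column order is an assumption
    push @episodes, { title => $title, date => $date,
                      link  => $link,  description => $description };
}
close $in;

open my $out, '>:utf8', 'floss_episodes.json' or die $!;
print $out JSON->new->pretty->encode(\@episodes);
close $out;
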
    Check FLOSS Weekly out! You might find several great episodes that could enlighten you and bring to your attention amazing tools and technologies. As a free software supporter, I highly recommend it (despite the fact that I discovered it with a few years' delay; hopefully it's never too late).

Friday, May 3, 2013

Scraping the members of the Greek Parliament

The Hellenic Parliament is the supreme democratic institution that represents Greek citizens through an elected body of Members of Parliament (MPs). It is a legislature of 300 members, elected for a four-year term, that submits bills and amendments. Its website, www.hellenicparliament.gr, has a lot of interesting data on it that could potentially be useful to ordinary citizens, professionals such as journalists and lawyers, the media, as well as businesses.


    Inspired by existing scrapers for many Parliaments of the world, like those on ScraperWiki, an amazing web-based scraping platform, we decided to write a simple, though efficient, DEiXToBot-based script that gathers information (such as the full name, constituency and contact details) from the CV pages of Greek MPs and exports it (after some post-processing, e.g. deducing the party to which each MP belongs from the logo in the party column) to a tab delimited text file that can then be easily imported into an ODF spreadsheet or a database. The script uses a tree pattern previously built with the GUI DEiXTo tool to identify the data of interest and visits all 30 target pages (each containing ten records) by utilizing the pageNo URL parameter. It should also be noted that for our purposes we used Selenium, our favorite browser automation tool. The results of the script's execution can be found in this .ods file. In case you would like to take a look at the Perl code that got the job done, you can download it here.
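    To give a feel of the approach, the page-by-page harvesting boils down to a loop over the pageNo parameter. The sketch below is indicative rather than the actual script linked above; the constructor call and the listing URL variable are assumptions:

use DEiXToBot;  # the module behind our GUI-built extraction patterns

my $agent = DEiXToBot->new();
my $listing_url = 'http://www.hellenicparliament.gr/...';   # the MPs listing page (elided)
my @mps;
for my $page_no (1 .. 30) {                    # 30 pages, ten MP records each
    $agent->get("$listing_url?pageNo=$page_no");
    next unless $agent->success;
    $agent->load_pattern('mps_pattern.xml');   # tree pattern built with GUI DEiXTo
    $agent->build_dom();
    $agent->extract_content();
    push @mps, @{$agent->records};
}
# post-process @mps (e.g. deduce each party from its logo) and export tab-delimited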


    Open data — data that is free for use, reuse, and redistribution — is a goldmine that can stimulate innovative ways to discover knowledge and analyze rich data sets available on the World Wide Web. Scraping is an invaluable tool that can help in this direction and serve transparency and openness. Currently there is a wide variety of remarkable web data extraction tools (quite a few of them free). Perhaps you would like to give DEiXTo a try and start building your own web robots to get the data you need and transform it into a suitable format for further use.
    In conclusion, scraping has numerous uses and applications and there is a high chance you could come up with an interesting and creative use case scenario tailored to your requirements. So, if you need any help with DEiXTo or have any inquiries, please do not hesitate to contact us!

Tuesday, April 16, 2013

Visualizing Clarity document categories in a pie chart

The "Cl@rity" program of the Hellenic Republic offers a wealth of data about the decisions and expenditure of all Greek ministries and their organizations. It has been operating for more than two years now and it is a great source of public data waiting for all of us to explore. However, it has been facing a lot of technical problems over the last year because of the large number of documents uploaded daily and the heavy data management cost. Unfortunately, its front end and search functionality are not working most of the time. Thankfully, a private initiative, UltraCl@rity, has emerged in the meantime to offer a great alternative for searching the digitally signed public PDF documents and their metadata, filling the gap left by the Greek government.


    As you probably already know we focus on web scraping and the utilization of the information extracted. One of the best ways to exploit the data you might have gathered with DEiXTo (or another web data extraction tool) is presenting it with a comprehensive chart. Hence, we thought it might be interesting to collect the subject categories of the documents published on Cl@rity by a big educational institution like the Athens University of Economics and Business (AUEB) and create a handy pie chart.

    This page, http://et.diavgeia.gov.gr/f/aueb/list.php?l=themes, provides a convenient categorization of AUEB's decisions. Therefore, with a simple pattern (extraction rule) created with GUI DEiXTo, we captured the categories and their document counts. Then, it was quite easy and straightforward to programmatically transform the output data (as of the 16th of April 2013) into an interactive Google pie chart of the most popular categories using the amazing Google Chart Tools. So, here it is:

    By the way, publicspending.gr and greekspending.com are truly remarkable research efforts aiming at visualizing public expenditure data from the Cl@rity project in user-friendly diagrams and charts. Of course the DEiXTo-based scenario described above is just a simple scraping example. What we would like to point out is that this kind of data transformation could have some innovative practical applications and facilitate useful web-based services. In conclusion, Cl@rity (or "Διαύγεια" as it is known in Greek) can be a goldmine, spark new innovations and allow citizens, and developers in particular, to dig into open data in a creative fashion and in favor of the transparency of public life.

Saturday, April 13, 2013

Fuel price monitoring & data visualization

Recently, we stumbled upon a very useful public, web-based service, the Greek Fuel Prices Observatory ("Παρατηρητήριο Τιμών Υγρών Καυσίμων" in Greek). Its main objective is to allow consumers to find out the prices of liquid fuels per type and geographic region. Having a wealth of fuel-related information at one's disposal, one could build some innovative services (e.g. taking advantage of the geo-location of gas stations), find interesting stats or create meaningful charts.


    One of the site's most interesting pages is the one containing the min, max and mean prices over the last 3 months: http://www.fuelprices.gr/price_stats_ng.view?time=1&prodclass=1
However, the hyperlink at the bottom right corner of the page (with the text "Γραφήματος", i.e. "Chart"), which is supposed to display a comprehensive graph, returns an HTTP Status 500 exception message instead (at least as of the 13th of April 2013). So, we could not resist scraping the data from the table with DEiXTo and presenting it nicely with a Google line chart after some post-processing. We used a regular expression to isolate the date, we reversed the order of the records found (so that the list is sorted chronologically, oldest first), we replaced commas in prices with dots (as a decimal mark) and we wrote a short script to produce the necessary lines for the arrayToDataTable method call of the Google Visualization API. Therefore, it was pretty straightforward to put the chart together; a rough sketch of the post-processing follows.
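    The sketch below is indicative rather than the exact script we used; the input layout and the date format are assumptions:

# assumes one tab-delimited line per record holding the raw date followed by
# the min, mean and max price, with a dd/mm/yyyy date buried in the text
my @rows;
while (<>) {
    chomp;
    my ($raw_date, $min, $mean, $max) = split /\t/;
    my ($date) = $raw_date =~ m#(\d{2}/\d{2}/\d{4})#;   # isolate the date with a regex
    tr/,/./ for ($min, $mean, $max);                    # turn decimal commas into dots
    push @rows, "['$date', $min, $mean, $max],";
}
print "$_\n" for reverse @rows;   # oldest record first, ready for arrayToDataTable()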


    Generally, there are various remarkable data visualization tools out there (one of the best is Google Charts, of course), but we would not like to elaborate further on this now. Nevertheless, we would like to emphasize that once you have rich and useful web data in hand, you can exploit it in a wide variety of ways and come up with smart methods to analyze, use and present it. Your imagination is the only limit (along with the copyright restrictions).

Friday, April 12, 2013

Data migration through browser automation

As we have already mentioned quite a few times, browser automation can have a lot of practical applications ranging from testing of web applications to web-based administration tasks and web scraping. The latter (scraping) is our field of expertise and Selenium is our tool of choice when it comes to automated browser interaction and dealing with complex, JavaScript-rich pages. 
    A very interesting scenario (among others) for combining our beloved web scraping tool, DEiXTo, with Selenium could be data migration. Imagine for example that you have an osCommerce online store and you would like to migrate it to a Joomla VirtueMart e-commerce system. Wouldn't it be great if you could scrape the product details from the old online catalogue through DEiXTo and then automate the data entry labor via Selenium? Once we have the data at hand in a suitable format, e.g. comma/tab delimited or XML, we could then write a script that repeatedly visits the online data entry form (in the administration environment of the new e-shop), fills in the necessary fields and submits it (once per product) so as to automatically insert all the products into the new website.
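    Sketched in Perl with WWW::Selenium, the data entry part could look roughly like this (the admin URLs, field locators and helper function are hypothetical):

use WWW::Selenium;

my @products = load_scraped_products();   # hypothetical helper returning the DEiXTo output
my $sel = WWW::Selenium->new( host        => 'localhost',
                              port        => 4444,
                              browser     => '*firefox',
                              browser_url => 'http://shop.example.com/' );
$sel->start;
for my $product (@products) {
    $sel->open('http://shop.example.com/administrator/new_product_form');  # placeholder URL
    $sel->type('id=product_name',  $product->{name});    # locators are hypothetical
    $sel->type('id=product_price', $product->{price});
    $sel->type('id=product_desc',  $product->{description});
    $sel->click('id=save_button');                        # submit, one product at a time
    $sel->wait_for_page_to_load(10000);
}
$sel->stop;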


    This way you can save a lot of time and effort and avoid messing with complex data migration tools (which are very useful in many cases). Important: of course we don't claim that migrating databases through scraping and automated data entry is the best solution. However, it's a nice and quick alternative approach for several cases, especially relatively simple ones. The big advantage is that you don't even need to know the underlying schemas of the two systems under consideration. The only requirement is access to the administrator interface of the new system.
    By the way, below you can see a screenshot from Altova MapForce, maybe the best (but not free) data mapping, conversion and integration software tool out there.


    Generally speaking, the uses and applications of web data extraction are numerous. You can check out some of them here. Perhaps you are about to think the next one and we would be glad to help you with the technicalities!

Thursday, March 28, 2013

How to pass Selenium pages to DEiXToBot

Recently we talked about Selenium and its potential combination with DEiXTo. It is a truly remarkable browser automation tool with numerous uses and applications. For those of you wondering how to programmatically pass pages fetched with Selenium to DEiXToBot on the fly, here is a way (provided you are familiar with Perl programming):

use POSIX qw(tmpnam); # provides tmpnam() for generating a temporary file name
use Fcntl;            # provides the O_RDWR, O_CREAT and O_EXCL flags
use IO::File;

# suppose that you have already fetched the target page with the WWW::Selenium object ($sel variable)
# and that a DEiXToBot agent object has been created beforehand ($agent variable)
my $content = $sel->get_html_source(); # get the page source code

my ($fh,$name); # create a temporary file containing the page's source code
do { $name = tmpnam() } until $fh = IO::File->new($name, O_RDWR|O_CREAT|O_EXCL);
print $fh $content;
close $fh;

$agent->get("file://$name"); # load the temporary file/page with the DEiXToBot agent using the file:// scheme

unlink $name; # delete the temporary file, it is not needed any more

if (! $agent->success) { die "Could not fetch the temp file!"; }

$agent->load_pattern('pattern.xml'); # load the pattern built with the GUI tool

$agent->build_dom(); # build the DOM tree of the page

$agent->extract_content(); # apply the pattern 

my @records = @{$agent->records};
for my $record (@records) { # loop through the data/ records scraped
    ....  # process each extracted record here (e.g. print its fields)
}

    Therefore, you can create temporary HTML files, in real time, containing the source code of the target pages (after the WWW::Selenium object gets these pages) and pass them to the DEiXToBot agent to do the scraping job. Another interesting scenario is to download the pages locally with Selenium and then read/ scrape them directly from the disk at a later stage. We hope the above snippet helps. Please do not hesitate to contact us for any questions or feedback!

Saturday, February 2, 2013

Digital Preservation and ArchiveReady


Although our blog's main focus is scraping data from web information sources (especially via DEiXTo), we are also very interested in services and applications that can be built on top of agents and crawlers. Our favorite tools for programmatic web browsing are WWW::Mechanize and Selenium. The first is a handy Perl module (though it lacks JavaScript support) whereas the latter is a great browser automation tool that we have been using more and more lately in a variety of cases. Through them we can simulate whatever a user can do in a browser window and automate the interaction with pages of interest.
   Traversing a website is one of the most basic and common tasks for a developer of web robots. However, the methodologies used and the mechanisms deployed can vary a lot. So, we tried to think of a meaningful crawler-based scenario that would blend various "tasty" ingredients and come up with a nice story. Hopefully we did, and our post has four major pillars that we would like to highlight and discuss further below:
  • Crawling (through Selenium)
  • Sitemaps
  • Archivability
  • Scraping (in the demo that follows we download reports from a target website)
    An interesting topic we recently stumbled upon is digital preservation, which can be viewed as a series of policies and strategies necessary to ensure continued access to digital content over time, regardless of media failure and technological change. In this context, we discovered ArchiveReady, a remarkable web application that checks whether a website is easily archivable. This means that it scans a page and checks whether it's suitable for web archiving projects (such as the Internet Archive and BlogForever) to access and preserve. However, you can only pass one web page at a time to its checker (not an entire website) and it might take some time to complete, depending on the complexity and size of the page. Therefore, we thought it could be useful for those interested in testing multiple pages if we wrote a small script that parses the XML sitemap of a target site, checks each of the URLs contained in it against the ArchiveReady service and, at the same time, downloads the results.
    Sitemaps, as you probably already know, are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML file that lists URLs for a site along with some additional metadata about each URL so that search engines can more intelligently crawl the site. Typically sitemaps are auto-generated by plugins.
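    In its simplest form, a sitemap entry looks like this (the URL and date below are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/some-page</loc>
    <lastmod>2013-01-15</lastmod>
  </url>
</urlset>
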
   OK, enough with the talking. Let's get to work and write some code! The Perl modules we utilised for our purposes were WWW::Selenium and XML::LibXML. The steps we had to take were the following:
  • launch a Firefox instance 
  • read the sitemap document of a sample target website (we chose openarchives.gr)
  • pass each of its URLs to the ArchiveReady validation engine and finally
  • download locally the results returned in Evaluation and Report Language (EARL) format, since ArchiveReady offers this option
So, here is the code (just note that we wait till the validation page contains 8 "Checking complete" messages, one for each section, to determine whether the processing has finished):

use WWW::Selenium;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $dom = $parser->parse_file('http://openarchives.gr/sitemap.xml');
my @loc_elms = $dom->getElementsByTagName('loc');
my @urls;
for my $loc (@loc_elms) {
    push @urls,$loc->textContent;
}
# launch a Firefox instance
my $sel = WWW::Selenium->new( host => "localhost",
                              port => 4444,
                              browser => "*firefox",
                              browser_url => "http://archiveready.com/"
                            );
$sel->start;
# parse through the pages contained in the sitemap
for my $u (@urls) {
    $sel->open("http://archiveready.com/check?url=$u");
    my $content = $sel->get_html_source();
    while ( (() = $content =~ /Checking complete/g) != 8) { # check if complete
        sleep(1);
        $content = $sel->get_html_source();
    }
    unless ($content =~ m#href="/download-results\?test_id=(\d+)&amp;format=earl"#) {
        warn "Could not locate the EARL download link for $u\n"; # skip this URL if the link is missing
        next;
    }
    my $id = $1; # capture the identifier of the current validation test
    $sel->click('xpath=//a[contains(@href,"earl")]'); # click on the EARL link
    $sel->wait_for_page_to_load(5000);
    eval { $content = $sel->get_html_source(); };
    open my $fh, ">:utf8", "download_results_$id.xml" or die "Cannot open output file: $!"; # write the EARL report to a file
    print $fh $content;
    close $fh;
}
$sel->stop;

    We hope you found the above helpful and the scenario described interesting. We tried to take advantage of software agents/crawlers and use them creatively in combination with ArchiveReady, an innovative service that helps you strengthen your website's archivability. Finally, scraping and automated browsing have an extremely wide range of uses and applications. Please check out DEiXTo, our feature-rich web data extraction tool, and don't hesitate to contact us! Maybe we can help you with your tedious and time-consuming web tasks and data needs!

Thursday, January 24, 2013

Cloudify your browser testing (and scraping) with Sauce!

For quite some time now, along with our DEiXTo scraping software, we have been using Selenium, which is perhaps the best web browser automation tool currently available. It's really great and has helped us a lot in a variety of web data extraction cases (we published another post about it recently). We tried it locally as well as on remote GNU/Linux servers and we wrote code for a couple of automated tests and scraping tasks. However, it was not that easy to set everything up and get things running; we came across various difficulties (ranging from installation to stability issues, e.g. sporadic timeout errors), although we were eventually able to overcome most of them.
    Wouldn't it be great though if there were a robust framework that provided you with the necessary infrastructure and all possible browser/OS combinations and allowed you to run your Selenium tests in the cloud? You would not have to worry about setting a bunch of things up, installing updates, machine management, maintenance, etc. Well, there is! And it offers a whole lot more. Its name is Sauce Labs and it provides an amazing set of tools and features. Admittedly they have done awesome work and they bring great products to software developers. Moreover, their team seems to share some great values: pursuit of excellence, innovation and open source culture (among others).
    They offer a variety of pricing plans (a bit expensive in my opinion though), while the free account includes 100 automated code minutes for Windows, Linux and Android, 40 automated code minutes for Mac and iOS, and 30 minutes of manual testing. And for those contributing to an open source project that needs testing support, the Open Sauce plan is just for you (unlimited minutes at no cost!). Please note that the Selenium project is sponsored by Sauce Labs.


    Being a Perl programmer, I could not resist signing up and writing some Perl code to run a test on the ondemand.saucelabs.com host! I was already familiar with the WWW::Selenium CPAN module, so it was quite easy and straightforward. It should be noted that they provide useful guidelines and various examples online for multiple languages, e.g. Python, Java, PHP and others. Overall my test script worked pretty well, but it was a bit slow (compared to running the same code locally). However, one could improve speed by deploying lots of processes in parallel (if the use case scenario is suitable) and by disabling video (the script's execution and browser activity is recorded for easier debugging). Furthermore, Sauce's big advantage is that it can go large-scale, which is especially suited for complex cases with heavy requirements.
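    For the record, connecting WWW::Selenium to Sauce boils down to pointing it at their host and passing the desired browser/OS combination as a JSON string. The sketch below follows their Selenium RC examples of that era, so treat the exact keys, values and credentials as placeholders:

use WWW::Selenium;

my $sel = WWW::Selenium->new(
    host        => 'ondemand.saucelabs.com',
    port        => 80,
    # the JSON browser specification mirrors Sauce's old Selenium RC examples;
    # username, access key and the OS/browser values are placeholders
    browser     => '{"username": "YOUR_USERNAME", "access-key": "YOUR_ACCESS_KEY", '
                 . '"os": "Windows 2008", "browser": "firefox", "browser-version": "17", '
                 . '"name": "my first Sauce test"}',
    browser_url => 'http://example.com/',
);
$sel->start;
$sel->open('http://example.com/');   # from here on it is ordinary WWW::Selenium code
print $sel->get_title(), "\n";
$sel->stop;
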
     The bottom line is that the "Selenium - Sauce Labs" pair is remarkable and can be very useful in a wide range of cases and purposes. Sauce in particular offers developers an exciting way to cloudify and manage their automated browser testing (although we personally focus more on the scraping capabilities that these tools provide). Their combination with DEiXTo extraction patterns could definitely be very fertile and open new, interesting potentials. In conclusion, the uses and applications of web scraping are limitless and Selenium turns out to be a powerful tool in our quiver!

Sunday, January 13, 2013

Scraping PDF files


While puttering around on the Internet, I recently stumbled upon the website of the National Printing House ("Εθνικό Τυπογραφείο" in Greek) which is the public service responsible for the dissemination of Greek law. It publishes and distributes the government gazette and its website provides free access to all series of the Official Journal of the Hellenic Republic (ΦΕΚ).


    So, at its search page I noticed a section with the most popular issues. The most-viewed one, as of 13 Jan 2013, with 351,595 views was this: ΦΕΚ A 226 - 27.10.2011. Out of curiosity I decided to download it in order to take a quick look and see what it is all about. It was available in PDF format and it turned out to be an issue about the Economic Adjustment Programme for Greece, aiming to reduce its macroeconomic and fiscal imbalances. However, I was quite surprised to find that the text was contained in images and you could not perform any keyword search in it, nor could you copy-paste its textual content! I guess this is because the document's pages were scanned and converted to digital images.
    This instantly brought to my mind once again the difficulties that PDF scraping involves. From our web scraping experience there are many cases where the data is "locked" in PDF files e.g. in a .pdf brochure. Getting the data of interest out is not an easy task but quite a few tools (pdftotext is one of them) have popped up over the years to ease the pain. One of the best tools I have encountered so far is Tesseract, a pretty accurate open source OCR engine currently maintained by Google.


    So, I thought it would be nice to put Tesseract into action and check its efficiency against this PDF document (that so dramatically affects the lives of all Greeks..). It worked quite well, although not perfectly (probably because of the Greek language), and a few minutes later (after converting the PDF to a TIFF image through Ghostscript) I had the full text, or at least most of it, in my hands. The output text file generated can be found here. The truth is that I could not do much with it (and the austerity measures were harsh..) but at least I was happy that I was able to extract the largest part of the text.
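    For the curious, the conversion and OCR steps can be scripted in a couple of lines; the flags below reflect typical usage rather than the exact commands I ran, and the file names are placeholders:

# PDF -> multi-page TIFF with Ghostscript, then OCR with Tesseract's Greek language data
system('gs', '-dNOPAUSE', '-dBATCH', '-sDEVICE=tiffg4', '-r300',
       '-sOutputFile=fek.tif', 'fek.pdf');
system('tesseract', 'fek.tif', 'fek', '-l', 'ell');   # the recognised text ends up in fek.txt
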
    Of course this is just an example, there are numerous PDF files out there containing rich, inaccessible data that could potentially be processed and further utilised e.g. in order to create a full text search index. DEiXTo, our beloved web data extraction tool, can scrape only HTML pages. It cannot deal with PDF files residing on the Web. However, we do have the tools and the knowledge to parse those as well, find bits of interest and unleash their value!

Friday, January 11, 2013

Selenium: a web browser automation companion for DEiXTo


 Selenium is probably the best web browser automation tool we have come across so far. Primarily it is intended for automated testing of web applications but it's certainly not limited to that; it provides a suite of free software tools to automate web browsers across many platforms. The range of its use case scenarios is really wide and its usefulness is just great.
    However, as scraping experts, we inevitably focus on using Selenium for web data extraction purposes. Its functionality-rich client API can be used to launch browser instances (e.g. Firefox processes) and simulate, through the proper commands, almost everything a user could do on a web site/page. Thus, it allows you to deploy a fully-fledged web browser and overcome the difficulties that arise from heavy JavaScript/AJAX use. Moreover, via the virtual framebuffer X server (Xvfb), one could automate browsers without the need for an actual display and create scripts/services running periodically or at will on a headless server, e.g. on a remote GNU/Linux machine. Therefore, Selenium could successfully be used in combination with DEiXToBot, our beloved Mechanize-based scraping module.
    For example, the Selenium-automated browser could fetch a target page after a couple of steps (like clicking a button/ hyperlink, selecting an item from a drop-down list, submitting a form, etc.) and then pass it to DEiXToBot (which lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens new potential for DEiXTo wrappers.
    The Selenium Server component (formerly the Selenium RC Server) as well as the client drivers that allow you to write scripts that interact with the Selenium Server can be found here. We have used it quite a few times for various cases and the results were great. In conclusion, Selenium is an amazing "weapon" added to our arsenal and we strongly believe that along with DEiXTo it boosts our scraping capabilities. If you have an idea/ project that involves web browser automation or/ and web data extraction, we would be more than glad to hear from you!