Friday, January 11, 2013

Selenium: a web browser automation companion for DEiXTo

 Selenium is probably the best web browser automation tool we have come across so far. Primarily it is intended for automated testing of web applications but it's certainly not limited to that; it provides a suite of free software tools to automate web browsers across many platforms. The range of its use case scenarios is really wide and its usefulness is just great.
    However, as scraping experts, we inevitably focus on using Selenium for web data extraction purposes. Its functionality-rich client API can be used to launch browser instances (e.g. Firefox processes) and simulate, through the proper commands, almost everything a user could do on a web site/ page. Thus, it allows you to deploy a fully-fledged web browser and surpass the difficulties that pop up from heavy JavaScript/ AJAX use. Moreover, via the virtual framebuffer X server (Xvfb), one could automate browsers without the need for an actual display and create scripts/ services running periodically or at will on a headless server e.g. on a remote GNU/Linux machine. Therefore, Selenium could successfully be used in combination with DEiXToBot, our beloved Mechanize scraping module. 
    For example, the Selenium-automated browser could fetch a target page after a couple of steps (like clicking a button/ hyperlink, selecting an item from a drop-down list, submitting a form, etc.) and then pass it to DEiXToBot (which lacks JavaScript support) to do the scraping job through DOM-based tree patterns previously generated with the GUI DEiXTo tool. This is particularly useful for complex scraping cases and opens new potential for DEiXTo wrappers.
    The Selenium Server component (formerly the Selenium RC Server) as well as the client drivers that allow you to write scripts that interact with the Selenium Server can be found here. We have used it quite a few times for various cases and the results were great. In conclusion, Selenium is an amazing "weapon" added to our arsenal and we strongly believe that along with DEiXTo it boosts our scraping capabilities. If you have an idea/ project that involves web browser automation or/ and web data extraction, we would be more than glad to hear from you!

No comments:

Post a Comment