Monday, September 9, 2013

Using XPath for web scraping

Over the last few years we have worked quite a bit on aggregators that periodically gather information from multiple online sources. We usually write our own custom code and mostly use DOM-based extraction patterns (built with our homemade DEiXTo GUI tool), but we also turn to other technologies and useful tools, where possible, to get the job done and make our scraping tasks easier. One of them is XPath, a query language defined by the W3C for selecting nodes from an XML document. Note that an HTML page (even a malformed one) can be represented as a DOM tree, and thus treated as an XML document. XPath is quite effective, especially for relatively simple scraping cases.

    Suppose for instance that we would like to retrieve the content of an article/post/story on a specific website or blog. Of course this scenario could be extended to several posts from many different sources and scaled up. Typically the body of a post resides in a DIV (or some other type of) HTML element with a particular attribute value (the same holds for the post title). Therefore, the text content of a post is usually found in an HTML segment like the following (especially if you consider that numerous blogs and websites live on platforms like Blogger or WordPress.com and share a similar layout):

<div class="post-body entry-content" ...>the content we want</div>

    DEiXTo tree rules are more suitable and efficient when there are multiple structure-rich record occurrences on a page. So, if you only need a specific div element, it is better to stick with an XPath expression. It's pretty simple and it works. Then you could do some post-processing on the scraped data and further utilise it with other techniques, e.g. using regular expressions on the inner text to identify dates, places or other pieces of interest, or parsing the outer HTML code of the selected element with a specialised tool looking for interesting stuff. So, instead of creating a rule with DEiXTo for the case described above, we could just use an XPath selector such as //div[@class="post-body entry-content"] to select the proper element and access its contents.
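    For illustration, here is a minimal sketch of this approach in Python, assuming the requests and lxml libraries; the URL and the date regular expression are placeholders, not part of any real setup:

# Minimal sketch: fetch a page, select the post body with XPath, then
# post-process the text with a regular expression (hypothetical URL/pattern).
import re

import requests
from lxml import html

URL = "http://example-blog.blogspot.com/2013/09/some-post.html"  # placeholder URL

response = requests.get(URL, timeout=30)
tree = html.fromstring(response.content)

# Select the post body DIV by its class attribute, as in the example above.
nodes = tree.xpath('//div[@class="post-body entry-content"]')
if nodes:
    text = nodes[0].text_content().strip()
    print(text[:200])  # first 200 characters of the post text

    # Post-processing example: a simple regex to spot dates like 09/09/2013.
    dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text)
    print("Dates found:", dates)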

    We have actually used this simple but effective technique repeatedly for myVisitPlanner, a project funded by the Greek Ministry of Education that aims at creating a personalised system for planning cultural itineraries. The main content of event pages (related to music, theatre, festivals, exhibitions, etc.) is systematically extracted from a wide variety of local websites (most lacking RSS feeds and APIs) in order to automatically monitor and aggregate event information. Rather than show more code here, we would like to point to an amazing blog dedicated to web scraping which gives a nice code example of using XPath in screen scraping: extract-web-data.com. It provides a lot of information about web data extraction techniques and covers plenty of relevant tools. It's a nice, thorough and well-written read.
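    To give a sense of how per-site XPath selectors could drive such an aggregator, here is a hypothetical sketch; the site labels, URLs and expressions below are invented for illustration and are not the actual myVisitPlanner selectors:

# Hypothetical sketch of a small event aggregator driven by per-site XPath
# selectors (all sources and expressions below are made up for illustration).
import requests
from lxml import html

SOURCES = {
    # site label: (event page URL, XPath to the main content element)
    "city-theatre": ("http://example-theatre.gr/events", '//div[@class="event-body"]'),
    "music-hall": ("http://example-hall.gr/calendar", '//div[@id="main-content"]'),
}

def fetch_event_text(url, xpath_expr):
    """Download a page and return the text of the first node matching the XPath."""
    tree = html.fromstring(requests.get(url, timeout=30).content)
    nodes = tree.xpath(xpath_expr)
    return nodes[0].text_content().strip() if nodes else None

for label, (url, xpath_expr) in SOURCES.items():
    text = fetch_event_text(url, xpath_expr)
    print(label, "->", (text[:80] + "...") if text else "no match")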


    Anyway, if you need some web data just for personal use, or your boss asked you to get it, why not consider using DEiXTo or one of the other remarkable software tools out there? The use case scenarios are limitless and we are sure you could come up with a useful and interesting one.

1 comment:

  1. I am so interested in using the XPath selector because sometimes the DOM changes among URLs. However, I don't know how to manually add XPath code into the XML rules. Can you give me an example of an XML file using XPath that can be loaded as an Extraction Pattern, so that I can learn?
    Thank you.
