Monday, January 2, 2012

Cooperating DEiXTo agents

Broadly speaking, there are two major categories of cooperating DEiXTo wrappers. In the first, the wrappers are executed and applied on the same, single page so as to capture bits of interest that are scattered all over that particular target page. The second category comprises cases where the output of one wrapper serves as input for another; typically, the output of the first wrapper is a txt file containing the target URLs that lead to pages with detailed information.
    The first category is not supported directly by the GUI tool. However, DEiXToBot (a Mechanize agent object capable of executing extraction rules previously built with the GUI tool) allows you to combine multiple extraction rules/patterns on the same page and merge their results through Perl code. So, if you have come across a complex, data-rich page and you are fluent with Perl and DEiXToBot's interface, you can build the necessary tree patterns separately with the GUI tool and then write a set of cooperating Perl robots that capture all the desired data with high efficiency. It is not easy though, since it requires programming skills and custom code.
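    To give a rough idea of what such a combination might look like, here is a minimal Perl sketch that applies two separately built tree patterns to the same fetched page and merges their results. Please treat it as an illustration only: the calls on the DEiXToBot object (get, load_pattern, extract) and the pattern file names are assumptions and may differ from the module's actual interface.

#!/usr/bin/perl
use strict;
use warnings;
use DEiXToBot;

# Illustrative sketch only: the method names below (get, load_pattern,
# extract) are assumptions and may not match DEiXToBot's real interface;
# the pattern files and URL are hypothetical.
my $agent = DEiXToBot->new();
$agent->get('http://www.example.com/data-rich-page.html');

# Apply two tree patterns, built separately with the GUI tool,
# to the very same page.
$agent->load_pattern('titles_pattern.xml');
my @titles = $agent->extract();

$agent->load_pattern('prices_pattern.xml');
my @prices = $agent->extract();

# Combine the two result sets with plain Perl code.
for my $i (0 .. $#titles) {
    print join("\t", $titles[$i], $prices[$i] // ''), "\n";
}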
    As far as the second type of collaboration is concerned, we have stumbled upon numerous cases where a first wrapper collects the detailed target URLs from listing pages and passes them to a second wrapper, which in turn takes over and gathers all the data of interest from the pages containing the full text/description. A typical case would be a blog, a news site or an e-shop, where a first agent scrapes the URLs of the detailed pages and a second one visits each of them, extracting every piece of desired information. If you are wondering how to set a DEiXTo wrapper to visit multiple target pages, this can be done either through a text file containing their addresses or via a list; both can be specified in the Project Info tab of the DEiXTo GUI tool.
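    As a rough sketch of this handover, the snippet below shows how a second agent might read the URL list produced by the first wrapper and visit every page in turn. It uses plain WWW::Mechanize; the file name target_urls.txt is just the example used further down, and the extraction step itself is left as a placeholder for the GUI-built detail pattern.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Sketch of the second wrapper: read the URL list produced by the
# first wrapper and visit each detailed page in turn.
my $mech = WWW::Mechanize->new( autocheck => 0 );

open my $fh, '<', 'target_urls.txt' or die "Cannot open URL list: $!";
while ( my $url = <$fh> ) {
    chomp $url;
    next unless length $url;
    $mech->get($url);
    next unless $mech->success;
    # ... apply the detailed-page extraction rule here ...
    printf "Fetched %s (%d bytes)\n", $url, length $mech->content;
}
close $fh;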
    Moreover, for the first wrapper, which is intended to scrape the URLs, you only have to create a pattern that locates the links pointing to the detailed pages. Usually this is easy and straightforward: just point at a representative link, use it as a record instance and set the A rule node as "checked" (right click on the A node and select "Match and Extract Content"). The resulting pattern is typically a small tree with the checked A node plus enough surrounding structure to match only the links of interest.
    Then, by executing the rule, you can extract the "href" attribute (essentially the URI) of each matching link and export the results to a txt file, say target_urls.txt, which will subsequently be fed to the next wrapper. Please note that if you provide just the A rule node as a pattern, you will capture ALL the hyperlinks found on the page, which is probably not what you want (only those leading to the detailed pages are needed).
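    For comparison, the following sketch does the first wrapper's job in plain Perl with WWW::Mechanize instead of a GUI-built pattern: it keeps only the links whose address looks like a detailed page and writes them to target_urls.txt. The listing URL and the url_regex are assumptions and would have to be adapted to the target site.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Sketch of the first wrapper's job: collect only the links that
# lead to the detailed pages, not every hyperlink on the page.
# The listing URL and the url_regex are assumptions; adjust both
# to the site at hand.
my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/listing.html');

my @links = $mech->find_all_links( url_regex => qr{/article/\d+} );

open my $out, '>', 'target_urls.txt' or die "Cannot write URL list: $!";
print {$out} $_->url_abs, "\n" for @links;
close $out;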
    In conclusion, DEiXTo can power schemes of cooperating robots and achieve very high precision. For more advanced cases, synergies of multiple wrappers are often indispensable, although their coordination usually requires some careful thought and effort. Should you have any questions, please do not hesitate to contact us!
