Saturday, February 11, 2012

DEiXTo components clarified

Judging from the emails and feedback we receive, many people are unsure how the DEiXTo GUI tool differs from the Perl command line executor (CLE) in purpose and functionality. DEiXToBot confuses quite a few users even more. So, let's clarify things.
    The GUI tool is freeware (available at no cost but without source code, at least not yet) and it allows you to visually build and execute extraction rules for web pages of interest with point-and-click convenience. It offers an embedded web browser and a friendly graphical interface, so that an element/record instance is highlighted as the mouse moves over it. The GUI tool is a Windows-only application that harnesses Internet Explorer's HTML parser and rendering engine. It is worth noting that it can support simple cooperative extraction scenarios as well as periodic, scheduled execution through batch files and the Windows Task Scheduler. Perhaps its main drawback is that it can execute just one pattern on a page, although in several cases (maybe the majority) a single extraction rule is enough to get the job done.
    On the other hand, the command line executor, or CLE for short, is implemented in Perl and is freely distributed under the GNU General Public License v3, so its source code is included. Its purpose is to execute wrapper project files (.wpf) that have previously been created with the GUI tool. It runs in a DOS prompt window or in a Linux/Mac terminal. Besides the source code, we have also built two standalone executables so that you can run CLE on either a Windows or a GNU/Linux machine without having Perl or any prerequisite modules installed. Compared to the GUI tool, CLE is faster, offers more output formats and has some additional features, such as an efficient post-processing mechanism and database support. However, it shares the same shortcoming as the GUI tool: it supports just one pattern on a page. Finally, it relies on DEiXToBot, a "homemade" package that facilitates the execution of wrappers generated with the DEiXTo GUI.
    DEiXToBot is the third and probably the most powerful and well-crafted software component of the DEiXTo scraping suite, and it is available under the GPL v3 license. It is a Perl module based on WWW::Mechanize::Sleepy, a handy web browser Perl object, and several other CPAN modules. It allows extensive customization and tailor-made solutions, since it lets you combine multiple extraction rules/patterns and post-process their results through custom code. Therefore, it can deal with complex cases and cover more advanced web scraping needs. However, using it requires programming skills.
    The bottom line is that DEiXToBot is the essence of our long experience. The GUI tool might be more suitable for most everyday users (due to its visual convenience), but when things get difficult or the situation requires a more advanced solution (e.g. scheduled or on-demand execution and coordination of multiple wrappers on a GNU/Linux server), a customized DEiXToBot-based script is the way to go. You can use the GUI tool first to create the necessary patterns and then deploy a Perl script that uses them to extract structured data from the pages of the target website. So, if you are familiar with Perl, you should not find it very hard to write your first DEiXTo-based spider/crawler!
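    To give a rough idea of what scheduled execution of several wrappers could look like, here is a minimal sketch in Python. It is not part of DEiXTo. It only collects the .wpf files in a folder and builds one command line per wrapper, which a cron job or the Windows Task Scheduler could then trigger. The executor command passed in (`perl cle.pl` in the example) and the idea that the executor takes the .wpf path as its only argument are assumptions made for illustration; please check the CLE documentation for the actual invocation.

```python
from pathlib import Path
import subprocess
from typing import List, Sequence


def find_wrappers(directory: Path) -> List[Path]:
    """Return all wrapper project files (.wpf) in a directory, sorted by name."""
    return sorted(directory.glob("*.wpf"))


def build_commands(executor: Sequence[str], wrappers: Sequence[Path]) -> List[List[str]]:
    """Build one command line per wrapper.

    ASSUMPTION: the executor accepts the .wpf path as its only argument.
    """
    return [list(executor) + [str(wrapper)] for wrapper in wrappers]


def run_all(commands: Sequence[Sequence[str]], dry_run: bool = True) -> List[int]:
    """Run the commands one after another and return their exit codes.

    With dry_run=True the commands are only printed and 0 is recorded for each.
    """
    exit_codes = []
    for command in commands:
        if dry_run:
            print(" ".join(command))
            exit_codes.append(0)
        else:
            exit_codes.append(subprocess.run(list(command)).returncode)
    return exit_codes


if __name__ == "__main__":
    # Hypothetical executor command and folder name; replace with your own.
    hypothetical_executor = ["perl", "cle.pl"]
    wrappers = find_wrappers(Path("wrappers"))
    run_all(build_commands(hypothetical_executor, wrappers), dry_run=True)
```

    The dry run is the default on purpose, so you can inspect the generated command lines before letting a scheduler execute them for real. Keep in mind that each wrapper still runs a single pattern per page; combining several patterns on the same page remains a job for a DEiXToBot-based Perl script.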

2 comments:

  1. Please make a diagram of the architecture describing deixto components.

    Also, a data flow diagram would help.

    Thank you

  2. Awesome tool, it worked on my first try.
