Recently we talked about Selenium and its potential combination with DEiXTo. Selenium is a remarkable browser automation tool with numerous uses and applications. If you are wondering how to programmatically pass pages fetched with Selenium to DEiXToBot on the fly, here is one way (provided you are familiar with Perl programming):
use File::Temp qw(tempfile); # core module, used here because POSIX::tmpnam() was removed in Perl 5.26
# suppose that you have already fetched the target page with the WWW::Selenium object ($sel variable)
my $content = $sel->get_html_source(); # get the page source code
# create a temporary file (in the system temp directory) containing the page's source code
my ($fh, $name) = tempfile(SUFFIX => '.html');
print $fh $content;
close $fh;
$agent->get("file://$name"); # load the temporary file/page with the DEiXToBot agent using the file:// scheme
unlink $name; # delete the temporary file, it is not needed any more
if (! $agent->success) { die "Could not fetch the temp file!"; }
$agent->load_pattern('pattern.xml'); # load the pattern built with the GUI tool
$agent->build_dom(); # build the DOM tree of the page
$agent->extract_content(); # apply the pattern
my @records = @{$agent->records};
for my $record (@records) { # loop through the scraped records
    ....
}
In other words, you can create temporary HTML files in real time that contain the source code of the target pages (after the WWW::Selenium object has fetched them) and pass them to the DEiXToBot agent to do the scraping. Another interesting scenario is to download the pages locally with Selenium and then read and scrape them directly from disk at a later stage. We hope the above snippet helps. Please do not hesitate to contact us with any questions or feedback!
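
The second scenario mentioned above (save the pages to disk first, scrape them later) could be sketched roughly as follows. This is only an illustration under some assumptions: the helpers save_page and local_url, the 'pages' directory and the file name 'page1.html' are hypothetical names introduced here. The sketch also assumes that the page source is a decoded character string and that paths are Unix-style absolute paths. The Selenium and DEiXToBot calls are shown only as comments and reuse the methods from the snippet above.

```perl
use strict;
use warnings;
use File::Spec;
use File::Path qw(make_path);

# Hypothetical helper: write a page's source into $dir/$filename
# and return the absolute path of the saved file.
sub save_page {
    my ($dir, $filename, $content) = @_;
    make_path($dir) unless -d $dir;    # create the target directory if needed
    my $path = File::Spec->rel2abs(File::Spec->catfile($dir, $filename));
    # Assumption: $content is a character string, so it is encoded as UTF-8 on output
    open(my $fh, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$fh} $content;
    close($fh) or die "Cannot close $path: $!";
    return $path;
}

# Hypothetical helper: turn an absolute Unix-style path into a file:// URL.
sub local_url {
    my ($path) = @_;
    return "file://$path";
}

# Stage 1 (while the Selenium session is running):
#   my $path = save_page('pages', 'page1.html', $sel->get_html_source());
#
# Stage 2 (later, no Selenium needed):
#   $agent->get(local_url($path));
#   die "Could not fetch the local file!" unless $agent->success;
#   $agent->load_pattern('pattern.xml');
#   $agent->build_dom();
#   $agent->extract_content();
```

Keeping the download stage separate from the scraping stage means a pattern can be adjusted and re-applied to the saved pages without launching the browser again.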