A serious matter that many people ignore, deliberately or not, is the access and copyright restrictions that many website owners and administrators impose. A lot of websites want robots out entirely. The mechanism they use to keep cooperating web robots away from certain site content is a robots.txt file, which resides in the site's root directory and functions as a request that visiting bots ignore specified files or directories.
For example, the following two lines indicate that robots should not visit any page on the site:
User-agent: *
Disallow: /
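A site can also restrict only part of its content. For instance, the following rules (using a hypothetical /private/ directory as an example) would ask all robots to stay out of that directory while leaving the rest of the site accessible:
User-agent: *
Disallow: /private/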
However, a large number of scraping agents violate the restrictions set by vendors and content providers. This is an important issue and it raises significant legal concerns. There has been an ongoing war between bots and websites with strict terms of use. The latter deploy various technical measures to stop robots (an excellent white paper about detecting and blocking site scraping attacks is here) and sometimes even take legal action and resort to the courts. There have been many cases over recent years with contradictory decisions, so the whole issue remains unclear. You can read more about it in the relevant section of the "Web scraping" Wikipedia article. Both sides have their arguments, so it is not at all an easy verdict.
The DEiXTo command line executor respects the robots.txt file of potential target websites by default (through the WWW::RobotRules Perl module). Nevertheless, you can override this behavior (at your own risk!) by setting the -nice parameter to 0. It is strongly recommended, though, that you comply with webmasters' requests and stay out of pages with access restrictions.
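To illustrate the kind of check involved, here is a minimal Perl sketch of how robots.txt rules can be consulted with WWW::RobotRules. The agent name and URLs are placeholders, and the actual DEiXTo code may well differ:
use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);
# Placeholder agent name; DEiXTo uses its own identifier.
my $rules = WWW::RobotRules->new('ExampleBot/1.0');
# Fetch and parse the site's robots.txt (placeholder URL).
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;
# Check whether a specific page may be visited before scraping it.
my $target = 'http://www.example.com/some/page.html';
if ($rules->allowed($target)) {
    print "Allowed to fetch $target\n";      # safe to proceed with extraction
} else {
    print "robots.txt disallows $target\n";  # skip this page
}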
Generally speaking, data copyright is a HUGE issue, especially in today's Web 2.0 era, and has sparked endless discussions and spawned numerous articles, opinions, licenses, disputes and legitimacy issues.
It is also worth mentioning that there is currently a strong movement in favor of openness in data, standards and software, and according to many, openness fosters innovation and promotes transparency and collaboration.
Finally, we suggest that everyone using web data extraction tools comply with the terms of use that websites set and think twice before deploying a scraper, especially if the data is going to be used for commercial purposes. A good practice is to contact the webmaster and ask for permission to access and use their content. Quite often a website may be interested in such cooperation, mostly for marketing and advertising reasons. So, as soon as you get a "green light", start building your scraper with DEiXTo and we are here to help you!