Friday, January 17, 2014

About web proxies

Rightly or wrongly, there are times when one would like to conceal one's IP address, especially while scraping a target website. Perhaps the most popular way to do that is by using web proxy servers. A proxy server is a computer system or an application that acts as an intermediary for requests from clients seeking resources from other servers. Thus, web proxies allow users to mask their true IP address and surf the web anonymously. Personally, though, we are mostly interested in their use for web data extraction and automated systems in general. So we searched Google for notable proxy service providers, but surprisingly the majority of the results were dubious websites of low trustworthiness and low Google PageRank scores. However, a few stood out from the crowd. We will name two: a) HideMyAss (or HMA for short) and b) Proxify.


    HMA provides (among other things) a large real-time database of free, working public proxies. These proxies are open to everyone and vary in speed and anonymity level. Nevertheless, free shared proxies have certain disadvantages, mostly in terms of security and privacy. They are third-party proxies, and HMA cannot vouch for their reliability. On the other hand, HMA offers a powerful Pro VPN service which encrypts all of your internet traffic and, unlike a web proxy, automatically works with all applications on your computer (whereas web proxies typically work with web browsers like Firefox or Chrome and utilities like cURL or GNU Wget). However, the company's policy and Pro VPN's terms of use are not robot-friendly, so using Pro VPN for web scraping might trigger an abuse warning and result in the suspension of the account.
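To make the idea of sending requests through a web proxy more concrete, here is a minimal Python sketch using only the standard library. The proxy address is a placeholder from a documentation-only IP range and `build_proxy_opener` is a name we made up for illustration; neither comes from HMA or Proxify.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Placeholder address (documentation-only range), NOT a real proxy.
PROXY_URL = "http://203.0.113.10:8080"
opener = build_proxy_opener(PROXY_URL)

# With a working proxy in PROXY_URL, a page could be fetched like this:
# with opener.open("http://example.com/", timeout=10) as response:
#     print(response.status)
```

The target server then sees the proxy's IP address instead of ours, which is exactly the masking effect described above.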


    The second high-quality proxy service that we found was Proxify. It offers three packages: Basic, Pro and SwitchProxy. The last one is very fast and is intended for web crawling and automated systems of any scale. Since we are mostly interested in web scraping, SwitchProxy is the tool that suits us best. It provides a rich set of features and gives access to 1296 "satellites" in 279 cities in 74 countries worldwide. It also offers an automatic IP change mechanism that runs either after each request (assigning a random IP address each time) or once every 10 minutes (scheduled rotation). It therefore seems a great option for scraping purposes, maybe the best out there. However, it is quite expensive, with plans starting at $100 per month. Additionally, Proxify provides some nice code examples showing how to integrate SwitchProxy with your own program or web robot. As far as WWW::Mechanize and Selenium are concerned (these two are our favorite web browsing tools), it is easy and straightforward to combine them with SwitchProxy.
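The two rotation modes mentioned above (a new random IP per request, or a scheduled change every 10 minutes) can be imitated on the client side when you manage your own list of proxies. The sketch below is only an illustration of the idea, not SwitchProxy's actual mechanism or API; the `ProxyRotator` class, its parameters and the proxy addresses are all our own assumptions.

```python
import random
import time


class ProxyRotator:
    """Hypothetical client-side rotator over a list of proxy URLs."""

    def __init__(self, proxies, mode="per_request", interval=600, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if mode not in ("per_request", "scheduled"):
            raise ValueError("mode must be 'per_request' or 'scheduled'")
        self.proxies = list(proxies)
        self.mode = mode
        self.interval = interval  # seconds; 600 = 10 minutes
        self.clock = clock        # injectable so the timing can be tested
        self._current = random.choice(self.proxies)
        self._picked_at = self.clock()

    def next_proxy(self):
        """Return the proxy to use for the next request."""
        now = self.clock()
        expired = now - self._picked_at >= self.interval
        if self.mode == "per_request" or expired:
            # Random choice, so the same proxy may occasionally repeat.
            self._current = random.choice(self.proxies)
            self._picked_at = now
        return self._current


# Placeholder addresses (documentation-only range), NOT real proxies.
rotator = ProxyRotator(
    ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    mode="scheduled",
)
print(rotator.next_proxy())
```

In "scheduled" mode the same proxy is returned until the interval has elapsed, while "per_request" mode picks a random proxy on every call.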


    Finally, we would like to stress once again the access restrictions and terms of use that many websites impose. Before launching a scraper, make sure you check the target site's robots.txt file as well as its copyright notice. We wrote a relevant post about this topic some time ago; perhaps you would like to check it out too.
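As a rough illustration of the robots.txt check, Python's standard `urllib.robotparser` module can tell whether a given user agent may fetch a URL. The sample rules, the `my-scraper` user agent and the `is_allowed` helper below are made up for the example; in practice you would download the real robots.txt of the target site first.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "http://example.com/private/data.html"))
```

Note that robots.txt only covers crawling rules; the terms of use and the copyright notice still have to be read by a human.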
