Friday, March 9, 2012

DEiXTo powers OPEN-SME

We are happy to announce that DEiXTo is going to power OPEN-SME, an exciting EU-funded project that promotes software reuse among small and medium-sized software enterprises (SMEs). OPEN-SME is coordinated by the Greek Association of Computer Engineers and aims to develop a set of methodologies, tools and business models centered on SME Associations, which will enable software SMEs to introduce open source software reuse practices effectively into their production processes.

DEiXTo-based wrappers have been successfully deployed to enable the project's federated search engine, called OCEAN (developed by the Department of Informatics of the Aristotle University of Thessaloniki), to search, simultaneously and in real time, existing open source software search engines that do NOT offer API access (namely Koders and Krugle). To achieve this, custom Perl code was written to submit the user-specified queries to the native websites and scrape the first N results returned into a suitable form.
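
The federated search idea described above can be sketched roughly as follows. This is only an illustration in Python, not the Perl code used in the project and not OCEAN's actual implementation. The names `make_stub_wrapper` and `federated_search`, the engine names and the canned titles are all hypothetical. The stub wrappers stand in for real DEiXTo-based wrappers, which would submit the query to each website and scrape the result page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A wrapper takes (query, n) and returns up to n result records.
Wrapper = Callable[[str, int], List[dict]]


def make_stub_wrapper(engine: str, titles: List[str]) -> Wrapper:
    """Build a stand-in wrapper over canned titles (hypothetical).

    A real wrapper would submit the query to the engine's website
    and scrape the returned result page instead.
    """
    def wrapper(query: str, n: int) -> List[dict]:
        hits = [t for t in titles if query.lower() in t.lower()]
        # Keep only the first n results, as the post describes.
        return [{"engine": engine, "title": t} for t in hits[:n]]
    return wrapper


def federated_search(query: str, wrappers: Dict[str, Wrapper], n: int = 10) -> List[dict]:
    """Query every wrapper concurrently and merge the first n results of each."""
    merged: List[dict] = []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        # Submit all queries up front so the sources are searched in parallel.
        futures = {name: pool.submit(w, query, n) for name, w in wrappers.items()}
        for name, future in futures.items():
            try:
                merged.extend(future.result(timeout=10))
            except Exception:
                # One failing or slow source should not break the whole search.
                continue
    return merged


if __name__ == "__main__":
    wrappers = {
        "engine_a": make_stub_wrapper("engine_a", ["JSON parser", "XML parser", "HTTP client"]),
        "engine_b": make_stub_wrapper("engine_b", ["Parser generator", "Image codec"]),
    }
    for hit in federated_search("parser", wrappers, n=1):
        print(hit["engine"], "-", hit["title"])
```

Running the sources in a thread pool keeps the total response time close to that of the slowest source rather than the sum of all of them, which matters when results are fetched in real time.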

We are really glad to be participating in this challenging and innovative project, and we hope that DEiXTo will help OPEN-SME achieve its goals. So, if you are looking for a web scraping framework to power your aggregator or search engine, please do not hesitate to contact us!

Monday, March 5, 2012

Uses and applications of web scraping

Some people wonder what the uses of web scraping might be. Well, your imagination is the only limit (along with copyright notices, perhaps). There is a huge wealth of data out there and many believe that the open Web is a real goldmine. So, web data extraction tools, and DEiXTo in particular, could help you unlock this treasure and give rise to innovations, applications and new ideas.
    Public institutions, companies and organizations, entrepreneurs, professionals, as well as ordinary citizens and users generate an enormous amount of information every single day. The question is: how effectively is it being used? In this respect, web content extraction can prove a valuable ally. Together with data mining, it has much to offer in every field you can imagine. The following are only some of the uses of web scraping:
  • collect properties from real estate listings
  • scrape retailer sites on a daily basis
  • extract offers and discounts from deal-of-the-day websites
  • gather data for hotels and vacation rentals
  • scrape job postings and internships
  • crawl forums and social sites so as to enable analysis and post-processing of their rich data
  • power aggregators and product search engines
  • monitor your online reputation and check what is being said about you or your brand
  • quickly populate product catalogues with full specifications
  • monitor prices of the competition
  • scrape the content of digital libraries in order to transform it into suitable, structured forms
  • collect and aggregate government and public data
  • search (in real time) bibliographic databases and online sources that don't offer an API, thus powering federated search engines
  • find educational material and information across traditional formal higher education subjects and real-life contexts in order to help today's learners
  • power mobile applications
  • help build geolocation apps (e.g. extracting addresses available on web pages and using their coordinates to build meaningful maps with points of interest)
  • prepare large, focused datasets for scientific tasks (e.g. data mining)
  • extract and summarize large volumes of text (e.g. summarizing product reviews)
  • <your scraping task goes here!>
    This list can grow very long. There are countless use cases and potential scenarios, whether business-oriented or non-profit. Access and copyright restrictions are a significant issue that has raised a lot of discussion and controversy. However, the opinion that seems to be gaining ground is that (well-intentioned) web scraping is legal, since the data is publicly and freely available on the Web. So, let your creativity and imagination loose; DEiXTo can probably help you achieve the goals of your scraping-based project. We would be more than happy to hear from you.