Web Scraping

 


What is Web Scraping?

 

Web scraping, sometimes called data harvesting or site scraping, is a technique for retrieving data from one or more websites.  Scraping can be done by manually copying material from a website, but most scraping is carried out by software that automatically accesses websites and retrieves their data.
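
To illustrate how software can retrieve data automatically, here is a minimal sketch in Python that pulls every link out of a page. It uses only the standard library's html.parser module and works on an inline sample page rather than a live website. The names LinkCollector, extract_links, and SAMPLE_PAGE, along with the /jobs/ paths, are hypothetical and chosen only for illustration; real scraping tools are considerably more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page standing in for a downloaded job-listing site.
SAMPLE_PAGE = """
<html><body>
  <a href="/jobs/1">Job one</a>
  <a href="/jobs/2">Job two</a>
  <p>No link here</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
```

A scraper built along these lines would typically download many pages in a loop and feed each one through a parser like this, which is what makes automated scraping so much faster than copying by hand.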

 

Relative to the amount of scraping that occurs on the internet, there are surprisingly few lawsuits, in part because this kind of data retrieval sits on the border between legal and illegal activity.

 

Examples of Web Scraping

 

Scraping takes many forms; the examples below are among the most common:

 

·         harvesting email addresses from websites in order to send spam

·         collecting user information from social networking websites

·         obtaining price and product information from a competitor’s website in order to undercut its sales

·         plagiarizing information from articles, blog posts, and published research

·         relisting a website’s listings, such as job postings or phone listings

 

Example of a Lawsuit Involving Scraping

 

Feist Publications v. Rural Telephone Service Company

 

Rural Telephone Service Company is a public utility that provides telephone service to communities in northwest Kansas.  The company publishes a telephone directory containing white and yellow pages.  Feist Publications, Inc. publishes similar directories that cover a much larger geographic area.

 

Rural refused to license its white pages to Feist, so Feist extracted the listings from Rural’s directory anyway.  Feist altered many of these listings, but some were left unchanged, so Rural sued Feist for copyright infringement.

 

The Supreme Court ruled that the names, towns, and telephone numbers used by Feist were facts that did not originate with Rural.  Therefore, this public information was not protected under copyright law.  The Court further held that Rural’s white pages lacked the minimal originality needed for copyright protection because Rural had not selected, coordinated, or arranged the listings in an original way.

 

Numerous plaintiffs have, however, sued over web scraping and won.  Companies such as American Airlines, Southwest Airlines, eBay, and Facebook have taken action against websites that harvested their data.

 

Ways to Prevent Web Scraping

 

The first and most obvious way to stop scraping is to threaten legal action unless the company or website stops publishing the scraped information.  The following technical measures can help stop and prevent scraping by computer programs:

 

1.       Use a JavaScript calculation to make sure the visitor accessing the site is using a real web browser

2.       Use a CAPTCHA (which requires the visitor to type in a set of distorted characters)

3.       Serve content as images or Flash files, and render text through a script or style sheet

4.       Block a competitor’s IP address

5.       Install software that can detect a scraping bot

6.       Change your HTML structure, URLs, and tag names from time to time

7.       Set up fake links to fictitious content to trap the scraper
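
Several of the measures above (blocking an IP address, detecting a bot, and setting up fake links as a trap) can be combined in a small piece of server-side logic. The Python sketch below is one possible illustration, not a production tool. The class name ScraperGuard, the trap path /trap/listing-0000, and the limits of 5 requests per 10 seconds are all assumptions made for the example; a real site would tune these values and would usually rely on dedicated software.

```python
import time
from collections import defaultdict, deque

# Hypothetical trap URL: linked invisibly on pages, so only bots follow it.
HONEYPOT_PATHS = {"/trap/listing-0000"}
MAX_REQUESTS = 5        # assumed limit per window
WINDOW_SECONDS = 10.0   # assumed window length


class ScraperGuard:
    """Blocks IPs that hit a fake link or request pages too quickly."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.blocked = set()
        self.history = defaultdict(deque)  # ip -> recent request times

    def allow(self, ip, path, now=None):
        """Return True if the request should be served, False if refused."""
        now = time.monotonic() if now is None else now
        if ip in self.blocked:
            return False
        # Measure 7: a visit to fictitious content marks the client as a bot.
        if path in HONEYPOT_PATHS:
            self.blocked.add(ip)
            return False
        # Measure 5: drop timestamps older than the window, then count.
        recent = self.history[ip]
        while recent and now - recent[0] > self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Measure 4: block the offending IP address.
            self.blocked.add(ip)
            return False
        return True
```

A web server would call allow() with the client's IP address and requested path before serving each page. Keep in mind that blocking by IP address is a blunt instrument: scrapers can rotate addresses, and several legitimate visitors may share one.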

 

Lastly, a wide range of software is available that websites can use to detect scrapers.  If you’re concerned about the content of your website, consider looking into software that can deter them.
