
Battling the Robots – Scraping Personal Information


February 2011

Have you ever wondered about those distorted onscreen words that websites ask you to type when you log in (and which are sometimes too difficult to decipher)? These are called ‘Captchas’. They act as a type of gatekeeper, or online doorman, employed to distinguish real people from automated software applications ("bots") that access websites to harvest or ‘scrape’ their content.

Not all scraping activity is unwelcome. Developers, for example, find scraping tools a useful way of copying and reusing content from their existing website when building a redesigned site. Likewise, certain aggregation services may have the agreement of different website operators to scrape pricing data for products or services and present this information to the user as a single page of results, as certain insurance comparison sites do.

Commonly, however, scraping tools are used in more intrusive ways, and increasingly to mine personal information posted on websites by consumers. Common examples of personal information extracted from websites by scraping include user contact details, email addresses and CVs. In addition, website discussion boards (or chat rooms) are increasingly being scraped to capture what people say online.

Eavesdroppers

Like the Victorian servant pressing an ear to the door of the master’s drawing room, some businesses increasingly want to be privy to our online conversations, even when these take place behind the virtual ‘closed doors’ of a private user forum. There is an interest in understanding what consumers are saying about specific products and services, and the scraping activity to harvest this information may in part reflect the common misconception that where information is in the public domain, it is somehow available for any use. This is not the case.

Last year in the US, ‘patientslikeme.com’, a health-focused social network site that enables people to share symptom and treatment information, discovered it had been targeted by a market research company. The research company had used scraping techniques to access one of the patientslikeme.com private health forums and had copied subscriber messages, which included personal information exchanged between subscribers. The overall objective of the exercise was to exploit the commercial value of the online conversations by passing them on to pharmaceutical companies. The resulting publicity was extremely negative for the market research company, and it immediately took steps to change its policies and practices.

Data protection

Whilst there have not yet been any cases or high-profile complaints in the UK, businesses looking to use automated scraping techniques to collect information about individuals should be aware that they risk breaching UK data protection law if they collect "personal data" (any information that can be used to identify a living individual) in a way that is "unfair". The Data Protection Act 1998 includes a number of factors that are used to determine whether a particular use of data is "fair". A key factor is transparency, i.e. that the person whose data is collected and used knows, when they post the information, who will access it and for what purposes. This is typically achieved by providing a privacy policy. It is hard to conceive of examples where harvesting of the types of information described above would be acceptable to users of a website or indeed to the websites themselves, and therefore such use is almost certainly unfair, and in contravention of the Data Protection Act 1998.

Another important point is that the Data Protection Act 1998 generally requires anyone handling personal data, including website operators, to keep it secure from unauthorised access. Whilst an online discussion board is clearly intended to be available to anyone who logs in to view it, it is arguable that a failure by a website operator to employ technology that can prevent bots from scraping the content would amount to a failure to comply with the security requirements under the Act.

Contravention of the Data Protection Act 1998 can lead to enforcement action by the UK data protection regulator, the Information Commissioner, and/or prosecution for criminal offences, fines of up to £500,000, and/or claims for damages by users themselves. Perhaps more significant, however, is the potential brand damage that can result where a business is caught eavesdropping on private online message boards, or performing other unauthorised data mining.

What does this all mean?

Website operators should ensure their privacy policies are clear and up to date, and should consider any available means of preventing unauthorised scraping of data from their sites (keeping up with industry practice and available technology). Businesses intending to scrape other websites should take great care to consider the risks before doing so.

Ultimately, it's worth noting that the sort of activity described in this article is in its infancy, and it remains to be seen whether the public's reaction to any increased prevalence of such scraping may actually deter businesses – at least legitimate ones who would worry about brand damage – from using such methods.

If you have any questions on this article please contact us.

Sally Annereau



A brief overview of how the use of scraping techniques to capture personal information from websites can give rise to data protection issues.
