Battling the Robots – Scraping Personal Information
Have you ever wondered about those distorted onscreen words that websites ask you to type when you log in (and which are sometimes too difficult to decipher)? These are called ‘Captchas’. They act as a gatekeeper, or online doorman, distinguishing real people from automated software applications (‘bots’) that try to access websites and harvest, or ‘scrape’, content from them.
Not all scraping activity is unwelcome. Developers, for example, find scraping tools a useful way of copying and reusing content from their own website when redesigning it. Likewise, certain aggregation services may have the agreement of different website operators to scrape pricing data for products or services and re-present this information to the user as a single page of results, as certain insurance comparison sites do.
Commonly, however, scraping tools are used in more intrusive ways, and increasingly to mine personal information posted on websites by consumers. Common examples of personal information extracted from websites by scraping include user contact details, email addresses and CVs. In addition, website discussion boards (or chat rooms) are increasingly being scraped to capture what people say online.
Like the Victorian servant pressing an ear to the door of the master’s drawing room, some businesses increasingly want to be privy to our online conversations, even when these take place behind the virtual ‘closed doors’ of a private user forum. There is an interest in understanding what consumers are saying about specific products and services, and the scraping activity to harvest this information may in part reflect the common misconception that where information is in the public domain, it is somehow available for any use. This is not the case.
Last year in the US, ‘patientslikeme.com’, a health-focused social network site that enables people to share symptom and treatment information, discovered it had been targeted by a market research company. The research company had used scraping techniques to access one of the patientslikeme.com private health forums and had copied subscriber messages, which included personal information exchanged between subscribers. The overall objective of the exercise was to capture the commercial value of the online conversations for onward use by pharmaceutical companies. The resulting publicity was extremely negative for the market research company, and it immediately took steps to change its policies and practices.
Data protection laws
The Data Protection Act 1998 generally requires anyone handling personal data, including website operators, to keep it secure from unauthorised access. Whilst an online discussion board is clearly intended to be available to anyone who logs in to view it, it is arguable that a website operator’s failure to employ technology that can prevent bots from scraping the content would amount to a failure to comply with the security requirements under the Act.
Contravention of the Data Protection Act 1998 can lead to enforcement action by the UK data protection regulator, the Information Commissioner, and/or criminal prosecution, fines of up to £500,000, and/or claims for damages by users themselves. Perhaps more significant, however, is the potential brand damage that can result where a business is caught eavesdropping on private online message boards, or performing other unauthorised data mining.
What does this all mean?
Website operators should ensure their privacy policies are clear and up to date, and should consider any available means of preventing unauthorised scraping of data from their sites (keeping up with industry practice and available technology). Businesses intending to scrape other websites should take great care to consider the risks before doing so.
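By way of illustration only, one commonly used technical measure alongside Captchas is to limit how many pages a single visitor can request in a short period, since bots tend to request pages far faster than people do. The sketch below is a minimal example under stated assumptions, not a recommended or complete defence: the class name SlidingWindowLimiter, the client identifier and the default thresholds (30 requests per 60 seconds) are all hypothetical and are not drawn from any particular product or from the Act.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most `max_requests` per client
    within any `window_seconds` period. Thresholds are illustrative."""

    def __init__(self, max_requests=30, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behaviour can be checked
        self._hits = defaultdict(deque)  # client id -> times of recent requests

    def allow(self, client_id):
        """Return True if the request may proceed, False if it exceeds the limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Discard requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests from this client
        hits.append(now)
        return True
```

A site might call allow() with an account name or IP address on each page request and, where it returns False, present a Captcha rather than refuse access outright. A determined scraper can spread requests across many addresses, so a measure of this kind is at most one part of keeping up with industry practice.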
Ultimately, it's worth noting that the sort of activity described in this article is in its infancy, and it remains to be seen whether the public's reaction to any increased prevalence of such scraping may actually deter businesses – at least legitimate ones who would worry about brand damage – from using such methods.