6 February 2025
Businesses aiming to train new AI applications must navigate a complex legal landscape. One important part of this landscape is data protection law, such as the EU’s General Data Protection Regulation (GDPR), which imposes strict requirements on how personal data can be collected and used. In this article, we explore some key legal considerations that businesses must address when scraping and processing personal data for AI training purposes.
To determine whether European data protection rules must be observed, companies must first assess whether their intended operations fall within the scope of the relevant data protection laws. Depending on the services provided, the GDPR can apply even to non-EU entities.
The GDPR, like many other data protection laws, generally applies only to the processing of personal data, ie information relating to an identified or identifiable individual. Non-personal or anonymised data, on the other hand, does not fall within the GDPR’s scope of application. Businesses should, therefore, assess whether relying solely on non-personal data constitutes a viable option. Data is deemed non-personal (anonymised) when the individual can no longer be identified, taking into account 'all the means reasonably likely to be used'.
Once it is established that AI training operations fall within the GDPR’s ambit, all data protection rules and principles generally apply. This includes the need to establish a legal basis for web scraping or other data-collection practices. In very rare cases, businesses may be able to scrape or collect training data based on the individuals’ consent, for instance, where a business intends to use data exclusively from its own users instead of scraping data from other websites. Often, however, data collection will rely on the ground of 'legitimate interest', which allows processing where the individuals’ interests in not having their data collected do not override the company’s interests in collecting it.
Guidance from several European data protection authorities has highlighted that technical safeguards can help tip the scales in businesses’ favour and ensure the lawfulness of the data collection. Where possible, businesses should, therefore, define precise collection criteria and apply filters to exclude unnecessary data categories, especially sensitive data such as health data. Companies should also ensure that they do not scrape data from websites that have explicitly objected to web scraping, eg via robots.txt or ai.txt files.
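By way of illustration, the following Python sketch shows what such collection-stage safeguards can look like in practice: a robots.txt check before fetching a page, and a simple keyword filter for sensitive content. The user agent, the term list and the error handling are illustrative assumptions, not a complete or legally sufficient implementation (there is, notably, no standard-library support for ai.txt).

```python
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "example-training-bot"  # hypothetical crawler name

# Illustrative exclusion terms for sensitive categories (Art 9 GDPR);
# a real filter would need to be far more comprehensive.
SENSITIVE_TERMS = ("diagnosis", "prescription", "medical record")

def scraping_allowed(url: str) -> bool:
    """Check the target site's robots.txt before fetching a page."""
    root = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return False  # robots.txt unreachable: err on the side of caution
    return parser.can_fetch(USER_AGENT, url)

def passes_content_filter(text: str) -> bool:
    """Drop documents that match sensitive-category terms."""
    lowered = text.lower()
    return not any(term in lowered for term in SENSITIVE_TERMS)
```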
Even during the collection stage, companies must adhere to the GDPR’s data protection principles, including transparency, purpose limitation, data minimisation, accuracy, integrity and confidentiality.
Transparency, for instance, requires providing clear and easily accessible information about ongoing data collection operations (including the purposes of the processing, the categories of personal data concerned and the sources from which the data originates). It also requires informing individuals about their data subject rights, such as the right to request access to their personal data, the rights to rectification and erasure, and the right to object to the collection and processing of their data. The right to object means that companies may need to provide appropriate opt-out mechanisms through which individuals can ensure that their personal data is either not scraped or promptly deleted. Finally, the principles of purpose limitation, data minimisation and accuracy necessitate a thorough analysis of why the collected data is necessary for the intended purposes and how its collection supports the AI model’s accurate and reliable performance.
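A minimal sketch of such an opt-out check at collection time, assuming the business keeps a registry of objections (eg email addresses or domains submitted through an opt-out form); the registry format and matching logic are illustrative assumptions:

```python
# Hypothetical opt-out registry populated from objection/opt-out requests.
OPT_OUT_REGISTRY = {"jane.doe@example.com", "opted-out-site.example"}

def respects_opt_outs(source_domain: str, extracted_emails: set[str]) -> bool:
    """Skip records from opted-out domains or containing identifiers
    of individuals who have objected to the processing."""
    if source_domain in OPT_OUT_REGISTRY:
        return False
    return OPT_OUT_REGISTRY.isdisjoint(extracted_emails)

# Usage: only keep a scraped record if it passes the opt-out check.
keep = respects_opt_outs("news.example", {"jane.doe@example.com"})  # False
```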
Where companies have implemented these safeguards, data protection authorities have approved data scraping operations in the past (eg Meta’s scraping of user content based on legitimate interest). Note, however, that some EU data protection authorities, such as the Dutch Autoriteit Persoonsgegevens, take a very restrictive stance on web scraping, arguing that purely commercial interests cannot justify it at all; rather, companies would need to cite an interest specifically protected by law, such as the exercise of the right to freedom of information.
Last but not least, to the extent that third parties are involved in collecting or using the data, appropriate data protection agreements may need to be concluded.
The GDPR’s requirements extend beyond the scraping or collection of training data; its principles and rules must also be observed during the AI training phase. To minimise the impact on individuals, companies should continually review the relevance of the data they have collected, assess its necessity for model training and promptly delete data identified as irrelevant. Furthermore, they should de-identify personal data as early as possible and consider replacing real data with synthetic data where feasible. To reduce the risk of inadvertent personal data disclosure, models should also be tested and evaluated for unintended data memorisation.
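As an illustration of early de-identification, the sketch below masks obvious direct identifiers (emails, phone numbers) before text enters a training pipeline. The regexes are deliberately simplistic assumptions; production systems typically combine pattern matching with NER-based PII detection:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")

def deidentify(text: str) -> str:
    """Replace direct identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(deidentify("Reach me at jane.doe@example.com or +49 30 1234567."))
# -> "Reach me at [EMAIL] or [PHONE]."
```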
Moreover, businesses must continue to ensure transparency about the methods and purposes of their AI training operations and facilitate the exercise of data subjects’ rights. Implementing these rights can be challenging, so businesses should lay down internal procedures governing their exercise. For example, respecting the right of access means that companies must be ready to send concerned individuals training data extracts, together with associated annotations and metadata, in an easily understandable format. To comply with the rights to object and to erasure, AI developers should provide individuals with 'personal data removal request forms' or maintain 'black lists' that ensure personal data is deleted from training databases. Where AI systems output personal data, so-called 'machine unlearning' strategies may need to be adopted to make the models 'forget' that data. Businesses should also consider carrying out a data protection impact assessment if their operations are likely to pose high risks to the rights and freedoms of individuals.
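A minimal sketch of such a 'black list' erasure sweep over a stored training corpus, assuming each record carries the identifiers detected in it at ingestion time; the record schema is an illustrative assumption, and already-trained models would additionally call for retraining or unlearning:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str
    identifiers: frozenset  # eg emails detected at ingestion time

def apply_erasure_black_list(corpus: list[Record],
                             black_list: set[str]) -> list[Record]:
    """Drop every record containing a black-listed identifier."""
    return [r for r in corpus if black_list.isdisjoint(r.identifiers)]

corpus = [Record("...", frozenset({"jane.doe@example.com"})),
          Record("...", frozenset())]
cleaned = apply_erasure_black_list(corpus, {"jane.doe@example.com"})
# cleaned now contains only the second record
```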
On 12 July 2024, the EU’s new AI Act was published in the Official Journal. It establishes a comprehensive regulatory framework for – inter alia – providers of certain AI systems and models and also creates new duties when training AI, including duties on 'data and data governance'. Providers of so-called 'high-risk' AI systems must implement appropriate data governance practices, covering design choices, data collection processes, data preparation (eg cleaning, updating, enrichment), quality assessments, bias mitigation and the identification of data gaps. Training, validation and testing data sets must be relevant, sufficiently representative and, to the best extent possible, free of errors and complete. Covered businesses will need to comply with these duties from 2 August 2026.
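As a sketch of what a pre-training data-governance check in the spirit of these duties might look like, the function below flags incomplete rows and under-represented groups in a labelled data set. The threshold and grouping column are illustrative assumptions; what counts as 'sufficiently representative' remains a case-by-case legal and technical question:

```python
from collections import Counter

def governance_report(rows: list[dict], group_key: str,
                      min_share: float = 0.1) -> dict:
    """Flag incomplete rows and under-represented groups (illustrative)."""
    incomplete = sum(1 for r in rows if any(v is None for v in r.values()))
    counts = Counter(r.get(group_key) for r in rows
                     if r.get(group_key) is not None)
    total = sum(counts.values())
    under = [g for g, c in counts.items() if c / total < min_share]
    return {"incomplete_rows": incomplete, "under_represented_groups": under}
```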
Businesses intending to train an AI application on personal data have to be aware of a number of key regulatory requirements: