As artificial intelligence (AI) continues to revolutionise the life sciences sector, the use of real-world evidence data to train AI models is becoming a critical component in advancing research, drug development, and healthcare solutions. However, when dealing with personal data, life sciences companies must pay close attention to two core principles of the General Data Protection Regulation (GDPR): ‘purpose limitation’ and ‘data minimisation’. These principles are not just regulatory requirements; they are fundamental to ensuring privacy compliance and fostering trust with stakeholders.
Purpose limitation: ensuring compliance in AI development
Purpose limitation is a key principle under Article 5 of the GDPR. It mandates that personal data can only be used for the specific purposes for which it was originally collected. This principle is particularly relevant when life sciences companies reuse datasets, such as clinical trial data, to develop AI models.
How to ensure compliance with ‘purpose limitation’:
- Is the new use compatible? The development of AI models often involves the reuse of data for purposes that differ from the original intent. Companies need to assess whether the new use is compatible with the original purpose.
- Conduct a compatibility test (Article 6(4) of the GDPR): this test evaluates whether the new purpose aligns with the original one. Key criteria include whether the new use is a ‘logical next step’ and whether it is ‘foreseeable’. Compatibility must always be assessed on a case-by-case basis. Where the personal data was originally collected on the basis of consent, the data should be used for the purpose(s) expressed in that consent. Therefore, a company may need to obtain new consent from the individual in order to use their data for a new purpose.
- Scientific research purposes exception: for scientific research purposes, there is a legal presumption of compatibility under Article 5(1)(b) of the GDPR. However, the processing must also satisfy the safeguards in Article 89 of the GDPR, as well as any Member State rules that set out the parameters for scientific research. Whether AI development qualifies as scientific research depends on the specific case.
- Incompatible purposes: if the new purpose is deemed incompatible, the data cannot be used for AI development without obtaining explicit consent from the data subject or collecting new data.
Key takeaway: ensure that AI development uses data in line with its original purpose or seek consent or fresh data when there is a significant change in purpose.
Data minimisation: balancing privacy with AI development needs
‘Data minimisation’, another core GDPR principle, requires that only the minimum amount of personal data necessary for the AI model's development is processed. If identification of individuals is not required, anonymised or synthetic data should be prioritised.
Effective anonymisation: a key tool for privacy
Anonymisation is a key tool for protecting privacy while training AI models. It involves removing identifiable information from datasets so that individuals cannot be re-identified. If data is truly anonymous, the GDPR does not apply. However, there has long been debate, as well as a number of court decisions, on the conditions under which effective anonymisation can be achieved. This is particularly challenging where health data is involved. Given these legal uncertainties, anonymisation remains a risk-based exercise.
How to ensure effective anonymisation:
- Remove direct identifiers: remove names, addresses, and other personal identifiers.
- Address indirect identifiers: generalise, suppress, or otherwise de-identify indirect (quasi-)identifiers such as age, gender, and medical conditions to prevent re-identification.
- Leverage advanced techniques: utilise tools like differential privacy, which adds statistical noise to data, to make re-identification more challenging.
- Demonstrate proper anonymisation: companies should document their anonymisation process, conduct regular data protection impact assessments (DPIAs), and ideally obtain third-party verification to ensure their practices meet privacy standards. Transparency in this process helps build trust and demonstrates a commitment to privacy compliance.
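To make the differential privacy point in the list above concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The participant count, the epsilon value, and the helper names are illustrative assumptions, not drawn from this article; real deployments require careful calibration and privacy-budget accounting.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon is sufficient.
    """
    return true_count + laplace_noise(scale=1.0 / epsilon, rng=rng)

# Hypothetical example: number of trial participants with a given condition.
rng = random.Random(42)
noisy = dp_count(128, epsilon=0.5, rng=rng)
```

The released `noisy` value is close to the true count but perturbed enough that no single individual's presence in the dataset can be confidently inferred; smaller epsilon values add more noise and stronger protection.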
Key takeaway: focus on using only the minimum amount of personal data necessary and employ robust anonymisation techniques to safeguard privacy. Where possible, prioritise the use of anonymised or synthetic data to reduce privacy risks.
Synthetic data: an intelligent solution to data minimisation challenges
In addition to anonymisation, the use of synthetic data presents a powerful solution to ‘data minimisation’ challenges. Synthetic data replicates the statistical characteristics of real-world data without exposing sensitive personal data. Note, however, that generating synthetic data can itself involve the processing of personal data, which requires a lawful basis under the GDPR, and whether all synthetic data qualifies as anonymous remains debatable.
Benefits of synthetic data for AI development:
- Reduce compliance burden: where synthetic data is generated so that it no longer relates to identifiable individuals, it can fall outside the scope of the GDPR, allowing for greater flexibility and speed in AI development.
- Mitigate privacy risks: fully synthetic datasets can substantially reduce the risk of re-identification, addressing concerns tied to ‘purpose limitation’ and ‘data minimisation’.
- Improve AI models: synthetic data can include a broader range of data points, improving the robustness and generalisability of AI models.
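As a rough illustration of how synthetic data can mirror the statistical shape of a real dataset, the sketch below fits a multivariate normal model to a tiny, invented cohort and samples fresh records from it. The column names, values, and the choice of a Gaussian model are assumptions made for illustration only; production pipelines use far more sophisticated generative models and must still assess re-identification risk.

```python
import numpy as np

# Hypothetical toy cohort: rows are patients, columns are (age, systolic_bp).
real = np.array([
    [54.0, 128.0], [61.0, 135.0], [47.0, 121.0],
    [70.0, 142.0], [58.0, 130.0], [65.0, 138.0],
])

# Fit a simple parametric model of the real data's statistical shape:
# column means plus the full covariance matrix, so correlations survive.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw synthetic records from the fitted distribution: no row corresponds
# to any real patient, but means and correlations are approximately preserved.
rng = np.random.default_rng(seed=0)
synthetic = rng.multivariate_normal(mean, cov, size=1000)
```

Because the generator is fitted on personal data, the fitting step itself is processing under the GDPR, which is exactly why the article notes that a lawful basis is still needed for creating synthetic data.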
Key takeaway: synthetic data offers an effective solution to both ‘purpose limitation’ and ‘data minimisation’ by enabling flexible, privacy-compliant AI model development. However, care must be taken to ensure synthetic data accurately represents real-world scenarios to avoid model bias.
Conclusion: building trust through privacy compliance
For life sciences companies, addressing ‘purpose limitation’ and ‘data minimisation’ principles from the outset is crucial in navigating regulatory complexities, safeguarding patient privacy, and driving AI innovation. By adopting strong anonymisation protocols and leveraging synthetic data, companies can develop AI models that are both effective and privacy-compliant, accelerating research and drug development without compromising ethical standards.