Using and training generative AI tools – IP ownership and infringement issues

Individuals and businesses are increasingly using generative AI tools to perform a range of functions, from improving operational efficiency to generating content that is monetised as part of their products and services. As the use of generative AI proliferates, so too do associated legal disputes. Getty Images has started proceedings against Stability AI in the US and UK alleging that AI image generation tool Stable Diffusion infringes its copyright and trade marks. GitHub, Microsoft and OpenAI are fighting a class action in the US relating to the GitHub Copilot tool's generation and attribution of open source software. This article examines the IP ownership and infringement implications of the use and provision of generative AI services, and provides key points to consider when training or using generative AI tools.

How generative AI works

At a general level, sophisticated generative AI models are trained on large datasets (eg of text, images, music, or software) and learn to recognise highly nuanced patterns and relationships within the training data. In theory they should not memorise the training data (eg strings of text or images), but only relational principles present in the data. When responding to a prompt, the model uses interpolation to generate a response based on its learned relational principles. In theory, the output should be wholly new, generated from scratch by the model. In practice, it can in some instances be similar or identical to material present in the training data.

Ownership issues

Ownership of the AI program

Many stakeholders are involved in the creation of an AI system (eg programmers, data suppliers, trainers, feedback providers, investors funding creation of the system, and system operators). In many cases, the AI system software programmers will likely be seen as the first authors of the copyright in the AI system as a computer program, but questions of co-authorship could arise depending on the contributions of other individuals involved.

Ownership of the AI-generated output

Ownership of copyright in an AI program may not automatically result in copyright ownership over the future output created by the AI system (assuming copyright subsists in that output). Ultimately, who owns any copyright in the AI-generated output is a complex question and still unsettled in the UK.

The UK currently provides copyright protection over computer-generated works (CGW), ie works generated by a computer in circumstances where there is no human author of the work, subject to originality requirements. The author for copyright ownership purposes is the person by whom the arrangements necessary for the creation of the work were undertaken.

It is unclear whether AI-generated works fall firmly within the definition of CGWs. The government acknowledged this uncertainty and sought to address it through its consultations on AI and copyright in 2021 and 2022. It ultimately decided not to table immediate changes to the law regarding authorship because the use of AI is still at a nascent stage. It is possible the growing mainstream appeal of generative AI like ChatGPT could compel a reconsideration of current laws. For now, in the absence of any clear alternative, the best analysis in most cases is likely to be that AI-generated works fall within the definition of CGW and are protected by copyright on that basis.

Who precisely is the "person by whom the arrangements necessary for the creation of the work were undertaken" in the context of AI-generated output is not clear cut.

For corporate users who invest in the development of their training model to generate works for their own purposes and employ the software developers who create the code (eg a games developer using an AI system to generate images for its virtual world), it is more straightforward to identify them as having made the necessary arrangements.

Where there is arguably more than one "contributor" to the arrangements necessary for the work's creation, eg where users provide the inputs from which a publicly available system creates works, the question is much less straightforward. Here, the user could, depending on their contribution, also claim authorship.

The answer may depend on the level of creativity contributed by the AI creator and the user. Where the AI for the most part automatically produces the generated works creatively and independently, with minimal human intervention (beyond inputting simple prompts), the output may more likely be owned by the AI creator. Where the generation of the new works is AI-assisted (ie AI functions as a tool to enable a user to achieve a particular result) and considerably more human intervention is necessary, the output may more likely be owned by the user.

Regardless of the default legal position, ownership of AI-generated outputs is, in practice, often determined by the AI service provider's terms and conditions. Conflict of laws considerations can also play a part in the analysis where the AI user and AI service provider are located in different jurisdictions.

IP infringement

The training and use of generative AI models can give rise to a variety of IP infringement risks at different stages of the process:

Stage 1: Obtaining the training data to train the AI

If the training data is obtained from unlicensed sources, eg by scraping, and consists of copyright-protected works (eg artworks, music, videos, or strings of text), the copying/reproduction of the data without a licence can infringe copyright. There can also be infringement of database rights if the data is extracted from a protected database.

Potential "non-commercial use" exception: Exceptions to copyright and database rights infringement may exist if the use is for non-commercial research or non-commercial text and data mining. However, any research that is used for a purpose which has some commercial value would not benefit from these exceptions.

Proposed text and data mining exception: The government has decided not to proceed with a proposed new copyright and database right exception, which would have permitted text and data mining for any purpose (including commercial ones).

Stage 2: Training process

During the training process, the training data may be stored, potentially in different formats, for the duration of the training, which may take several months. The making of new copies of the data for these purposes, including in encoded or compressed forms, without a licence could constitute further acts of copyright infringement.

The AI model is also likely to make temporary copies of the training data in its own memory while 'reading' it, which may also constitute copyright infringement.

Potential temporary copies defence: This defence applies to the making of copies that are transient or incidental and an integral and essential part of a technological process, the sole purpose of which is to enable a lawful use of the copyright work and which have no independent economic significance. This exception has not been tried in the UK courts in relation to the AI training process but could potentially apply depending on the circumstances.

Stage 3: Storing 'learned information'

If the AI creates and stores copies of the training data or parts of it, including as compressed versions of the original, rather than merely learning relational principles present in the training data, this could amount to another act of copyright and/or database rights infringement.

Even if an AI model stores information in an abstract form, if the neural network is capable of reproducing a substantial part of the original creative elements of a copyright work in the training dataset, this may amount to copying in a manner analogous to storing content in a compressed file format. This can occur as a result of the way the AI is trained or as a result of a poor-quality training dataset, eg one that has not been sufficiently de-duplicated.
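By way of illustration only, the de-duplication step mentioned above could look something like the following minimal Python sketch. It assumes the training data is a simple list of text records, and the function names (normalise, deduplicate) are hypothetical. It removes only exact repeats after basic normalisation; real pipelines typically also need near-duplicate detection, and de-duplication of this kind is a data quality measure rather than a legal safeguard in itself.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each normalised record, preserving order.

    Assumption: records are plain-text training items. Images, audio or
    code would need a different notion of "the same item".
    """
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        digest = hashlib.sha256(normalise(record).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


if __name__ == "__main__":
    sample = ["A red fox.", "a  red   fox.", "A blue whale."]
    print(deduplicate(sample))
```

Run on the sample list, the second record is dropped because it differs from the first only in case and spacing.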

Stage 4: Generation of AI output

If the AI-generated works replicate a "substantial part" of a copyright work contained in the training data, or a "substantial part" of a database from which training data was obtained, the creation and use of those works could amount to copyright and database rights infringement. This can occur if the AI essentially assembles a digital collage of copied parts of the training data into a "new" work, or where the model has 'memorised' the training data rather than learning more generalised principles from it.

There can potentially be copyright infringement even if no part of the original training data is replicated exactly but the AI-generated work as a whole gives a sufficiently similar impression to a work included in the training data (a non-literal copy).

There may also be a risk of trade mark infringement and/or passing off if the AI-generated output includes trade marks (eg logos) in a manner that gives rise to a likelihood of confusion or may lead the public to believe the output is somehow associated with or endorsed by the trade mark owner.

Stage 5: Onward use of AI-generated works

If an infringing AI-generated work is subsequently used, eg by offering copies for sale, posting online, or performing in public, this could amount to further acts of IP infringement.

Key considerations when using and providing generative AI services

Using AI services

What do the AI service's terms and conditions say about the rights the service provider has to use the content you have provided when using the service?
As a general rule, avoid including any confidential or sensitive information in the content you input into the system.
What do the AI service's terms and conditions say about whether you own the copyright in the output generated?
What do they say about your rights to use the output? Are there any restrictions on further use eg for commercial purposes?
Be aware that AI-generated output can infringe third party IP rights and your onward use of the output may constitute a new act of infringement.
Does the output consist of creative material, or material whose content is functional, generic or commonplace?
Do you know what materials the tool was trained on and whether the provider had a licence to use them?
Consider the practical risks arising from your proposed uses of the output (eg are you copying the output wholesale or making substantial changes, which media/channels are you making the output available on, are they public, and are they part of a commercial offering?).
If you are offering access to a third party AI service, do you have rights from the third party to do so, have you made clear that the service is provided by a third party, and would any other disclaimers be appropriate?

Providing AI services

Where is the training data being sourced from, is it of a nature that would be protected by IP rights, and do you have a relevant licence to use it?
What is the quality of the training data, eg will it contain duplicates of the same data, does it feature third party trade marks?
When obtaining and using training data, are you able to rely on any of the non-commercial use exceptions to copyright infringement?
How is the training data being stored and for how long? Are you able to rely on the temporary copies exception to copyright infringement?
What is the AI model trained to 'learn', eg does it learn relational principles or does it 'memorise' training data?
How does it generate output, eg does it combine content it has memorised from the training data or generate new content based on statistical regularities?
If the AI model generates new content, have you kept records that would allow you to explain this process, eg to a court or regulator?
Have you tested the AI-generated outputs against the training dataset to detect any identical or highly similar outputs and how often they arise?
Consider the practical risks arising from the nature of the AI service – is it used internally or made available publicly and/or for a fee?
Have all actors involved in the creation of the AI system (particularly the writers of the software code) executed assignments of the relevant rights to you?
What do the user-facing terms and conditions of the AI service say about ownership of outputs and your ability to use users' inputs to the system for further training?
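As a rough illustration of the testing point above (checking outputs against the training dataset for identical or highly similar results), the following Python sketch flags text outputs that share a high proportion of word sequences with any training item. It is a simplified sketch under stated assumptions: the function names, the five-word window and the 0.5 threshold are all hypothetical choices, it handles text only, and a high or low score is not a measure of whether a "substantial part" has been copied in the legal sense.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, training_item: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the training item."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0  # output shorter than n words: nothing to compare
    return len(out_grams & word_ngrams(training_item, n)) / len(out_grams)


def flag_similar(outputs: list[str], training_items: list[str],
                 threshold: float = 0.5, n: int = 5) -> list[tuple[int, int, float]]:
    """List (output index, training index, ratio) for pairs at or above threshold.

    Assumption: the threshold is an arbitrary screening value chosen for
    illustration, to be tuned and followed by human review.
    """
    flagged = []
    for i, out in enumerate(outputs):
        for j, item in enumerate(training_items):
            ratio = overlap_ratio(out, item, n)
            if ratio >= threshold:
                flagged.append((i, j, ratio))
    return flagged


if __name__ == "__main__":
    training = ["the quick brown fox jumps over the lazy dog"]
    outputs = [
        "the quick brown fox jumps over the lazy dog",
        "an entirely different sentence about copyright law in practice",
    ]
    print(flag_similar(outputs, training))
```

Comparing every output with every training item does not scale to large datasets, so a production check would need indexing or sampling. Keeping a log of how often outputs are flagged may also help with the record-keeping point in the list above.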

Proceed carefully

Given the complexities of generative AI systems, and because the application of IP laws to their output is in some cases uncertain or untested, it's important to consider the issues carefully and remain aware of incoming laws and relevant court decisions in this space.