No more Training - Is the Copyright Disaster looming for AI Developers? | HÄRTING Rechtsanwälte

The US Authors Guild has filed a class-action lawsuit in New York against ChatGPT developer OpenAI on behalf of numerous well-known authors, including Jonathan Franzen, John Grisham and George R. R. Martin. Here, you can find out what OpenAI is accused of, what arguments are put forward and what the legal situation would be in this case in Germany.

26 September 2023

The accusation

OpenAI is accused of using the authors’ copyrighted works to train the GPT-3, GPT-3.5 and GPT-4 large language models in violation of copyright law. The company is alleged to have copied e-books on a large scale, without the required licences, from illegal datasets available online in order to use them as training data for its language models. OpenAI is accused of having done this deliberately in order to have the highest-quality and most comprehensive text data possible when training the language models.

The technical background

The GPT models are based on the statistical evaluation of word and sentence contexts. This means that the contextual patterns that make up natural human language are extracted from huge amounts of data so that the language model can form “new” sentences that follow these patterns.
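The idea of forming “new” sentences from statistically evaluated word contexts can be illustrated with a deliberately tiny sketch. It is not how the GPT models are implemented: they use neural networks trained on far more data. A simple count of which word follows which (a “bigram” model) is swapped in here only to show the principle. The corpus and the function names are invented for this illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def generate(followers, start, length, seed=0):
    """Form a "new" word sequence by sampling each next word from the counts."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        options = followers.get(output[-1])
        if not options:
            break  # the last word was never followed by anything in training
        candidates, weights = zip(*options.items())
        output.append(rng.choices(candidates, weights=weights)[0])
    return output


# Invented toy corpus; real training data comprises billions of words.
corpus = "the king rode north . the king spoke . the queen rode south ."
model = train_bigram_model(corpus)
sentence = generate(model, "the", 6)
print(" ".join(sentence))
```

Every word pair in the generated sequence occurred somewhere in the training text, but the sequence as a whole need not have. This is, in miniature, the sense in which such a model produces “new” sentences from existing ones.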

A good language model requires good training data. Books are particularly suitable because they contain a large amount of error-free text. Books also provide access to diverse forms of text: the language model learns certain styles (e.g. scientific writing from non-fiction books) and forms (e.g. an inner monologue or a dialogue). For these reasons, OpenAI deliberately included books in the training data. This is well known and still common practice in training language models.

The arguments

What has not yet been clarified, however, is whether this is allowed at all. Books are the classic example of a work protected by copyright (cf. § 2 para. 1 no. 1 UrhG or Copyright Act of 1976, 17 U.S.C., § 102(a)(1)). Without permission from the author, their use is therefore only permitted within narrow limits. According to the plaintiff authors, OpenAI disregarded these limits.

To train a language model, the works must always be reproduced so that they can be included in the set of training data. As a rule, this reproduction alone is not permitted without a licence. So does that settle the matter? OpenAI will have reproduced the books, so the copyright infringement is clear, isn’t it?

Of course, it’s not that simple. The developers of language models invoke the “fair use” exception under 17 U.S.C. § 107, which, incidentally, has no counterpart in Germany. The provision names purposes such as criticism, comment, news reporting, teaching, scholarship and research as examples, but it is deliberately open-ended: it can cover any use that is “fair”. Whether a use is “fair” is determined by weighing four statutory factors: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for or value of the work.

The “fair use” exception is intended to ensure that copyright achieves its purpose: copyright protection is meant to stimulate progress in research and creativity by protecting the exploitation of their results. But since all research and art is inspired by earlier results and works, there must be an exception to copyright protection so that this progress does not stall (this is how the Supreme Court justifies the exception, see Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994), p. 575). Because the provision is open-ended, American legal practice usually relies on previous case law in “fair use” cases. Recognised examples include parody, criticism or use in school teaching.

One interesting consequence of the exception can already be seen in the corporate structure of the developers of language models. These are generally divided into a non-profit arm that does the research (here: OpenAI Inc.) and a commercial arm that markets the language model (here: OpenAI LLC). This allows the developers to argue that they are using the training data for non-commercial research purposes, which weighs in favour of “fair use”.

OpenAI will also argue that training a language model amounts to nothing more than a very extensive and effective consumption of copyrighted works, by which the models are then “inspired”. Above all, OpenAI will be able to invoke the purpose of the provision and point out that progress in arguably the most important new technology of the 21st century would slow down if copyrighted works could no longer be used for training. It is precisely this situation that the exception is intended to prevent.

However, the actual consequences of exploiting copyrighted works throw these arguments into serious doubt. The better the language models become, the more significant the impact on the authors of the works used. Already, many companies prefer faster and cheaper AI-generated alternatives to paying an artist for the same work. In the long term, therefore, as the lawsuit also argues, the author’s profession is threatened with extinction, precisely because the works of those same authors were evaluated without compensation and the language model can approximately recreate their style.

Nor is it certain whether the corporate structure will convince the courts, given the obvious commercialisation by the same group of companies. Generally speaking, everything is open in “fair use” cases until a court has decided. The exception deliberately builds on case law.

The evidence

The lawsuit relies on general evidence on the one hand and on evidence relating to specific books on the other.

In general, the lawsuit points to the known datasets of the GPT language models, which include databases titled “Books1” and “Books2”. In view of the size of these databases (over 100,000 books), the lawsuit argues, it can be assumed that the works were unlawfully downloaded from known piracy portals, if only because no comparably extensive digital book databases exist elsewhere on the internet. Moreover, independent research has shown that it is possible in principle to build extensive databases with the help of illegal portals.

Until OpenAI made a corresponding change, it is even said to have been possible to get GPT-3 to quote verbatim from books with a suitable prompt. Today, ChatGPT informs the user that it cannot quote from works protected by copyright.

With regard to the specific works, the lawsuit also seeks to provide proof through prompts. The language model was prompted to write a summary of books by certain authors, to write a summary of a hypothetical sequel and to summarise the last chapter. According to the complaint, none of this should be possible without the book being available as training data. These examples are of varying probative value. The summary of a book in particular does not seem very convincing, considering that it could also have been arrived at, legitimately, by evaluating freely available summaries on the internet. The summary of specific chapters and the hypothetical continuation are more convincing, provided that they deal in detail with events and characters not included in such summaries.

A particularly interesting approach is mentioned in connection with “Game of Thrones” by George R. R. Martin. Here the complaint refers to a study by the University of California on the “memorisation” of individual books, i.e. how well language models can “remember” the content of books. The researchers used a so-called “name cloze” procedure: the language model is asked to fill a gap in a paragraph from a book from which the name of a character has been removed. A human reader generally fails this test because the names in question often belong to minor characters that do not stick in the memory. GPT-4, however, can complete the paragraphs correctly. According to this reasoning, that is only possible if the paragraph is included verbatim in the training data, because normal language use gives no reason why the correct name from the book should be the most probable word for the gap. In other words, because a certain paragraph from a book appears in the training data, possibly several times, GPT-4 assigns a high probability to that particular name at that point in the text.
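The “name cloze” procedure described above can be sketched in a few lines. This is a simplified illustration and not the study’s actual code: the passages are invented rather than taken from any real book, the mask token and function names are assumptions, and `stub_model` is a hypothetical stand-in for a call to a real language model.

```python
import re


def make_name_cloze(passage, name, mask="[MASK]"):
    """Replace the single occurrence of a character name with a mask token."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, mask, passage)


def name_cloze_accuracy(examples, predict_name):
    """Share of passages for which the model returns the removed name."""
    hits = 0
    for passage, name in examples:
        guess = predict_name(make_name_cloze(passage, name))
        hits += guess.strip().lower() == name.lower()
    return hits / len(examples)


# Invented passages, not taken from any real book.
examples = [
    ("Then Maltrev closed the gate and said nothing more.", "Maltrev"),
    ("The letter was carried north by Ysolde before dawn.", "Ysolde"),
]


def stub_model(cloze_text):
    """Hypothetical stand-in for a language model: 'remembers' one passage."""
    return "Maltrev" if "closed the gate" in cloze_text else "John"


print(name_cloze_accuracy(examples, stub_model))
```

The argument in the complaint rests on the accuracy figure: nothing in the surrounding sentence makes one rare name more plausible than another, so a high hit rate for a given book suggests that the model has seen the passages themselves.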

What applies in Germany?

In Germany, the following applies above all: nothing is certain without judicial clarification. “Fair use” is alien to German copyright law. Instead, there are many different, narrowly defined exceptions for very specific uses of works. There is no doubt that copying constitutes a reproduction relevant under German copyright law (Section 16 UrhG). The discussion in Germany revolves primarily around the exception for text and data mining in Section 44b UrhG.

Although the exception for temporary acts of reproduction (Section 44a UrhG) is also discussed to some extent, the prevailing opinion is that, because of their effect on the performance of the language model, these acts cannot be classified as merely “accompanying” and therefore do not satisfy the requirements of that provision.

The text and data mining exception, on the other hand, serves precisely to enable progress through innovative data analysis in the private sector and therefore best fits the goals of AI developers. However, the exception only applies to works that are lawfully accessible (Section 44b (2) UrhG) and can therefore in no way justify the use of pirated books. In addition, Section 44b (3) UrhG allows rights holders to opt out by means of a machine-readable reservation of use.
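What a machine-readable reservation of use might look like in practice can be sketched with a robots.txt file. Whether a robots.txt entry satisfies Section 44b (3) UrhG is an assumption made here for illustration only and has not been settled. The crawler name `ExampleTrainingBot` and the helper `may_collect` are hypothetical; the parser itself is part of the Python standard library.

```python
from urllib.robotparser import RobotFileParser


def may_collect(robots_txt, user_agent, url):
    """Return True if robots.txt does not reserve the URL against this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical reservation: one named training crawler is excluded,
# all other agents remain allowed.
robots_txt = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

url = "https://example.com/book.html"
print(may_collect(robots_txt, "ExampleTrainingBot", url))
print(may_collect(robots_txt, "OtherBot", url))
```

A developer relying on the text and data mining exception would have to run a check of this kind before copying each work, which is only possible if the source is a lawful one that publishes such a reservation in the first place.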

Thus, if the facts are as described in the lawsuit and the books were scraped from providers of illegal reproductions, German law would undoubtedly find a copyright infringement giving rise to claims for damages. This raises further, almost insoluble follow-up questions. Copyright infringements also always trigger claims for injunctive relief and removal (Section 97 UrhG): must the results of the training then also be removed? And if so, is this technically feasible at all?

German copyright law has not yet been modernised accordingly and, apart from Section 44b UrhG, does not really offer adequate instruments for AI training. The European AI Act, in its current draft stage, is also silent on the handling of intellectual property rights. With research worldwide increasingly concentrated on improving AI models, there is an urgent need for the legislator to act here, also at the European level, in order to do justice to this special situation.

Links

To the lawsuit

To the “Memorisation” study