
From software development to e-commerce, marketing and consulting services: the use of (generative) artificial intelligence (henceforth "AI") is penetrating every conceivable industry. From a legal perspective, this raises many questions, one of the most relevant being data protection. The European data protection authorities impose severe fines running into the millions for violations of the GDPR, so the pressure on companies to make their offerings legally compliant is correspondingly high.

With the introduction of GPTs, OpenAI (the company behind ChatGPT) has made the development of custom Large Language Models (LLMs) based on ChatGPT extremely accessible, even without programming knowledge, making data protection issues all the more relevant. In the following, we therefore provide an overview of the data protection aspects that should be taken into account when using AI and LLMs, presenting practical problems and their solutions.

Applicability of the GDPR

The GDPR does not mention AI explicitly. However, Recital 15 of the GDPR clarifies that data protection should be "technologically neutral", i.e. it should also cover new technologies. This is reflected in the wording of the articles of the GDPR: AI can be read into Art. 22 GDPR as a system for automated decision-making, and Art. 35 (1) sentence 1 GDPR refers to "using new technologies".

The decisive factor is whether personal data is processed using the technology in question. Art. 4 No. 1 GDPR defines personal data as "any information relating to an identified or identifiable natural person ('data subject')". This definition is to be understood broadly: data that can be linked to a specific person by virtue of its content, purpose or effect makes that person identifiable. Such information ranges from a real name to a pseudonym to information about an individual's ability to pay. So-called pseudonymous data, such as a customer number, can also be linked to an individual by using additional information and therefore also falls within the scope of the GDPR.
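To illustrate why pseudonymous data remains personal data, consider the following minimal sketch (the key material and customer IDs are purely hypothetical): a keyed hash turns a customer number into a pseudonym that reveals nothing on its own, yet anyone holding the key or a lookup table can re-link it to the individual.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-securely-stored-key"  # hypothetical key material

def pseudonymise(customer_id: str) -> str:
    """Derive a stable pseudonym from a customer ID via a keyed hash."""
    return hmac.new(SECRET_KEY, customer_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymise("C-1234")  # the token alone identifies nobody

# ...but combined with "additional information" (here: a lookup table),
# the person behind it becomes identifiable again, so the GDPR still applies.
lookup = {pseudonymise(cid): cid for cid in ("C-1234", "C-5678")}
assert lookup[token] == "C-1234"
```

This is precisely the point of the broad definition above: as long as someone, somewhere, can restore the link, the data is pseudonymous rather than anonymous.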

If the data does not relate to a person, the scope of the GDPR does not apply, Art. 2 (1) GDPR.

Type and scope of personal data processed when using AI

AI can be used in a variety of different ways. Currently, the most widespread form of generative AI is the chatbot, which reacts interactively to user input and thus enables contextualised responses and conversations. In the following, we will therefore use chatbots as an example to show what personal data can be processed by AI and how.

Before a self-developed chatbot can be used for interactive correspondence, it must first be trained on very large data sets in order to recognise correlations and make predictions. The data is usually audio, video, image or text data. It is in the nature of things that such data may also contain personal references.

This raises the question of where developers gather the necessary training data. The following sources are conceivable:

  • Own data sets
  • Data from freely accessible sources
  • External providers of training data sets

After completion of the training, the AI is ready for use. In the case of a chatbot, this means that users can enter information on a specific issue in a dialogue box.

An example:

"Can I enter into a mobile phone contract with you if I am a minor?"

The situational conversation may also contain pseudonymous or directly personal data:

"My contract number is 1234, I would like to change my address from Main Road 1 to Side Street 2 as of 1 January 2024. I would also like to change my family name in my profile from Meyer to Mueller."

The algorithm behind the chatbot categorises this information, processes it further and, if necessary, changes existing data with the help of other programmes. The AI then makes a decision regarding the reaction to the input and generates a contextualised response as output:

"To conclude a mobile phone contract with us, you must have reached the age of 18 and therefore be legally of age."


"Hello Mrs Mueller, I have changed your name and made a reservation for an automatic change of address to Side Street 2 as of 1 January 2024."

The chatbot can also be trained or pre-programmed to recognise user concerns pre-emptively based on their input and profile. It can then anticipate that the person in the chat window may, in addition to changing their address, also want to have their energy requirements recalculated, or, if the new address is in an area with many electric cars, that a customised offer could be made in this regard.

Particular caution is required in cases where so-called special categories of personal data (‘sensitive data’) are to be processed. This includes information such as sexual orientation, ethnic origin, political opinions, religious or philosophical beliefs, see Art. 9 (1) GDPR.

Legal Basis for Data Processing

The processing of personal data always requires a legal basis. For personal data, Art. 6 GDPR is applicable; for special categories of personal data, the authorisation provisions of Art. 9 (2) GDPR apply.

The most relevant bases will be the consent of the data subject pursuant to Art. 6 (1) (a) GDPR (or, in the case of special categories of personal data, Art. 9 (2) (a) GDPR) and the legitimate interest pursuant to Art. 6 (1) (f) GDPR. Where data is processed with AI within a contractual relationship with the data subjects, Art. 6 (1) (b) GDPR also comes into consideration.

In addition to the GDPR, other legal bases regulated in national data protection laws – such as the German Federal Data Protection Act (“Bundesdatenschutzgesetz”, BDSG) – come into play. This applies in particular to the processing of personal data in the context of public administration and the judiciary (e.g. B2G), for employee data and for scientific research.

In the following, we break down the relevant legal bases based on the respective purposes of processing in the context of the use of AI. Essentially, a distinction must be made between training data and the use of AI through user input.

Training Data

The legal basis for the processing of personal training data depends in particular on the origin of the data.

Training data can be collected specifically for that purpose. In practice, this constellation will be rather unusual. If the data subject participates voluntarily, the legal basis here is Art. 6 (1) (f) GDPR; alternatively, the processing can be based on consent in accordance with Art. 6 (1) (a) GDPR.

More interesting are the cases in which existing data or data collected through scraping is to be further processed.

Problem: Change of purpose for existing data

The further processing of existing data raises the question of the so-called change of purpose. If companies want to train their own generative AI, they must consider the following problem:

Existing data was originally collected for a purpose other than training the AI. However, according to the purpose limitation principle in Art. 5 (1) (b) GDPR, data may only be processed for the purpose for which it was initially collected. If it is to be processed for a different purpose, this must be measured against the standards of Art. 6 (4) GDPR: the new purpose must be compatible with the initial one. Such compatibility would be difficult, if not impossible, to justify for the use of personal data to train an AI.

Problem: Data gathered through scraping

When processing data from other sources, the question arises as to whether it was originally collected lawfully and on what basis it is now being further processed for AI training. The processor of the data almost certainly does not know the data subjects, as the amount of data involved is almost inconceivably large. Obtaining consent in accordance with Art. 6 (1) (a) GDPR is therefore generally not possible, leaving the legitimate interest pursuant to Art. 6 (1) (f) GDPR as the only conceivable legal basis. A balancing of interests must be carried out here, which must in particular take into account the purpose of the planned processing and the question of whether the individuals whose data is used for training can be re-identified with little effort.

Problem: Sensitive data?

Training data may also contain sensitive information within the meaning of Art. 9 GDPR. This can already begin with images of people wearing glasses and is of course particularly relevant when training AI for the medical sector, where AI can be used for the early diagnosis of a variety of diseases based on image recognition. In addition to consent, Art. 9 (2) GDPR also provides for other legal bases; in individual cases, data processing for the purpose of research may also be privileged (see the EDPB Guidelines on the processing of data concerning health for the purpose of scientific research, from para. 23). The so-called "Petersberger Erklärung" of the German data protection conference ("Datenschutzkonferenz", DSK) also deals with this fundamental issue, among others.

Solution: Anonymised or aggregated data?

A possible solution to these problems is the use of anonymised data or aggregated data.

The data is anonymised if the reference to a specific individual can no longer be restored, or can only be restored with disproportionate effort. As it can reasonably be argued that such anonymised data is no longer personal data within the meaning of Art. 4 No. 1 GDPR, the GDPR does not apply in this case and the anonymised data can be processed.

It is also possible to use aggregated data for training. When data is aggregated, groups of cases in the active data set are combined into individual records, which are then saved as a separate aggregated file. The personal reference may then be missing, as the aggregated data refers not to a specific individual but to a group of people. Like the use of anonymised data, the use of this type of aggregated data does not fall within the scope of the GDPR.
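As a rough technical illustration of the two approaches, here is a minimal sketch (the records and field names are invented; whether a real data set meets the GDPR's threshold for anonymisation always requires a case-by-case assessment, and dropping names alone is often not enough):

```python
from collections import Counter

records = [
    {"name": "Meyer",   "city": "Berlin",  "age": 34},
    {"name": "Mueller", "city": "Berlin",  "age": 41},
    {"name": "Schmidt", "city": "Hamburg", "age": 29},
]

# Anonymisation: remove direct identifiers and coarsen quasi-identifiers
# (here: exact age -> age band) so records cannot be traced back to individuals.
anonymised = [
    {"city": r["city"], "age_band": f"{r['age'] // 10 * 10}s"} for r in records
]

# Aggregation: collapse groups of cases into summary records that refer
# to a group of people rather than to any specific individual.
aggregated = Counter(r["city"] for r in records)

print(anonymised[0])  # {'city': 'Berlin', 'age_band': '30s'}
print(aggregated)     # Counter({'Berlin': 2, 'Hamburg': 1})
```

How coarse the bands or groups must be is a risk decision, not a coding decision: the smaller the group behind a record, the easier re-identification becomes.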

Note, however, that according to the prevailing view, both the anonymisation of data and the creation of aggregated data sets themselves constitute processing of personal data, meaning that a legal basis is required for these steps. Here, the legitimate interest comes into consideration, whereby the balancing of interests plays a decisive role in determining the legality of this processing.

User input (dialogue box)

If users are enabled to interact with the AI, e.g. through a chat feature or, in the case of an image AI, by uploading photos, the processing of personal data cannot be ruled out. If such processing is intended, e.g. in a customer support chat, the legal basis is the contract between the parties in accordance with Art. 6 (1) (b) GDPR. If sensitive information is shared, Art. 9 GDPR may apply and consent may thus be required.

Problem: Forced data

It becomes problematic if the entry of (certain) personal data is not intended. Even if a warning prohibits the input of personal data, particularly sensitive data, or non-contractual data, users may still enter such "non-purposeful" data. These cases of so-called "imposed controllership" nevertheless lead to the providers of generative AI being controllers within the meaning of Art. 4 No. 7 GDPR, because this data is in fact processed once it is submitted. For such forced data, the legitimate interest pursuant to Art. 6 (1) (f) GDPR can be considered as the legal basis. As part of the balancing of interests, it can be argued that the information was, despite the context, entered voluntarily and thus "imposed" on the provider. However, this reasoning does not work for health data: its processing would have to be terminated and the imposed information deleted.
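One practical mitigation is to filter obviously personal data out of user input before it reaches the model. The following sketch is deliberately simplistic and its patterns are illustrative only; reliable PII detection in production needs far more than two regular expressions (e.g. named-entity recognition, dictionaries, human review):

```python
import re

# Illustrative patterns only -- real-world PII detection is much harder.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d /-]{7,}\d"),
}

def redact(user_input: str) -> str:
    """Replace recognisable personal data before the text is processed further."""
    for label, pattern in PII_PATTERNS.items():
        user_input = pattern.sub(f"[{label} removed]", user_input)
    return user_input

print(redact("Contact me at max.meyer@example.com or +49 30 1234567."))
# -> Contact me at [email removed] or [phone removed].
```

Such a filter reduces, but does not eliminate, the amount of forced data that is actually processed; it does not replace the legal analysis above.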

Problem: Subsequent processing of (forced) data

In addition, it is possible that the AI continues to train on the information received via the dialogue window. This subsequent processing requires its own legal basis. Consent in accordance with Art. 6 (1) (a) GDPR naturally comes into consideration here, as the data was given freely. A legitimate interest may also be arguable, although the aspects mentioned above must not be disregarded in the balancing of interests to be carried out.

Solution: Comprehensive prior information and consent

It is highly likely that it will be some time before the authorities or courts make a decision on this issue. Until then, it seems practical to ensure that the data subjects are fully informed and consent is obtained before AI is used. Depending on the purpose and type of AI, the legitimate interest also appears to be justifiable. However, a final categorisation requires a comprehensive examination in each individual case.

Rights of the Data Subjects and their Enforceability; Data Security

If personal data about a person is processed, this person is entitled to so-called data subject rights vis-à-vis the controller.

The implementation of data subject rights is already a challenge for conventionally organised companies. It becomes all the more complicated when the training or use of AI is being considered or is already being practised.

The main obligations of the controller include, among others:

  • the obligation to provide information about data processing and the personal data stored by the company (Art. 15 GDPR),
  • the obligation to
    • rectify (Art. 16 GDPR),
    • delete (Art. 17 GDPR) and
    • restrict processing of (Art. 18 GDPR) the personal data.

AI-based data processing will typically also constitute so-called profiling in accordance with Art. 4 No. 4 GDPR. This has an effect in particular on the right of access (Art. 15 (1) (h) GDPR).

Problem: Individualisation of the data subjects in the context of big data

A particular problem is the enforceability and fulfilment of data subjects' rights in the context of big data. Due to the sheer mass of processed data, it is not always possible to trace which personal data is being processed. As a result, it may not be possible to effectively identify data subjects and thus to fulfil their deletion requests. Moreover, the individualised treatment of data records in a trained AI is somewhat at odds with the nature of AI. This problem area already existed before the discussion surrounding AI, with regard to large databases and other variants of big data.

Problem: Black box

An inherent problem in the field of AI arises from the fact that the decision-making process of AI cannot be fully understood due to its high technical complexity and therefore cannot be explained in a comprehensible manner (so-called black box). This affects not only laypeople, but also experts.

With regard to the rights of data subjects, this means that:

  • Claims for information cannot be fulfilled in full due to a lack of insight into the neural networks;
  • deletion obligations for any (unwillingly) stored data cannot be specifically fulfilled.

Problem: Data security

Personal data that is processed for training or through the use of AI is exposed to particular risks, so attention must be paid to data security. It has already been demonstrated that generative AI models unintentionally store the training data used and that this data can be duplicated and/or read out using certain attacks (e.g. inversion attacks), as shown most recently in "Scalable Extraction of Training Data from (Production) Language Models", a computer science paper published via Cornell University's arXiv.

Solution: As anonymous and secure as possible, no obligation to identify

The risks and enforcement difficulties cannot be completely eliminated due to the nature of the matter. In practical terms, however, they can be minimised:

It is advisable to ensure comprehensive documentation from the outset and to guard against AI-specific attacks.

With regard to incoming requests from data subjects, attention must be paid to what fulfilment is actually possible; if it is unclear whether personal data is being processed, the GDPR even helps:

Art. 11 GDPR deals with situations in which the identity of the data subject is not clear (e.g. in the case of pseudonymised data). According to this, the controller is expressly not obliged to obtain additional information solely in order to identify the data subject. If the controller is unable to identify the person, it can inform them of this in accordance with Art. 11 (2) GDPR. If the data subject does not then provide further data enabling identification by simple means, they are not entitled to the data subject rights under Art. 15 to 20 GDPR.

The Transparency Obligations of the Controller

Finally, there are the information obligations under Art. 13, 14 GDPR.

Cases in which the processed data is not collected directly from the data subject pose a particular challenge with regard to Art. 14 GDPR. Scraping (see training data above) may involve big data and therefore far too many data subjects; the controller has practically no way of contacting all of them in order to fulfil its obligations properly. This is remedied by Art. 14 (5) (b) GDPR in conjunction with Recital 62: if the effort involved is disproportionate, the obligation can be waived.

Exceptions to the transparency obligation are also possible if the identity of the data subject is unknown, see Recital 62 of the GDPR.

Regulations on Third Country Transfers

For the sake of completeness, attention must also be paid to third country transfers. If, for example, training data is processed by service providers based outside the EU/EEA, the provisions of Art. 44 et seq. GDPR must be observed. Users of AI must check whether the third country concerned is a safe third country, such as currently the US (for recipients certified under the EU-US Data Privacy Framework), or not. In the case of unsafe third countries, standard contractual clauses (SCCs) must be concluded and transfer impact assessments (TIAs) carried out in the event of data processing on behalf of the controller.

Other Obligations

As AI is a so-called new technology within the meaning of Art. 35 (1) sentence 1 GDPR, a data protection impact assessment (DPIA) must be carried out. The use of AI must be reflected in the record of processing activities in accordance with Art. 30 GDPR, and the technical and organisational measures (TOMs) must be adapted in accordance with Art. 32 GDPR. In addition, an algorithmic impact assessment can be carried out, which takes into account economic and ethical aspects of use beyond the DPIA (see the "Algorithmic Impact Assessment Tool" of the Canadian government). When using a service provider, a so-called data processing agreement (DPA) must generally be concluded with that provider.

Conclusion & Compliance Measures

The use of generative AI in one's own products or the creation of one's own LLMs raises significant data protection issues with regard to the GDPR. The processing of personal data should be based on a clear legal basis. If data is transferred to a third country, the specific particularities must be taken into account.

If a company wishes to train its own AI model, anonymising the training data is an adequate means of ensuring data protection-compliant training. Without anonymisation, particular problems arise with regard to the rights of data subjects.

Despite the innovative power of AI, it is therefore essential to take both technical and legal measures to fulfil data protection regulations and ensure the sensitive handling of personal data.

In order to ensure that the processing of personal data for the use of generative AI is as compliant with data protection law as possible, the following compliance measures, among others, should be taken:

  • The records of processing activities must be adapted accordingly.
  • The data protection information must be adapted accordingly.
  • Consent by the data subjects may have to be obtained.
  • A data protection impact assessment (“DPIA”) for the use of AI and/or even an Algorithmic Impact Assessment (“AIA”) that goes beyond this may have to be carried out and documented.
  • It is essential to take the necessary technical and organisational measures (TOMs) to ensure the security of the data.
  • In the case of the use of third-party AI or support by a service provider in the use of AI, the conclusion of a data processing agreement (DPA) may be required.

Finally, it is always worth keeping an eye on what the authorities and courts are doing. The international data protection authorities deal with new technologies with varying degrees of rigour (e.g. the ad hoc ban on the use of ChatGPT by the Italian authority in 2023 and its subsequent withdrawal). Attention should also be paid to corresponding trends and interpretations of the requirements when it comes to AI compliance.

We will be happy to support you in this.