AI Training & Copyright: Infringement or Fair Use?

Khurana and Khurana, Advocates and IP Attorneys India


1. Introduction

The advent of artificial intelligence (AI) has changed everyday life at a remarkable pace. AI refers to computing systems that use algorithms and vast amounts of data to think, reason, learn, and make autonomous decisions in response to stimuli, much as humans do. However, the rapid advancement of AI has led to various legal and ethical complications, particularly regarding copyright law. (Ravindra Kumar & Professor Pankaj Kumar, 2022)

The boom has led to the development of various generative AI models, such as OpenAI’s ChatGPT and Google’s Gemini, which operate on a similar principle: the model is trained by processing vast data sets so that it can produce a specific outcome.

In recent times, various suits have been filed across continents against OpenAI and other GenAI model developers. All of these cases lead to a central legal issue: whether this practice constitutes copyright infringement or can be considered a fair use of data, given that the data fed into the system is collected from a diverse range of sources, including both open and copyrighted works.

AI developers argue that the data is required for technological advancement and is used only for training purposes. Copyright holders, on the other hand, argue that they have the exclusive right to reproduce their works and that the unauthorized use of their content for training infringes their copyright.

This blog examines whether the use of copyrighted data sources for training AI models breaches the rights of copyright holders and should be considered an infringement. It explores various copyright regulations, legal precedents, and theories to investigate the current state of AI regulation in various jurisdictions, as well as its potential ramifications.

Keywords: Generative AI, Copyright Law, Fair Use, Commercial Use, Intellectual Property, AI Regulation.

2. Understanding AI and Copyright

  I. Generative AI: How GenAI Works

Generative AI (GenAI) is a machine-learning model that generates original content such as text, images, audio, and video in response to prompts from the user. It is a type of deep-learning model, built on neural networks that try to mimic the working of the human brain by combining data inputs, weights, and biases to parse and process data sets.

These networks operate using algorithms that allow the AI to adapt, learn, and perform complex tasks on its own, without a human guiding the learning. They work on the principle of identifying and encoding the patterns and relationships in huge amounts of data. Generative AI operates in three phases:

 

  1. Training: A deep learning algorithm is trained on vast amounts of raw, unstructured, unlabeled data. While it learns, the algorithm tries to discover intricate patterns between different data points, attempting to predict the next item in a sequence and constantly refining itself to reduce the disparity between what it predicts and the actual data.

  2. Tuning: The pre-trained AI is fed specific types of data, questions, or prompts to refine it, which helps it become accurate in a specific domain.

  3. Generating, evaluating, and returning: Users and developers continually review the outputs of their generative AI applications and adjust the model. The AI may also be granted access to real-time information to make it more effective.
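
As a rough illustration of the training phase described above, the following minimal Python sketch trains a toy character-level model to predict the next character in a sequence, repeatedly adjusting its internal scores to shrink the gap between its predictions and the actual data. It is only a sketch under stated assumptions: the function names train_bigram and predict_next, the toy text, and the learning rate are hypothetical choices made for this example, and real generative models use deep neural networks trained on vastly larger data sets.

```python
import math

def train_bigram(text, epochs=200, lr=0.5):
    """Toy next-character predictor trained by gradient descent (illustrative only)."""
    vocab = sorted(set(text))
    idx = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    # logits[i][j]: learned score for character j following character i
    logits = [[0.0] * n for _ in range(n)]
    # Training examples: (current character, actual next character)
    pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(epochs):
        grad = [[0.0] * n for _ in range(n)]
        total = 0.0
        for a, b in pairs:
            row = logits[a]
            m = max(row)
            exps = [math.exp(x - m) for x in row]
            z = sum(exps)
            probs = [e / z for e in exps]       # predicted next-character distribution
            total += -math.log(probs[b])        # disparity between prediction and data
            for j in range(n):
                grad[a][j] += probs[j] - (1.0 if j == b else 0.0)
        # Refine the scores so that the average disparity goes down
        for i in range(n):
            for j in range(n):
                logits[i][j] -= lr * grad[i][j] / len(pairs)
        losses.append(total / len(pairs))
    return vocab, logits, losses

def predict_next(vocab, logits, ch):
    """Return the character the model considers most likely to follow ch."""
    row = logits[vocab.index(ch)]
    return vocab[row.index(max(row))]

vocab, logits, losses = train_bigram("abababababab")
print(losses[0], losses[-1])             # average error before and after training
print(predict_next(vocab, logits, "a"))  # most likely character after "a"
```

The point relevant to the copyright debate is that the model ends up storing numerical scores derived from the training text rather than a copy of the text itself, although whether that distinction matters legally is exactly the question the courts are being asked.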

 

  II. Copyright Law Overview

Copyright law (Secretary to Government of India, n.d.) grants rights to the authors of literary, dramatic, musical, and artistic works and to the producers of cinematograph films and sound recordings. It is a bundle of rights encompassing, inter alia, the rights of reproduction, communication to the public, adaptation, and translation of the work.

Copyright in India is governed by the Copyright Act, 1957. S. 13 specifies the classes of works in which copyright subsists; the law protects only the expression of an idea, not the idea itself. S. 14 grants the owner of an original work exclusive rights over it, such as the rights to reproduce the work and to communicate it to the public. S. 51 deals with copyright infringement, and S. 52 provides certain exceptions, such as fair dealing for private or personal use, including research. If a user of Generative AI wishes to use copyrighted works for commercial purposes, they must obtain permission (PIB Delhi, 2024) unless their use falls under the fair dealing exceptions in S. 52 of the Act.

Copyright law in the United States, by contrast, is governed by the Copyright Act of 1976 and the Digital Millennium Copyright Act (DMCA) of 1998, the latter designed to safeguard copyrighted works in the digital era. The fair dealing principle in India is narrower than the fair use concept in the United States, which is codified in Section 107 of the Copyright Act. The Fair Use Doctrine applies a Four-Factor Test to determine whether a particular use of a copyrighted work is permissible.

Before discussing both of these concepts in detail, let us first look at the commercial use of AI.

3. AI Training v. Commercial Use of Copyrighted Material

Several AI models worldwide rely on copyrighted materials to improve their functionality:

  • OpenAI’s GPT models: Generative Pre-trained Transformers (GPTs) are AI models designed to generate human-like natural language. They are trained on an enormous amount of text data from the internet, including books, web articles, and other written materials. OpenAI was founded in 2015 as a non-profit organization, but it swiftly evolved into a multibillion-dollar for-profit corporation, a growth that plaintiffs in the pending suits allege was based in large part on the unauthorized exploitation of copyrighted material.
  • Google’s Gemini: A sophisticated multimodal system (Sundar Pichai, 2023) designed to process and reason across diverse types of data, including text, audio, images, and video. It uses web-scraped data, which may include copyrighted works.

  I. Training

AI model training is similar to teaching a child to identify objects (Dina Biago, 2024). If the child is shown correct and varied examples, they gain a strong understanding. Likewise, AI models need a rich, varied, and representative dataset to learn patterns and generalize well. To prevent problems caused by improper training, hundreds of gigabytes of error-free and relevant training content must be compiled from reliable sources, and much of this data is gathered in bulk from the internet through data scraping.

The sources drawn on for machine learning include extensive, publicly accessible, and sometimes open-source libraries, repositories, websites, and platforms on the internet. It may not be possible to rely on openly available data alone. Copyrighted material and open-source licensed material may also have to be used, which can lead to issues of copyright infringement.

  II. Commercial Use of Data

The commercial application of AI-generated output has given rise to a large number of legal and economic issues, since the free consumption of copyright-protected data for machine learning can deprive authors of their rights and cause them economic losses. For example, an AI tool such as Jukedeck (Ed Newton Rex, 2025) is capable of creating songs. Where such output is based on copyrighted material used without permission, the use cannot be considered fair, particularly if the resulting work is commercially exploited.

4. Copyright Infringement in Training

AI algorithms are designed to mimic human cognitive processes, and their training often incorporates protected content, raising questions of copyright infringement. When AI generates outputs that closely resemble or replicate original works, the output may be deemed a reproduction that infringes the creator's rights. Under the principles of adaptation and derivative work in copyright law, an adaptation (Pleaders, 2016) converts an existing copyrighted work into a new format, while a derivative work is a new creation based upon the original work. Where the AI-generated output is very similar to the original work, it may be classified as an unauthorized adaptation or derivative work rather than an independent production, resulting in infringement.

Eastern Book Company vs. D.B. Modak[1] was a landmark decision that required human skill and judgment for a work to be considered original. It established the precedent that merely making changes to an existing work is insufficient to qualify it as an original creation. This ruling is highly relevant to AI, as in a large number of cases the output generated by AI lacks human intervention and therefore cannot be considered original. If courts continue to uphold this stance, it will become difficult for AI developers to keep using copyrighted sources for training, and the rights of copyright holders will be safeguarded.

5. The Fair Use Doctrine and Use of Copyrighted Material

The Fair Use Doctrine, laid down in Section 107 of the US Copyright Act, determines whether a use of a copyrighted work is infringing by applying the Four-Factor test[2]. The doctrine is flexible enough to adapt to recent technological advances in the area of artificial intelligence. The four factors are:

  1. The purpose and character of the use: This factor looks at whether the use is commercial or noncommercial and whether it "transforms" the copyrighted work. Noncommercial use weighs in favor of fair use, while commercial use weighs against it, so numerous AI platforms that are used commercially may not qualify. The more transformative the work is, the less significant the other factors become.
  2. The nature of the copyrighted work: The use of highly creative works, such as art, music, or writing, tends to weigh against fair use.
  3. The amount and substantiality of the portion used: The portion used should be small and should not be the core of the original work. For example, using a sentence from a poem is more likely to be fair use than using a whole paragraph.
  4. The effect of the use upon the potential market: This is one of the most significant considerations in determining fair use. For example, making the copyrighted work freely available to the public may reduce the value of the original work. In the context of generative AI, copyright holders argue that their works are used in a manner that undermines their economic prospects and that the use is therefore not fair.

In an infringement claim where fair use is raised, the copyright owner must first prove valid ownership. In Cartoon Network LP v. CSC Holdings[3], the court differentiated between volitional copying and non-volitional, computer-generated copies and held that such "intermediate operational use" does not amount to infringement. In Authors Guild v. Google[4], Google digitized a corpus of books to render it machine-readable and to enable keyword searches for easy access by users. The court ruled that the book scanning was transformative fair use, since its purpose differed from that of the original works.

Copyright holders argue that AI harms the market for their works. In practice, however, AI does not destroy that market. It increases accessibility, which benefits both the creator and the user and is an important component of transformative use. The decisions in both cases align with the principle of transformative use: the machine processing of copyrighted data was not treated as an infringement of the holder's rights, an approach that encourages innovation while upholding copyright protection where necessary.

6. Way Ahead

The future of AI regulation is uncertain, as there are no clear criteria for determining whether AI has infringed copyright. Courts may examine how AI models work in order to decide whether the use of copyrighted data is infringement or fair use. Cartoon Network LP v. CSC Holdings, Authors Guild v. Google, and Eastern Book Company vs. D.B. Modak discuss certain situations relevant to infringement, but a clearer stand from the judiciary and policymakers is still needed. Another issue that is far from settled is the commercial use of AI-generated output and whether it should be considered an infringement. This is a grey area, with no specific legislation or precedent dealing with it.

The US Copyright Office has tried to address these issues in the second part of its report on Copyright and Artificial Intelligence (U.S. Copyright Office, 2025), titled Copyrightability. The report highlights that GenAI may be held responsible for copyright infringement when it has unauthorized access to copyrighted sources and produces outputs 'substantially similar' to the copyrighted work. The report also discusses in detail that human creativity is at the center of copyright and IPR law regardless of commercial use: AI can be used as a creative tool, but work generated by AI cannot be copyrighted, and its commercial use may lead to infringement. Daniel Gervais (Ellen Glover, 2024), a professor at Vanderbilt Law School, makes the same point: where AI and humans collaborate, copyright protection is available only if humans have authoritative control (U.S. Copyright Office, 2023) over the AI.

Moving forward, judicial interpretation will play a crucial role in defining the future of AI and copyright law around the globe. A number of cases are pending across jurisdictions, and the extensive use of AI will raise many more concerns in the coming years, creating trouble not only for human creators but also for AI model developers. The path the courts take in deciding those cases today will shape the future of AI and copyright law.

7. References

Cole Stryker, & Mark Scapicchio. (2024, March 22). What is generative AI? IBM.

Dina Biago. (2024). Byte-sized Banditry – When AI Models Infringe Copyright in Training Content. Spoor Fisher.

Ed Newton Rex. (2025). Reflections on Angio Genius and Accelerate Cambridge. The Entrepreneurship Centre Blog.

Ellen Glover. (2024, September 18). AI-Generated Content and Copyright Law: What We Know. Builtin.

Jim Holdsworth, & Mark Scapicchio. (2024). What is deep learning?

PIB Delhi. (2024). Existing IPR regime well-equipped to protect AI generated works, no need to create separate category of rights.

Pleaders. (2016, October 17). What Is “Adaptation” Under Copyright Law? LawSikho.

Ravindra Kumar, & Professor Pankaj Kumar. (2022). TRAINING AI AND COPYRIGHT INFRINGEMENT: WHERE DOES THE LAW STAND? Indian Journal of Integrated Research in Law, 2(1).

Secretary to Government of India. (n.d.). RATIONALE OF COPYRIGHT PROTECTION. In Government of India Department For Promotion of Industry and Internal Trade Ministry of Commerce and Industry.

Sundar Pichai. (2023, December 6). Introducing Gemini: our largest and most capable AI model.

U.S. Copyright Office. (2025). Copyright and Artificial Intelligence.

U.S. Copyright Office, L. of Congress. (2023). Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence. Federal Register.

What is data scraping? (2025). Cloudflare.

 
