A.I. vs. Journalism: Unpacking the Copyright Battle Between The New York Times and ChatGPT

Khurana and Khurana, Advocates and IP Attorneys USA


INTRODUCTION

Artificial Intelligence (A.I.) is the development of machines that can perform tasks requiring human-like intelligence, such as learning, reasoning, and decision-making. Over the decades, A.I. has advanced tremendously, with modern systems being far more reliable and capable than their predecessors. Earlier models of A.I. were restricted by their rigid programming, but today's A.I. uses sophisticated machine learning and deep learning techniques to adapt, recognize patterns, and evolve through experience.

Generative AI is a branch of artificial intelligence that produces new forms of content, including text, images, music, or even computer code. Unlike traditional A.I., which mainly identifies patterns and makes decisions based on pre-existing information, generative A.I. models can create original outputs. These systems, driven by machine learning, use vast data to generate human-like and creative responses. A prime example of generative A.I. is ChatGPT, developed by OpenAI. ChatGPT is built on the GPT (Generative Pre-trained Transformer) framework, enabling it to comprehend natural language and generate contextually appropriate, coherent text.

The phrase "use vast data" has become a point of great concern in recent times, as The New York Times (NYT) has alleged that OpenAI free rides on NYT's copyrighted content, using it to train its models in a manner that amounts to copyright infringement. The complaint filed by NYT broadly points out two problems. First, "these tools also wrongly attribute false information to The Times." Second, "by design, the training process does not preserve any copyright-management information, and the outputs of Defendants' GPT models removed any copyright notices, titles, and identifying information, even though those outputs were often verbatim reproductions of Times content."

A.I. models are trained on a wide variety of data. Still, one must pause to consider the copyright status of that data: even where the data is publicly available, someone may still hold copyright in it, or at the very least moral rights.

NYT LICENSING POLICY

The New York Times relies on its exclusive copyright rights—such as "reproduction, adaptation, publication, performance, and display"—to protect its content. For over a century, the Times has registered daily copyrights for its print editions, enforced a paywall, and set clear terms of service to regulate how its content is copied and used. Anyone looking to use Times content for commercial purposes must first seek a licensing agreement with the Times. The company requires third parties to get permission before using its content or trademarks. For many years, it has entered into licensing agreements to manage how its content is shared and ensure fair compensation. Large tech companies and other third parties pay substantial royalties to the Times for the limited right to use its content, with strict rules on usage beyond what is agreed. Additionally, the Times licenses some of its content through the Copyright Clearance Center (CCC), a platform that provides corporate, academic, and nonprofit licenses. For instance, a business can pay around ten dollars per article for a license to make photocopies or several thousand dollars to post a single Times article on a "commercial website for up to a year."

NYT'S VIEWPOINT

The Times' current complaint concerns using copyrighted material to create generative AI tools, specifically OpenAI's ChatGPT and Microsoft's Bing Chat. Both tools are powered by OpenAI's G.P.T. model, a type of large language model (LLM). These models are developed by being "trained" on vast text collections, allowing them to absorb patterns and understand how words fit together in context. When prompted, the LLM uses this knowledge to predict and generate responses that resemble natural language. The newest versions of GPT are trained on an enormous amount of data, "equivalent to a Microsoft Word document with over 3.7 billion pages." The heart of the Times' complaint lies in the fact that this massive dataset includes a substantial amount of copyrighted content from the Times[i].

OpenAI uses and reproduces NYT's work in the course of training its models, but that use is unlicensed and unauthorised, thereby infringing NYT's copyright. LLMs sometimes "memorize parts of works" present in the data fed to them during training. As a result, the models can occasionally produce a "near verbatim reproduction of the works." Further, LLMs generate "synthetic" search results that, when "prompted," can recreate far more detailed content from an original article than a typical online search would show. This allows users to bypass the Times' paywall, giving them access to expressive content without going through the proper channels[ii].

These issues hinder the provision of high-quality journalism to the public at large: readers can conveniently obtain summaries or verbatim copies of NYT text and news without spending a single penny. This would "obviate the need" to purchase the NYT's paid access, eventually hampering the revenue that NYT relies on to fund its research and its employees.

While OpenAI holds itself out as a non-profit organization, the reality is quite the opposite. As of April 2023, ChatGPT had approximately 173 million users. The fraction of those users who opt for ChatGPT Plus are charged $20 per user per month, and GPT-4 is offered as a high-capability subscription model aimed mainly at corporate customers[iii].

OPEN AI's VIEWPOINT

OpenAI has relied on the "doctrine of transformative use," arguing that its use potentially qualifies for protection under the "fair use doctrine" enshrined in section 107 of the U.S. Copyright Act of 1976. It has argued that generative A.I. "transforms" the data fed for training into new output, and neither aims at nor leads to "exact or substantially similar copies" of the original data or work. Its main argument is that whatever data it has taken to feed its model is protected under the "doctrine of fair use." OpenAI has also contended that the "exact reproduction of NYT articles" is a bug its developers are working to resolve; its aim, it claims, is not to make an "exact reproduction" of the training data[iv], but to produce new text that arises from the "information absorbed through training" and does not resemble the original work. However, the complaint filed by NYT showcases various instances of ChatGPT outputs dating from 2019, and a few Bing Chat outputs from 2023, that were exact verbatim copies.

The significance of 2019 is that OpenAI, having seen infringement by then, should have fixed its bugs to avoid further copyright infringement. Instead, even after being put on notice by the plaintiff, the defendants, well aware of multiple instances of infringement of which NYT informed them from time to time, intentionally removed CMI (copyright-management information) from NYT's work while preparing it for use in training their models. They knew that once CMI was removed, "it would not be retained within the models or displayed when the models present unauthorized copies or derivative copies" of NYT's work to users, which would help conceal the infringement[v].

FAIR USE DOCTRINE

The aim of copyright law across jurisdictions, whether India or the U.S., has always been to protect creators by giving them "exclusive rights" over their original work. This protection is enshrined in section 106 of the U.S. Copyright Act of 1976,[vi] which grants the copyright owner exclusive rights to "reproduce, distribute, perform and display copyrighted work." Section 107[vii], briefly mentioned above, follows in continuation. This doctrine sanctions "unlicensed use of copyright work" subject to certain conditions, thereby striking a balance between the copyright holder's interest and the public's access to creative work. The doctrine revolves around four factors[viii]:

  1. "Purpose and character of the use"- whether the "use is commercial or non-profit educational": Although OpenAI has portrayed itself as non-profit since its inception, reports from 2023 paint a rather different picture, as stated above. The idea of transformative work is to add new meaning or ideas to the original work, and OpenAI claims to operate on this model. However, the instances reported by NYT in its complaint showcase exact verbatim copies with no novelty, which weighs heavily against fair use protection.
  2. The nature of the copyrighted work- creative works attract more protection, while factual works attract less. The outputs at issue here showed almost no creativity: the extracts from NYT and from ChatGPT had almost no difference.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole.
  4. The effect of the use upon the potential market for, or value of, the copyrighted work: GPT's use of NYT's copyrighted work to bypass the paywall led to decreased revenue for the Times, and one must understand that if the Times and other organizations cannot produce legitimate news and "protect independent journalism," no computer or A.I. will be able to fill the vacuum created by that absence.

It would be correct to say that "A.I. has already grown to become a multi-billion dollar industry, and that, whatever the social benefit of their innovations, A.I. companies enjoy direct, staggering financial gains. At whose expense have these gains been secured?"

ETHICAL AND LEGAL CONCERNS

Generally, copyright law encompasses moral rights, which call for due attribution to the artist for their original work. While no such right exists within the U.S. Code, it does exist in the Berne Convention for the Protection of Literary and Artistic Works.

While it is said that A.I. always provides the finest and most accurate information because it is continuously fed data, in the author's opinion this statement does not hold true. There have been many instances where ChatGPT has produced fabricated case names along with made-up citations that do not exist[ix]. So the question that arises is where the legitimacy lies when such a vast amount of data is fed in. This leaves human authors fighting chatbots not only over money and infringement but also over the legitimacy of the content the chatbots display.

IN A NUTSHELL

One of two outcomes is inevitable. A ruling in favor of the plaintiff would protect NYT's work, and original works more generally, from further infringement by OpenAI or other A.I. applications in the future. Such a ruling could also bring concrete mandates on the correct way of using publicly available data to feed A.I. models: companies would have to take licences, enter into proper agreements, or seek explicit permission from authors. The debate could also prompt other companies to come forward claiming breach of their user policies where a chatbot reproduces their work verbatim with no transformation. It might further lead to new legislation or provisions dealing with the nuances of copyrighted work, infringement, and A.I. applications, which is the need of the hour, as many countries, such as India and the U.K., are still developing this aspect of their law.

Conversely, suppose the ruling favors A.I., and the use of copyrighted material is granted protection under the doctrine of fair use. In that case, it will boost the A.I. market by swiftly strengthening models trained on all available material. This would help in the creation of new ideas but simultaneously raise concerns for authors left without any protection over their work.

A balance must be struck between protecting authors' rights and A.I. applications' use of authors' data, which requires a model that benefits both parties. While the latest update in this case shows that the parties have failed to arrive at common ground, it is interesting to note that OpenAI has signed deals with two other major publishing entities, one of them being Axel Springer, whose content will be used to train OpenAI's models. This raises another question: which conditions of the deal proposed by NYT are unacceptable to OpenAI, or vice versa? The ruling is years away, and while NYT's suit may be justified and filed with good intentions, it will be interesting to see whether it inhibits the evolution of A.I., which has evidently become an integral part of our lives, with businesses relying on it. It would not be easy to put this genie back in the bottle.

 

[i] Audrey Pope, NYT v. OpenAI: The Times's About-Face, Harvard Law Review, April 10, 2024.

[ii] Supra note iii.

[iii] Supra note ii.

[iv] Cade Metz, OpenAI Says New York Times Lawsuit Against It Is ‘Without Merit’, The New York Times, Jan 8, 2024.

[v] Supra note ii.

[vi] 17 U.S.C. § 106 (1976).

[vii] 17 U.S.C. § 107 (1976).

[viii] Supra note vii.

[ix] Supra note v.




About the Firm

Khurana and Khurana, Advocates and IP Attorneys

Address: D-45, UPSIDC, Site IV, Kasna Road, Greater Noida - 201308, National Capital Region, India
Tel: 91-120-313 2513, 91-120-350 5740
Fax: 91-120-4516201
Contact Person: Tarun Khurana
Email: info@khuranaandkhurana.com
Link: www.khuranaandkhurana.com
