The Future of AI Copyright Litigation: Piracy and Market Harm

Executive Summary

  • Two recent rulings in the U.S. District Court for the Northern District of California held that training artificial intelligence (AI) on copyrighted material qualifies as fair use; the court also, for the first time, characterized this practice as “highly transformative” – a key factor in fair use analysis – because it gives the material a new and different purpose, strengthening the legal case for fair use.
  • The judges in these two cases differed, however, on how to apply the fair use doctrine, particularly on whether AI training harms authors’ markets – another factor considered for fair use – and on how to treat the use of pirated content in AI training, issues that will be central to future AI litigation.
  • These divergent decisions will cause confusion for AI developers and rights holders moving forward, and Congress may ultimately have to take legislative action to provide clarity to courts that balances protections for creators against the potential barriers to AI development.

Introduction

The rise of artificial intelligence (AI) has fueled a debate between technology companies and creative industries over whether model training on copyrighted content falls within fair use, a legal exception that allows the use of copyrighted works for certain purposes. Lawsuits from artists, authors, and major media companies are challenging both the inputs (the data used to train AI) and the outputs (the content AI generates), alleging that these are unauthorized uses and reproductions of their work that threaten their markets and livelihoods. These suits have the potential to shape the future of both technology and creative industries. Yet most are still working through the courts, leaving uncertainty regarding the legality of AI training and outputs.

Notably, however, two recent rulings in the U.S. District Court for the Northern District of California – Bartz v. Anthropic and Kadrey v. Meta – mark the first time courts have held that using copyrighted material to train generative AI models qualifies as fair use; the court also, for the first time, characterized this practice as “highly transformative” – a key factor in fair use analysis – because it gives the material a new and different purpose, strengthening the legal case for fair use. Although both decisions reached the same result, their fair use reasoning differed. When deciding if using copyrighted works is fair, courts consider four factors. In these cases, the judges disagreed on two key points: whether AI training could harm authors’ sales, and how to handle the use of pirated content in training AI. The Meta case focused on market harm (one of the fair use factors), with the decision warning that training on copyrighted works could disrupt authors’ markets. The judge in the Anthropic case, by contrast, argued that training data is not public and therefore does not compete with authors’ markets. The decisions also split on pirated works: The judge in the Anthropic case held that only lawfully obtained materials can qualify for fair use, while the judge in the Meta case treated downloading and training as one unified, transformative act.

The differing rulings in Anthropic and Meta underscore how fair use decisions depend on specific facts and often vary from court to court, leaving their implications for future AI copyright cases unclear. Since there are no higher court rulings to guide them, other district courts could reach different conclusions even with similar facts. Appeals of these rulings are expected, and new lawsuits are likely to emerge. Therefore, as Congress considers whether to step in, any legislative action will need to reconcile these divergent judicial approaches on market harm and piracy to ensure both meaningful protection for creators and opportunity for AI innovation.

Copyright Concerns With AI

The rapid rise of AI has fueled a heated debate between AI companies and content creators on whether the use of copyrighted works to train models falls under the umbrella of fair use, a legal exception that allows the use of copyrighted works for certain purposes. To train a model, developers first obtain preexisting information (in the form of books, audio, or images) and translate it into mathematical representations. These representations are then used to teach the model to recognize patterns, enabling it to generate responses to user prompts. For example, when asking a model to write in Shakespeare’s style, the model would access and process the stored representations of his works and style to craft a response for the user.

Currently, dozens of lawsuits are pending against many of the big tech companies, including Anthropic, Meta, and OpenAI, with plaintiffs ranging from individual artists to large companies such as Disney and The New York Times. While the battles have mainly focused on inputs – the use of copyrighted data to train AI systems – other disputes have also raised concerns about the outputs – the responses these systems produce – and how these may harm the creative community. Most cases are still under litigation, but two major decisions, Bartz v. Anthropic and Kadrey v. Meta Platforms, were released in June.

These two decisions mark the first time courts have held that using copyrighted material to train generative AI models qualifies as fair use. Yet, despite reaching the same conclusion, their fair use reasoning differed on certain points.

Approaches for Ruling and Points of Convergence

In determining whether the use of a work in any particular case can be considered fair use, the courts consider four factors: 1) the purpose and character of the use, including whether such use is transformative and creates something different; 2) the nature of the copyrighted work; 3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and 4) the effect of the use upon the potential market for or value of the copyrighted work. The judges in the Anthropic and Meta cases approached the first, second, and third factors similarly, but took a different path on the fourth factor, and left differing guidance on how to assess the use of pirated works for AI training.

The first fair use factor supported the AI developers, as both decisions held that training an AI model is a transformative use. Judge Alsup in the Anthropic case described the technology as one of “the most transformative many of us will see in our lifetimes,” while Judge Chhabria found for Meta that there was “no serious question that Meta’s use of the plaintiffs’ books…was highly transformative.” Both judges emphasized that training gives the copyrighted material a new purpose, making it more likely to be protected under fair use. Both decisions also held that the second fair use factor favored the plaintiffs, since all the works included expressive elements, and both concluded that the third factor favored the AI developers, as copying books in full was reasonably necessary to train the models effectively.

Point of Divergence: Proving Market Harm

The fourth fair use factor – the effect of the use on the market – favored the AI developers in both cases, but the two judges approached it differently.

The decision in the Anthropic case held that, since training data are not publicly available, developers do not compete in the same market as the plaintiffs’ works. Plaintiffs, in this case, only challenged the inputs – the data used to train AI. Still, the court left open the possibility that its conclusions could change if the dispute had involved outputs, since outputs might compete with the original works in the same market.

Yet, because inputs inevitably lead to outputs – the content AI generates – the judge in the Meta case gave special attention to the market effects of training AI on authors’ work. He emphasized the risk that resulting outputs could dilute authors’ markets, noting that no other use rivals AI’s potential to saturate markets with competing works. Still, Judge Chhabria ruled in favor of the AI developers, concluding that plaintiffs failed to prove that Meta’s models reproduced their works or caused either direct or indirect market harm.

The Case for Pirated Works and Storage of Data

While on its face the Anthropic case seems like a significant win for AI developers, the decision introduced a wrinkle that could limit the effectiveness of the fair use defense. Specifically, even if AI training itself does not affect the market for the works used, the fair use defense can fail if the AI developer did not acquire those works legally and instead used pirated works to train the models.

Anthropic initially trained its models on a dataset of more than 7 million pirated works obtained from file-sharing platforms, which it later stopped using. It then created a second dataset by buying physical books, digitizing them, and destroying the originals. The Anthropic ruling separated the act of shifting the format of lawfully bought books from the use and storage of pirated copies. The judge in the Anthropic case treated the practice of purchasing books, destroying the originals, and scanning them to create digital copies for internal use as fair use, because this “format-shifting” did not result in additional copies being distributed or made available to the public, but simply replaced a physical copy with a digital one for internal purposes. On the other hand, the judge ruled that creating a permanent library of pirated books was not fair use, finding that all four factors favored the plaintiffs: The copying wasn’t transformative, the works were creative, Anthropic had no right to keep fully pirated copies, and allowing such practices would directly undercut sales and threaten the publishing market. The Meta case likewise noted that Meta obtained the data from online repositories, including “shadow libraries” that distribute copyrighted books without the rightsholders’ permission. Unlike Anthropic, the decision in the Meta case did not draw a line between pirated and legitimate copies when addressing fair use. The judge in the Meta case found that obtaining the data and training the model could not be treated as separate acts, since the data acquisition serves the end goal of the transformative process of model training.

Policy Implications and Congressional Action Looking Forward

The differing rulings in Anthropic and Meta highlight the fact-specific and unpredictable nature of fair use, where outcomes often hinge on how individual judges weigh the same factors.

As a result of this uncertainty, AI developers still lack clear guidelines on appropriate training practices and on the risks of training AI models on information acquired without purchase, such as through web scraping – extracting and reusing data from web pages. Without certainty, copyright holders could sue any AI developer regardless of whether the model produces outputs that infringe on the copyright in question. Rising litigation against AI developers threatens to slow the progress of advanced models and undermine efforts to maintain global leadership in AI.

To address these challenges, policymakers could provide clearer guidance on fair use, which would offer a more predictable framework for the industry. Policymakers could start with the two themes that have received increasing attention in recent cases – market harm and the use of pirated works in AI training. For example, Congress could clarify that market harm includes “indirect substitution,” ensuring courts weigh both direct competition and the risk of AI flooding markets with substitute content, while also discouraging piracy and protecting innovation. Congress could also state that training on copyrighted works is inherently transformative, providing a clear rule for the first fair use factor. Finally, Congress could require developers to implement output guardrails to prevent infringing responses, providing developers with a clear path to satisfy this aspect of the fair use analysis. Clearer frameworks on the types of uses that are transformative could give better guidance to developers and courts and limit unnecessary litigation.

Conclusion

It appears that, for now, the debates on copyright will focus on the market effects of AI inputs and outputs, as well as on questions of data provenance. How courts will handle fair use in the context of both AI inputs and outputs remains uncertain. Appeals of these rulings are expected, and new lawsuits are almost certain to emerge. As Congress considers whether to step in, to successfully mitigate these challenges, legislation must reconcile these divergent judicial approaches on market harm and piracy to ensure both meaningful protection for creators and opportunity for AI innovation, moving the industry beyond today’s fragmented legal landscape.
