JCDL 2004
Digital Libraries Summit
Shadow Libraries, AI Training Data, and the Copyright Problem Digital Libraries Cannot Avoid

The problem the field was handed without being asked

In 2023 and 2024, a sequence of disclosures, lawsuits, and investigative reports established what many in the digital library community had long suspected: that several of the most prominent general-purpose AI language models were trained, in whole or in substantial part, on datasets assembled from shadow libraries, that is, large-scale, unauthorised repositories of copyrighted scholarly and literary works. The Books3 dataset, compiled from unauthorised book collections including the Bibliotik private tracker, appeared in the training corpora of models developed by major AI laboratories. Academic corpus scraping, conducted at a scale and speed that outpaced any institutional response, became the default practice for AI training data collection during precisely the years when digital library infrastructure was making that collection technically trivial.

The digital library community did not create this situation. Institutional repositories, open access mandates, and the infrastructure of scholarly communication were built on principles of broad access that assumed the good-faith use of that access for the purposes those systems were designed to serve: education, research, reading. The repurposing of that infrastructure for commercial AI training at scale was not anticipated by the legal frameworks that govern it, the terms of service that licensed it, or the professional norms that shaped it.

That fact does not reduce the field's responsibility to respond. It makes the response both more difficult and more urgent.

What shadow libraries are and why they matter for this conversation

Shadow libraries occupy an ambiguous position in the information ecosystem. Defined by Rademeyer and Selvadurai in their January 2026 analysis in the Journal of Intellectual Property Law & Practice as "vast online repositories of textual content, the majority of which are copyrighted works that have been compiled without permission," they include platforms such as Library Genesis, Sci-Hub, and Z-Library — repositories that have provided access to scholarly literature at a scale and breadth that no institutional subscription model has matched, particularly for researchers in the Global South.

The ethical complexity of shadow libraries is real and well-documented in the literature. They simultaneously represent a failure of equitable access to scholarly knowledge — a failure created by the commercial publishing system that shadow libraries exploit — and a systematic violation of copyright that those same publishers have, with arguable hypocrisy, prosecuted aggressively. Both things are true. The digital library community has engaged with this complexity for two decades without reaching consensus, which is the appropriate outcome for a genuinely contested question.

What changed in 2023 and 2024 is that this complexity became the foundation for a new and distinct problem: shadow libraries were used as training data for commercial AI systems. The Books3 dataset, one of the most widely discussed training corpus components, contained approximately 196,000 books drawn from sources including Bibliotik, a private tracker, and was used in the training of several high-profile language models. Its contents included contemporary fiction, academic monographs, and technical works, categories that map directly onto the holdings of digital libraries operating under formal institutional licences.

The crucial distinction that the January 2026 paper by Rademeyer and Selvadurai draws is between the act of creating a shadow library (which is copyright infringement under most jurisdictions' law) and the downstream use of that library as AI training data. Current regulatory frameworks are, as the paper argues, poorly equipped to address the downstream use case. An AI model trained on infringing material does not, under most existing doctrine, carry the infringement forward in a form that straightforwardly supports a rights-holder claim against the model's output. This is what Mukherjee and Chang, in their January 2026 paper on what they term "the AI Ouroboros," describe as "copyright laundering through recursive training" — a situation in which synthetic data generated by models trained on infringing material is then used to train successor models, with each generational step further attenuating the evidentiary chain between the original infringing act and the current model's outputs.

Where digital libraries are implicated

The direct implication for digital libraries is institutional, not merely legal.

Digital libraries that participate in open access mandates, that operate institutional repositories, that maintain consortial agreements for content sharing, and that have digitised their holdings under fair use or equivalent exceptions are all, in different ways, sources of data that AI training pipelines can and do access. The question of whether any particular act of AI training from institutional repository content constitutes infringement is, in most jurisdictions, genuinely unsettled. The EU AI Act's Article 53(1)(d) requires providers of general-purpose AI models to "draw up and make publicly available a sufficiently detailed summary about the content used for training the model" — a transparency requirement that has, since the Act's commencement, produced summaries of varying quality and specificity that rarely allow a digital library to determine whether its holdings were included.

The EU's Copyright in the Digital Single Market (CDSM) Directive, and its national implementations, contain two text and data mining (TDM) exceptions. Article 3 permits TDM for scientific research by research organisations and cultural heritage institutions. Article 4 provides a broader exception that covers TDM for any purpose by anyone with lawful access to the content, subject to an opt-out right for rightsholders. The interaction between these exceptions and AI training, specifically commercial AI training, was not settled law when the Directive was drafted, and remains contested in 2026.

The practical consequence for digital libraries is that their opt-out infrastructure is inadequate to the scale of the problem. The CDSM Directive's opt-out mechanism requires rightsholders — or institutions acting on their behalf — to signal objection in a machine-readable format that AI crawlers respect. The existing infrastructure for this is the robots.txt protocol and, more recently, the emerging ai.txt convention. Neither is enforceable in a legally robust sense. Neither has been adopted universally. And neither addresses the retroactive problem of training data already collected before any opt-out was signalled.
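As a rough illustration of what the robots.txt layer can and cannot express, the sketch below uses Python's standard urllib.robotparser module to check which crawlers a given robots.txt admits. The file contents, the repository URL, and the ExampleResearchBot name are hypothetical; GPTBot and CCBot are used only as examples of published crawler tokens, not as a complete or authoritative list. The check reflects what a compliant crawler would do, and says nothing about crawlers that ignore the file or about content collected earlier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an institutional repository (an assumption
# for illustration). It blocks two example AI crawler tokens site-wide
# and blocks everyone else only from an admin area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def crawl_permissions(robots_txt, agents, url):
    """Return {agent: True/False} for whether robots.txt allows fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical item URL and agent list.
permissions = crawl_permissions(
    ROBOTS_TXT,
    ["GPTBot", "CCBot", "ExampleResearchBot"],
    "https://repository.example.org/items/1234",
)
print(permissions)
```

The limitation described above is visible in the shape of the check itself: the only question robots.txt can answer is whether a named agent may fetch a URL now. It has no vocabulary for what may be done with a copy that was already fetched.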

What the regulatory landscape actually provides

The January 2026 analysis by Kyrychenko, Mudryi, and Chaklosh, examining the regulatory landscape across the EU, US, and Asia-Pacific, reaches a conclusion that will be unsurprising to practitioners but is worth stating precisely: current regulatory frameworks are "predominantly reactive rather than proactive." Compliance with training data transparency requirements is assessed "largely on the basis of self-reported documentation. Verification typically occurs only in response to external triggers, such as regulatory complaints or litigation initiated by rights-holders."

In the United States, no federal legislation specifically governing AI training data has been enacted as of May 2026. California Assembly Bill 412, which would have required developers of generative AI systems to document copyrighted materials used in training and provide disclosure mechanisms, did not advance. The US Copyright Office's Part 3 report on Copyright and Artificial Intelligence, published in preliminary form in May 2025, articulates the question — whether training on copyrighted material without licence constitutes fair use — without resolving it, noting that the matter will ultimately require judicial determination.

In the EU, the AI Act's transparency requirements are in force, but enforcement mechanisms at the level of individual training dataset components remain underspecified. The interaction between the AI Act and the CDSM Directive's TDM provisions is the subject of ongoing regulatory guidance that has not yet produced the clarity that digital libraries need to determine their obligations and rights.

In the Asia-Pacific region, Japan's copyright exception for TDM is among the broadest in the world, explicitly permitting extraction of copyrighted content for information analysis purposes regardless of commercial intent. This has made Japan an attractive jurisdiction for AI training data collection, and it opens the door to the kind of regulatory arbitrage that the more restrictive EU approach is attempting to prevent.

The specific burden this places on digital library professionals

The practical burden on digital library professionals is not primarily legal in character — most libraries are not parties to copyright litigation and are unlikely to become so. It is professional, institutional, and infrastructural.

Provenance documentation has become a first-order concern. Digital libraries that have digitised holdings, that operate institutional repositories, and that provide programmatic access to their collections through APIs need to understand and document the provenance chain of their digital objects at a level of specificity that most existing catalogue systems were not designed to support. Which works were digitised under what authority? Which are in the public domain, under which jurisdiction's law? Which are licensed under open access terms that do or do not permit machine learning use? The metadata infrastructure of most digital libraries cannot currently answer these questions at scale.
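One way to picture the gap is to write down the record a catalogue would need to hold in order to answer those three questions. The following is a minimal sketch, not a proposed standard: the field names, the MLUse categories, and the example identifiers are all assumptions made for illustration, and a real schema would need to align with existing rights metadata vocabularies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional


class MLUse(Enum):
    """Hypothetical categories for whether machine learning use is allowed."""
    PERMITTED = "permitted"
    PROHIBITED = "prohibited"
    UNSPECIFIED = "unspecified"


@dataclass(frozen=True)
class ProvenanceRecord:
    """Illustrative per-object provenance record (all fields are assumptions)."""
    identifier: str
    digitisation_authority: str      # e.g. "fair use", "publisher agreement"
    public_domain_in: FrozenSet[str]  # jurisdiction codes where the work is public domain
    licence: Optional[str]           # e.g. "CC BY 4.0", or None if unlicensed
    ml_use: MLUse


def needs_review(records: List[ProvenanceRecord]) -> List[str]:
    """List identifiers whose machine learning use status is undocumented."""
    return [r.identifier for r in records if r.ml_use is MLUse.UNSPECIFIED]


# Hypothetical holdings.
records = [
    ProvenanceRecord("hdl:example/001", "public domain scan",
                     frozenset({"US", "GB"}), None, MLUse.PERMITTED),
    ProvenanceRecord("hdl:example/002", "publisher agreement",
                     frozenset(), "CC BY-NC 4.0", MLUse.UNSPECIFIED),
]
print(needs_review(records))
```

Even this small structure makes the difficulty concrete: public domain status is a set of jurisdictions rather than a yes-or-no flag, and an open access licence on its own does not settle the machine learning question, which is why it is recorded as a separate field.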

Opt-out infrastructure needs deliberate attention. Libraries that wish to signal their position on AI training use of their holdings need to implement machine-readable opt-out mechanisms that go beyond robots.txt — which controls crawling but not training use of already-crawled content — and engage with emerging standards. The AI Act's transparency requirements create, in principle, a mechanism through which libraries could verify whether their holdings appear in disclosed training summaries. In practice, those summaries are currently insufficiently granular to support systematic verification.

Licensing and terms of service review is overdue at most institutions. The terms under which digital libraries provide access to their holdings — and the terms under which they acquire content from publishers and rightsholders — were not drafted with AI training in mind. Revisiting those terms to address machine learning use, in consultation with legal counsel and in coordination with consortia such as HathiTrust and DPLA, is a concrete action within the field's capacity.

What the field can do that the regulatory framework cannot

The regulatory framework will eventually provide clearer answers on the questions it currently leaves open. Court decisions in the US and EU on training data and fair use will establish doctrine. The AI Act's implementation guidance will mature. New legislative interventions will address gaps in current law.

What the regulatory framework cannot do is position digital libraries as active, principled participants in the data governance conversation rather than passive subjects of decisions made by AI developers and regulators. That positioning requires the field to do several things that are within its capacity.

The first is to develop and advocate for data provenance standards that are specific to digital library holdings and that are interoperable with AI training transparency requirements. The FAIR principles — Findable, Accessible, Interoperable, Reusable — provide a foundational framework, but they were not designed for this use case and require extension. The JCDL community is well-placed to develop that extension.

The second is to engage with AI developers directly and institutionally, from a position of informed advocacy rather than reactive compliance. Several major AI laboratories have expressed willingness to negotiate licensing arrangements with content providers. Digital libraries have holdings that AI developers need. That is a basis for negotiation that the field has not fully exploited.

The third is to support the development of detection and verification tools that allow libraries to determine whether their holdings have been included in training data, with or without authorisation. Research on training data attribution — identifying which documents contributed to a model's outputs — is an active area of machine learning research. The digital library field should be a participant in and funder of that research.
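To give a sense of the simplest end of that tooling, the sketch below measures how much of a model's output reproduces word sequences from a known holding. This is a deliberately crude verbatim-overlap signal, not a training data attribution method of the kind that research pursues: high overlap can arise from quotation or common phrasing, and low overlap does not rule out inclusion. The function names, the whitespace tokenisation, and the sample texts are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(holding_text, model_output, n=5):
    """Fraction of the output's n-grams that also occur verbatim in the holding.

    Returns 0.0 when the output is too short to contain any n-gram.
    """
    output_grams = ngrams(model_output, n)
    if not output_grams:
        return 0.0
    shared = output_grams & ngrams(holding_text, n)
    return len(shared) / len(output_grams)


# Hypothetical holding text and model output.
holding = "the quick brown fox jumps over the lazy dog near the river bank"
output = "the quick brown fox jumps over a sleeping cat"
ratio = overlap_ratio(holding, output, n=3)
print(round(ratio, 3))
```

A library could run a check of this kind against its own holdings without any cooperation from a model developer, which is its main appeal. Its weakness is the reason the research matters: a model that has absorbed a work without reproducing it verbatim leaves nothing for a string comparison to find.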

The prior question that the training data crisis has made urgent

Underneath the specific questions about copyright, training data, and opt-out mechanisms lies a question that the digital library field has been approaching for some years without fully confronting: what is the institutional identity of a digital library in a world where AI systems can perform at scale the information retrieval, synthesis, and analysis functions that libraries have always understood as their core service?

Shadow libraries attracted tens of millions of users not because they were shadow libraries but because they provided access to knowledge efficiently. The AI systems trained on their contents now provide a different form of access — not to the works themselves, but to the knowledge those works contain, synthesised and delivered without attribution. The copyright problem is real and requires response. But it sits on top of a deeper question about what libraries are for when the knowledge they hold can be extracted, synthesised, and delivered by systems that do not require the library to exist.

That question deserves its own treatment. The JCDL community, convening in June, will have more to say about it. The copyright crisis is the occasion for the conversation; the question is whether the field will use that occasion to articulate a response that is about more than compliance.
