JCDL 2004
Digital Libraries Summit

The Persistent Identifier Problem Is Not Solved and COMET Is Trying to Explain Why

A persistent identifier is, at its most basic, a promise. It is a string of characters assigned to a digital object with the institutional commitment that the string will continue to resolve to that object — or to accurate information about its current location — indefinitely. The promise is what makes scholarly citation meaningful in the digital environment. Without it, a reference in a published paper is a URL that will eventually break, not a permanent link to the resource it describes.

The field has built extensive infrastructure around this promise. The Digital Object Identifier system, maintained by the International DOI Foundation and operated through registration agencies including Crossref and DataCite, underpins scholarly communication globally — every published research article with a DOI, every dataset deposited in a compliant repository, every funding record linked to a research output through ORCID and Crossref metadata. The Handle System, which provides the technical infrastructure for DOI resolution, operates through a distributed network of servers that resolve the persistent identifier to the current URL of the object it identifies. The Archival Resource Key system, developed by the California Digital Library and now maintained through the ARK Alliance, had registered over 1,700 institutions by 2025.

The infrastructure exists. What COMET — the Collaborative Metadata Enrichment Taskforce, launched in November 2024 by the University of California's California Digital Library — identified in its initial phase is that the infrastructure has a quality problem. And the quality problem is, at its core, a metadata problem.

What COMET Found and Why It Matters

The COMET initiative was launched with a specific, well-defined diagnosis: only record owners can update DOI metadata, meaning that institutions with enrichments to contribute — corrections to author affiliations, updated abstracts, additional subject classifications, links to related datasets or funding records — must maintain their own improved versions in separate systems. The result is duplicated effort and fragmented representation of research outputs across the scholarly communication ecosystem.

The California Digital Library's February 2026 update on COMET's progress describes the initiative's 2026 focus as formalising the processes and structures that emerged from its pilot phase while extending collaborations to new partners and domains. The initial pilot drew together a taskforce of institutions that contributed validated metadata improvements to DOI records — a proof of concept for community-contributed metadata enrichment at the level of the identifier record itself.

The significance of this is easier to understand if you have tried to do systematic literature analysis across Crossref metadata. A substantial proportion of DOI records in Crossref are incomplete in ways that matter for research infrastructure: missing author ORCID links, incorrect or absent funder information, subject classifications that do not align with the controlled vocabulary systems used by the discovery tools that ingest the metadata. These gaps are not the result of institutional bad faith — they are the result of the identifier registration process being designed for speed and compliance rather than metadata quality, with no systematic mechanism for improvement after registration.

COMET addresses this by creating pathways for community contribution. The approach is borrowed from the library cataloging tradition — the recognition that no single institution has complete information about all of its resources, and that collective maintenance produces better metadata than isolated institutional effort. Applied to the DOI system, this means that a library with expertise in a particular subject domain can contribute improved subject classifications to DOI records from that domain, regardless of which institution originally registered the identifier.
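The contribution model described above can be sketched in code. This is a hypothetical illustration of the merge logic, assuming a simple dict-based record model — the field names and function are illustrative, not part of any COMET or Crossref API.

```python
def merge_enrichment(record: dict, contribution: dict) -> dict:
    """Merge a community-contributed enrichment into a DOI metadata
    record, filling gaps without overwriting owner-supplied values.
    (Illustrative sketch; not an actual COMET interface.)"""
    merged = dict(record)
    for field, value in contribution.items():
        if field == "subjects":
            # Union subject classifications rather than replacing them.
            merged["subjects"] = sorted(set(record.get("subjects", [])) | set(value))
        elif not merged.get(field):
            # Only fill fields the record owner left empty.
            merged[field] = value
    return merged

record = {"doi": "10.1234/example", "subjects": ["digital libraries"], "funder": None}
enrichment = {"subjects": ["metadata quality"], "funder": "Example Foundation"}
print(merge_enrichment(record, enrichment))
```

The key design choice mirrors the cataloging tradition: contributions extend a record additively (subject classifications are unioned, empty fields are filled) rather than letting any contributor overwrite what the record owner asserted.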

ARK and the Decentralised Alternative

While DOIs operate through a centralised registration and resolution infrastructure, the ARK system takes a fundamentally different approach. ARKs are designed around the principle, articulated by the system's original architect, that persistence "is purely a matter of service and is neither inherent in an object nor conferred on it by a particular naming syntax." The technical architecture reflects this: ARKs use a decentralised infrastructure in which each Name Assigning Authority — identified by a Name Assigning Authority Number, or NAAN — maintains its own resolution service while the global N2T resolver operated by the California Digital Library provides a fallback resolution point.
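The decentralised architecture is visible in the identifier syntax itself. A minimal sketch of the public `ark:/NAAN/name` structure, with a made-up NAAN and name for illustration:

```python
def parse_ark(ark: str) -> tuple[str, str]:
    """Split an ARK into its Name Assigning Authority Number (NAAN)
    and the locally assigned name."""
    body = ark.removeprefix("ark:").lstrip("/")
    naan, _, name = body.partition("/")
    return naan, name

def n2t_url(ark: str) -> str:
    """Build a fallback resolution URL via the global N2T resolver
    operated by the California Digital Library."""
    naan, name = parse_ark(ark)
    return f"https://n2t.net/ark:/{naan}/{name}"

# Example NAAN and name are fabricated for illustration.
print(parse_ark("ark:/12345/x54xz321"))  # → ('12345', 'x54xz321')
print(n2t_url("ark:/12345/x54xz321"))
```

The NAAN routes resolution to the assigning authority's own service; N2T only needs the NAAN registry to forward any ARK it receives, which is what keeps the architecture decentralised.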

The practical consequence is that ARK adoption has a lower barrier than DOI adoption for institutions outside the major publishing and research data infrastructure ecosystems. The 1,700+ registered institutions by 2025 include national libraries, cultural heritage organisations, indigenous knowledge archives, government data repositories, and small scholarly publishers that would not typically be DataCite or Crossref members. The ARK Alliance's international community-building effort, which began as "ARKs in the Open" in 2018, has produced a governance structure and a technical documentation base that supports adoption across a wide range of institutional contexts.

The 2025 paper "Integrating ARK Persistent Identifiers into Research Data Infrastructure," published in Technical Science Integrated Research, makes the case for ARK adoption in the context of the growing open science mandate landscape. The paper argues that ARKs' adaptability to both machine-readable and human-readable representations makes them suitable for a wide range of research outputs, including evolving datasets and non-traditional materials — the kind of outputs that DOI infrastructure handles less elegantly because they do not map cleanly onto the article-centric metadata model that Crossref was designed around.

The complementarity of ARK and DOI is the field's current practical position: they are not competitors but components of a more resilient identifier ecosystem. EZID, the California Digital Library's identifier management service, provides creation and management for both ARKs and DOIs through a single interface. N2T serves as the global resolver that ensures both identifier types remain actionable over time.

The Unresolved Challenges

The persistent identifier infrastructure in 2026 is more mature, more widely adopted, and more technically capable than at any previous point. It is also facing challenges that the current architecture does not fully address.

Link rot at the repository level remains a significant problem. A DOI resolves to a URL. If the institution maintaining that URL ceases to operate, or migrates to a new system without updating the DOI record, the identifier ceases to function as a persistent identifier in any meaningful sense — it resolves to a broken page rather than to the object. A 2025 paper in The Journal of Academic Librarianship on accelerating DOI use in academic libraries identifies this as a systemic concern, noting that digital research projects tied to limited funding windows are often not hosted on library infrastructure, creating additional complications for long-term resolution maintenance.
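The failure mode above can be made concrete with a small sketch. A DOI resolves through the public doi.org proxy; checking whether the resolution chain still lands somewhere valid is, in outline, one HTTP request — though a production link-rot audit would need retry logic, rate limiting, and soft-404 detection that this illustration omits.

```python
import urllib.error
import urllib.request

def doi_to_url(doi: str) -> str:
    """A DOI resolves through the central doi.org proxy."""
    return f"https://doi.org/{doi.strip()}"

def resolves(doi: str, timeout: float = 10.0) -> bool:
    """Follow redirects and report whether the DOI still lands on a
    working page. A persistent identifier that resolves to an error
    page has ceased to function as one. (Illustrative sketch only.)"""
    req = urllib.request.Request(doi_to_url(doi), method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except urllib.error.URLError:
        return False

print(doi_to_url("10.1234/example"))  # → https://doi.org/10.1234/example
```

Note what the check cannot detect: a DOI that resolves successfully to the wrong page — a repository's generic landing page after a migration, say — passes an HTTP check while still being broken as a persistent identifier.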

The Canadian Persistent Identifier Advisory Committee's 2025 findings document fragmented identifier infrastructure across funding organisations and a general lack of urgency in the broader community around systematic PID implementation — a diagnosis that applies beyond Canada to most national research infrastructures outside the major US and European systems.

Identifier metadata quality — the problem COMET is directly addressing — remains the field's most tractable near-term challenge. The metadata associated with a persistent identifier is what makes it useful for discovery and for automated processing by research infrastructure tools. A DOI with incomplete metadata is technically persistent but functionally limited.

AI system requirements represent an emerging pressure on identifier infrastructure that the current architecture was not designed to accommodate. As research synthesis systems, agentic AI tools, and automated literature analysis pipelines increasingly consume scholarly metadata at scale, the quality and completeness of identifier metadata becomes more consequential. A metadata gap that a human researcher might notice and correct becomes a systematic bias in an automated system that processes millions of records without noticing.

What the Field Needs to Build

The persistent identifier infrastructure requires investment in three areas that current institutional priorities do not reliably support.

First, systematic metadata quality assessment — the development of automated tools that evaluate identifier records against established quality criteria and flag records requiring enrichment. COMET is building toward this; the field needs to support it with the sustained institutional commitment required to make community-contributed metadata improvement the norm rather than the exception.
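What such a quality-assessment tool might look like in outline: a record scored against a checklist of completeness criteria. The field names and criteria below are assumptions for illustration, not an established standard or any COMET specification.

```python
# Hypothetical completeness criteria for a DOI metadata record.
CRITERIA = {
    "author_orcids": lambda r: bool(r.get("authors"))
        and all(a.get("orcid") for a in r.get("authors", [])),
    "funder_listed": lambda r: bool(r.get("funders")),
    "abstract_present": lambda r: bool(r.get("abstract")),
    "subjects_assigned": lambda r: bool(r.get("subjects")),
}

def assess(record: dict) -> dict:
    """Return pass/fail per criterion plus an overall completeness
    score, so records needing enrichment can be flagged and queued."""
    results = {name: check(record) for name, check in CRITERIA.items()}
    results["score"] = sum(results.values()) / len(CRITERIA)
    return results

record = {
    "doi": "10.1234/example",
    "authors": [{"name": "A. Author", "orcid": "0000-0002-1825-0097"}],
    "funders": [],
    "abstract": "An example abstract.",
    "subjects": ["digital libraries"],
}
print(assess(record))  # flags the missing funder information
```

Run at scale over a registry's records, a scorer like this is what turns "metadata quality" from an abstract concern into a prioritised enrichment queue that community contributors can work through.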

Second, identifier infrastructure for non-traditional research outputs — datasets, software, preprints, grey literature, oral histories, indigenous knowledge records — that exist outside the DOI system's article-centric assumptions. ARK adoption is expanding in this space, but the governance and sustainability infrastructure that makes DOI resolution reliable over decades has not yet been fully replicated for the broader ARK ecosystem.

Third, integration between identifier infrastructure and the emerging AI processing layer — ensuring that the metadata associated with persistent identifiers is structured and complete enough to be reliably processed by the automated systems that will increasingly mediate access to research outputs. The identifier infrastructure was designed for human-mediated discovery. The systems consuming it in 2026 are not primarily human.

The field has built an extraordinary amount of infrastructure. Maintaining it, improving it, and extending it to the new contexts that scholarly communication requires is the work that persists after the infrastructure is built.
