AlphaFold Filled the Shelves — and Left the Protein Database With a Curation Problem

The shelves were nearly empty

It is worth holding the scale of the old scarcity in mind, because it explains the euphoria that followed. A protein's function is largely determined by how its amino-acid chain folds into a shape, so the shape is the thing biomedicine most wants to know, both for understanding disease and for designing drugs. Yet for fifty years, getting that shape meant the slow, expensive, physical work of the wet lab. The experimental record grew, and it was trustworthy, but it covered a vanishingly small fraction of known proteins. The gap between what had been sequenced and what had been structurally observed was the defining bottleneck of the field.

Then they overflowed overnight

AlphaFold ended the scarcity. DeepMind's AlphaFold2 demonstrated in 2020 that a neural network could predict protein structure from sequence with accuracy that, in many cases, approached that of experimental methods, a result that effectively closed a fifty-year-old challenge. In partnership with EMBL's European Bioinformatics Institute, the AlphaFold Protein Structure Database launched in 2021 with roughly 360,000 predicted structures and, within a year, expanded to over 200 million, covering very nearly every protein catalogued by science and drawn from more than a million organisms. The work was recognized with the 2024 Nobel Prize in Chemistry, and that same year AlphaFold 3 extended prediction to the interactions between proteins and other molecules, with a public server opening the capability to researchers worldwide.

By any measure of access, this is one of the great achievements in the history of scientific infrastructure. The experimental record took half a century to reach a couple of hundred thousand entries; the predicted one outgrew it by three orders of magnitude in about a year, free to anyone. The supply problem is over. What replaced it is subtler, and the field has been slower to name it.

A library of predictions is not a library of observations

Here is the distinction that the sheer abundance tends to obscure. The Protein Data Bank holds observations — structures determined by measuring real molecules. The AlphaFold database holds predictions — extraordinarily good hypotheses about what those structures probably are, generated by a model. Both are valuable. They are not the same kind of object, and an archive that lets them blur together is no longer telling its users the truth about what it contains.

This is the curation problem in one sentence: when a repository fills with predictions that look exactly like observations — the same elegant ribbon diagrams, the same file formats, the same easy reuse — the most important piece of information about any given structure becomes not the structure itself but its provenance. Was this shape measured, or was it inferred? A field that loses the ability to answer that question quickly, at scale, for every object it serves, has not gained 200 million structures. It has gained 200 million claims of unequal and unmarked standing.

Confidence is metadata, and metadata is what gets ignored

To its great credit, AlphaFold does not pretend to certainty. Every prediction ships with confidence estimates — a per-residue score indicating how reliable each part of the model is, and a measure of expected error between regions. These are precisely the right signals. A well-folded, high-confidence domain and a low-confidence stretch that the model is essentially guessing at are visibly different in the data, if you look.
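
To make "if you look" concrete, here is a minimal sketch of reading the per-residue score out of a predicted structure. It assumes a PDB-format file in which the per-residue confidence (pLDDT) occupies the B-factor column, as AlphaFold's PDB-format output does, and it uses the four confidence bands the AlphaFold database displays (above 90, 70 to 90, 50 to 70, below 50). The names read_plddt, band, and low_confidence are illustrative, not part of any AlphaFold tooling, and the sketch does not touch the separate expected-error measure between regions.

```python
from typing import NamedTuple


class ResidueConfidence(NamedTuple):
    chain: str
    number: int
    name: str
    plddt: float


def read_plddt(pdb_text: str) -> list[ResidueConfidence]:
    """Collect one confidence value per residue from PDB-format text.

    Assumption: the B-factor field (columns 61-66) holds pLDDT.
    Only the alpha-carbon (CA) atom of each residue is read.
    """
    residues = []
    for line in pdb_text.splitlines():
        is_atom = line.startswith("ATOM") and len(line) >= 66
        if is_atom and line[12:16].strip() == "CA":
            residues.append(
                ResidueConfidence(
                    chain=line[21],
                    number=int(line[22:26]),
                    name=line[17:20].strip(),
                    plddt=float(line[60:66]),
                )
            )
    return residues


def band(plddt: float) -> str:
    """Map a pLDDT value to a named confidence band."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def low_confidence(residues, threshold: float = 70.0):
    """Return residues whose score falls below the threshold (70 is an illustrative default)."""
    return [r for r in residues if r.plddt < threshold]
```

A pipeline that calls something like low_confidence before using a structure has at least looked at the signal; one that reads only the coordinates has discarded it.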

The trouble is the oldest one in information science: metadata that exists is not the same as metadata that is used. A confident-looking three-dimensional ribbon is psychologically persuasive in a way a numerical confidence score is not, and downstream the danger compounds. When a predicted structure is pulled into an analysis pipeline, a figure, or a training set for yet another model, the confidence scores are exactly the thing most likely to be stripped away — leaving a guess traveling through the literature with all the visual authority of a measurement. Intrinsically disordered regions, where proteins have no single stable shape and where the model's confidence is correctly low, are a standing trap for anyone who treats the prettiest available structure as ground truth.

The provenance problem the field already knows how to name

This is the point at which structural biology's new difficulty becomes recognizable as something the digital-libraries community has theorized for decades. Distinguishing derived objects from source objects; carrying provenance as a first-class, non-optional property; maintaining fixity so a record's status cannot silently change; versioning when the generating process is updated — these are not novel challenges invented by AlphaFold. They are the core curatorial commitments behind FAIR data and behind every serious preservation program.

And the versioning piece is not hypothetical. Models improve; predictions change. A structure generated by one version of AlphaFold is not guaranteed to match the one a later version produces, which means a citation to "the AlphaFold structure" without a model version and access date is the structural-biology equivalent of citing a webpage with no date — a reference to a moving target. Reproducibility, in a world of model-generated data, depends on provenance metadata that the excitement of abundance makes it tempting to skip.
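
One way to picture a citation that is not a moving target is a small record that pins the model version, the access date, and a checksum of the file actually used. The sketch below is an assumption about what such a record might hold, not an established citation standard; the field names, the PredictionCitation type, and the sample identifiers in the usage note are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PredictionCitation:
    accession: str      # identifier of the predicted entry (hypothetical format)
    model_version: str  # version of the model or database release that produced it
    accessed: date      # date the file was retrieved
    sha256: str         # fixity: digest of the exact bytes that were used


def cite(accession: str, model_version: str, accessed: date,
         file_bytes: bytes) -> PredictionCitation:
    """Build a citation record pinned to the file's content."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return PredictionCitation(accession, model_version, accessed, digest)


def is_same_object(citation: PredictionCitation, file_bytes: bytes) -> bool:
    """Check whether a file retrieved later is byte-identical to the cited one."""
    return hashlib.sha256(file_bytes).hexdigest() == citation.sha256


def format_citation(c: PredictionCitation) -> str:
    """Render the record as a one-line reference string."""
    return (f"{c.accession}, model {c.model_version}, "
            f"accessed {c.accessed.isoformat()}, sha256:{c.sha256[:12]}")
```

With a record like cite("EXAMPLE-0001", "v-hypothetical", date(2024, 1, 15), data) stored alongside an analysis, a later reader can tell whether the structure they download today is the one the analysis relied on, which is the question an undated reference cannot answer.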

What good curation looks like here

The remedy is not to distrust AlphaFold, which would be foolish, but to treat its output with the curatorial seriousness its scale demands. Predicted and experimental structures should be unmistakably labeled as such, everywhere they travel, not only in the database of origin. Confidence should ride with the structure as first-class metadata that downstream tools are built to preserve rather than discard. The generating model's version and date belong in every citation, exactly as edition and date belong in a bibliographic record. And the conceptual relationship should be kept straight: the prediction database is best understood as a vast hypothesis-generation layer sitting over the experimental record, accelerating the questions worth asking — not as a replacement that quietly overwrites the distinction between what science has measured and what it has guessed.
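
What "built to preserve rather than discard" could mean in a downstream tool is sketched below, under loud assumptions: the Structure type, the Origin labels, and the region method are invented for illustration and do not correspond to any existing library. The idea is only that a predicted structure cannot be constructed without its confidence, and that a derived object inherits both the label and the scores.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    EXPERIMENTAL = "experimental"  # measured from real molecules
    PREDICTED = "predicted"        # generated by a model


@dataclass(frozen=True)
class Structure:
    identifier: str
    origin: Origin
    coords: tuple[tuple[float, float, float], ...]  # one position per residue
    confidence: Optional[tuple[float, ...]] = None  # per-residue score, if predicted

    def __post_init__(self):
        # A prediction without its confidence is exactly the object to rule out.
        if self.origin is Origin.PREDICTED:
            if self.confidence is None:
                raise ValueError("predicted structure must carry per-residue confidence")
            if len(self.confidence) != len(self.coords):
                raise ValueError("confidence must have one value per residue")

    def region(self, start: int, stop: int) -> "Structure":
        """Extract a sub-range; the label and the matching scores travel with it."""
        conf = None if self.confidence is None else self.confidence[start:stop]
        return Structure(self.identifier, self.origin, self.coords[start:stop], conf)
```

The design choice is to make dropping the metadata an error rather than a default: any transformation that returns a Structure has to say where the result came from and how far it can be trusted.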

AlphaFold did not end structural biology's data problem. It moved it — from acquisition to curation, from the wet lab to the archive. The shelves are full now, fuller than anyone dared hope a decade ago. Whether that abundance becomes durable knowledge or a beautifully rendered fog depends on whether the field treats its new library the way librarians have always insisted serious collections must be treated: with provenance attached, confidence preserved, and observation never silently confused with inference. That is not a constraint on the achievement. It is what will let the achievement last.
