JCDL 2004
Digital Libraries Summit
Vector Embeddings Are Not Meaning — What Semantic Search Actually Does to a Digital Library

For most of the field's history, a library catalog was honest about its limitations. You typed words, and it found records containing those words. When it failed, it failed legibly: the term was wrong, the spelling was off, the subject heading was not the one the cataloger had chosen. The system's ignorance was visible, and visible ignorance is the kind a trained searcher can work around. Semantic search broke that contract, and mostly for the better. You can now type a clumsy natural-language question into a modern discovery layer and watch it return documents that share not a single keyword with your query but are, unmistakably, about the thing you meant. It feels like the catalog has finally learned to understand. It has not. What it has learned to do is measure distance in a space built from statistics — and the gap between those two things is the most consequential thing a digital library professional can understand about the technology now entering their stacks.

The promise that seduced the field

The appeal is real and worth stating plainly, because dismissing it would be dishonest. Keyword retrieval has a brittle floor: it cannot connect "myocardial infarction" to "heart attack," or a query about "raising chickens" to a document titled "backyard poultry husbandry," unless someone anticipated the synonymy in advance. Decades of controlled vocabularies and thesauri existed precisely to paper over that brittleness, at enormous human cost.

Vector search appears to dissolve the problem. Train a model on a large corpus, and it will place "heart attack" and "myocardial infarction" close together on its own, having never been told they are related. For a field that has spent a century building synonym rings by hand, that looks like magic. The temptation is to treat it as comprehension. That temptation is the whole danger.

What an embedding actually is

Strip away the marketing and an embedding is a coordinate. A neural model, trained to predict words from their neighbors across billions of examples, learns to represent each word, sentence, or document as a long list of numbers: a point in a space of hundreds or thousands of dimensions. Things that tended to appear in similar contexts end up near each other; things that did not, end up far apart. "Search" then becomes geometry: convert the query to a point, and return the documents whose points are closest. Closeness is usually measured by the angle between the vectors (cosine similarity), and at scale the closest points are found through approximate nearest-neighbor algorithms that trade exactness for speed.
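To make the geometry concrete, here is a minimal sketch of the ranking step, assuming the angle between vectors is scored as cosine similarity. The three-dimensional vectors and document titles are hand-made for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions. The search is exact brute force, not the approximate nearest-neighbor indexing a production system would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def nearest(query_vec, doc_vecs, k=2):
    """Exact (brute-force) top-k search: score every document, sort, cut."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors, invented for this example only.
DOCS = {
    "myocardial infarction review": [0.9, 0.1, 0.0],
    "backyard poultry husbandry": [0.0, 0.2, 0.9],
    "cardiac rehabilitation guide": [0.7, 0.3, 0.1],
}
QUERY = [1.0, 0.0, 0.0]  # stands in for the query "heart attack"

results = nearest(QUERY, DOCS, k=2)
for doc_id, score in results:
    print(f"{score:.3f}  {doc_id}")
```

Nothing in this code knows what any title means. The ranking is arithmetic on coordinates and nothing else.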

That is the entire trick, and it is genuinely powerful. But notice what it is built from. The model never learned what a heart attack is. It learned that the strings "heart attack" and "myocardial infarction" keep similar company. The proximity it reports is a summary of how language was used in its training corpus — nothing more, and nothing that has been checked against the world.

Proximity is not aboutness

Here is the load-bearing distinction, the one the field cannot afford to blur. Geometric closeness in embedding space is correlation of usage. The relevance a library is supposed to deliver is aboutness — whether a document genuinely concerns the user's actual question. These overlap often enough to be useful and diverge often enough to be dangerous.

An embedding will happily rank a passage as "similar" because it shares tone, topic-adjacent vocabulary, and rhetorical shape with the query, while being about something subtly but importantly different — the opposing position in a debate, a superficially related disease with a different mechanism, a satirical treatment of a serious subject. Keyword search failed loudly and let you correct it. Vector search fails quietly and plausibly, returning a confident, well-ranked list whose logic you cannot inspect. For casual discovery, that trade is fine. For a clinician, a legal researcher, or a scholar building an argument on the completeness of a literature search, "plausibly relevant" is not the same standard as "relevant," and the system no longer tells you which one it delivered.

This tension between semantic approximation and actual intent is increasingly visible far beyond libraries and academic retrieval systems. Modern digital platforms of every kind — recommendation feeds, AI assistants, streaming systems, search engines, and online entertainment ecosystems — are now built on predictive relevance models that attempt to anticipate user behavior before the user fully articulates it themselves. The result is a web optimized less for explicit navigation and more for continuous probabilistic inference.

The black box moves into the catalog

A controlled vocabulary has an underrated virtue: you can argue with it. A subject heading is a human decision, recorded, contestable, and improvable. If a cataloger assigned the wrong term, a searcher can see the term, understand why the record surfaced, and route around the error. The reasoning is on the surface.

An embedding offers no such surface. Ask it why it ranked a document third rather than thirtieth and there is no answer in any language a person speaks — only the arithmetic of a vector comparison. The expertise of the reference librarian, historically built on understanding how the index worked well enough to outwit it, has nothing to grip. This is not a minor ergonomic complaint. A library is, among other things, an accountable system for connecting people to knowledge, and accountability requires that retrieval decisions be at least in principle explicable. A retrieval substrate that cannot explain itself is in tension with that mission, however good its results look on average.

Bias becomes geometry

Every embedding model inherits the statistical regularities of the corpus it was trained on, and that includes the corpus's distortions. If a body of text systematically associates certain professions, regions, languages, or populations with certain attributes, those associations are encoded as geometry — as the actual distances the search engine will use to decide what you see.

In a recommender for shopping, that is a commercial problem. In a library that aspires, however imperfectly, to neutral and equitable access to the record, it is something graver: the bias is no longer in a human cataloging decision that can be reviewed and revised, but in the high-dimensional substrate beneath retrieval, invisible and unaudited. Collections built to broaden access can, through an embedding trained on a skewed corpus, quietly narrow it again — surfacing the well-represented and burying the marginal, with no heading anyone can point to and fix.

What the field should actually do

None of this is an argument against vector search. It is an argument against mistaking it for an oracle, and the practical response follows directly from the diagnosis.

Treat the embedding as one signal, not the authority. The strongest retrieval systems in production are hybrid — combining dense vector similarity with classical lexical methods — precisely because each catches what the other misses, and because lexical matching restores a thread of legibility. Keep human-curated metadata as ground truth and provenance, not as a quaint relic the model has superseded; the subject heading is the auditable layer that the embedding cannot provide. Evaluate retrieval against real, documented relevance judgments rather than the demo-day impression that the results "look right." And extend to the embedding model the same provenance discipline the field already demands of its objects: which model, which version, trained on what, updated when. A discovery layer whose behavior silently changed because someone swapped the embedding model is a preservation and reproducibility problem wearing a machine-learning costume — and it is squarely the field's problem to govern.
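As a rough illustration of two of these recommendations, the sketch below merges a lexical ranking and a dense ranking with reciprocal rank fusion, then scores the merged list against relevance judgments using precision at k. Reciprocal rank fusion is one common fusion method, chosen here for simplicity, and is not necessarily what any particular discovery layer uses. The record identifiers, both rankings, and the judgment set are invented for the example; the constant k=60 is the conventional default for this method.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that appear in the judged-relevant set."""
    top = ranked[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k

# Hypothetical rankings for one query, invented for illustration.
lexical_ranking = ["rec-12", "rec-07", "rec-33"]  # e.g. keyword matching
dense_ranking = ["rec-07", "rec-41", "rec-12"]    # e.g. vector similarity

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])

# Hypothetical documented relevance judgments for the same query.
judged_relevant = {"rec-07", "rec-33"}

print(fused)
print(precision_at_k(fused, judged_relevant, k=2))
```

Records that both methods rank highly rise to the top, while a record only one method found still appears lower in the list. The judgment set gives the evaluation a fixed reference point that can be rerun whenever the embedding model changes.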

The embedding is a brilliant instrument and a poor authority. It has given digital libraries a genuinely new capability: finding the document you meant rather than the words you typed. What it has not given them is understanding, and the institutions that carry the record forward will be the ones that keep that distinction sharp — that use the instrument without surrendering to it the thing only a library is supposed to guarantee, which is that you can trust, and check, what you were shown.
