Cross-encoder vs bi-encoder

Architecture · ~8 min read

These two architectures underpin most modern retrieval systems. A bi-encoder turns each text into a vector independently, which makes it fast and scalable and a good fit for first-stage search. A cross-encoder reads the query and document together and outputs a relevance score; it is slower but usually more accurate, which is exactly what a reranker needs.

How a bi-encoder works

A bi-encoder (also called a dual encoder) passes the query and each document through the same model separately, producing one fixed-length vector per text. Relevance is then just the cosine similarity (or dot product) between two vectors.

query ───▶ [encoder] ───▶ q⃗  ┐
                              ├─▶ cosine(q⃗, d⃗) = score
doc   ───▶ [encoder] ───▶ d⃗  ┘

The crucial property: document vectors don’t depend on the query. You can embed your whole corpus once, store the vectors in an index, and at query time only embed the query and look up nearest neighbours. That’s what makes vector search fast enough for millions of documents. The downside is that the query and document never “see” each other: each text is compressed into a single vector before any comparison happens, so the score can miss fine-grained interactions between specific query terms and specific passages.
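The embed-once, look-up-later pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production setup: `embed` is a toy hashed bag-of-words function standing in for a trained encoder, the names `DIM`, `build_index`, and `search` are invented here, and the brute-force scan stands in for the nearest-neighbour index a real system would use.

```python
import math
import re
import zlib
from typing import List, Tuple

DIM = 256  # assumed toy embedding size, chosen only for this sketch


def embed(text: str) -> List[float]:
    """Toy stand-in for a trained encoder: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    # Vectors are already unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


def build_index(corpus: List[str]) -> List[Tuple[str, List[float]]]:
    # Done once, offline: embedding a document needs no query.
    return [(doc, embed(doc)) for doc in corpus]


def search(query: str, index, top_k: int = 3) -> List[Tuple[float, str]]:
    q = embed(query)  # the only encoder call made at query time
    scored = [(cosine(q, d_vec), doc) for doc, d_vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `build_index` once and `search` per query mirrors the offline/online split described above: the document vectors are reused unchanged for every query.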

How a cross-encoder works

A cross-encoder concatenates the query and document into one sequence — [CLS] query [SEP] document [SEP] — and runs the pair through the transformer together. Self-attention lets every query token interact with every document token, and the model outputs a single relevance score.

query + doc ───▶ [encoder, full self-attention over the pair] ───▶ relevance score

This is typically more accurate because the model can reason about how the two texts relate, rather than comparing two vectors that were computed in isolation. The catch: the score depends on the specific pair, so you can’t precompute scores before the query arrives. Every (query, document) combination is a fresh forward pass, which is why you only run a cross-encoder on a shortlist, never on the whole corpus. That shortlisted use is reranking.
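The pair-at-a-time behaviour can be sketched as follows. This is an illustration only: `toy_forward_pass` is a hypothetical stand-in for a transformer (it returns the fraction of query tokens found in the document), and `build_pair_input` and `score_candidates` are names invented for this sketch. The point it shows is structural: the input is one joined sequence, and the number of forward passes equals the number of candidates.

```python
import re
from typing import List, Tuple


def build_pair_input(query: str, document: str) -> str:
    # Both texts go into ONE sequence, so the model can attend across them.
    return f"[CLS] {query} [SEP] {document} [SEP]"


def toy_forward_pass(pair_input: str) -> float:
    """Hypothetical stand-in for a transformer scoring the joined pair."""
    body = pair_input[len("[CLS] "):-len(" [SEP]")]
    query, document = body.split(" [SEP] ", 1)
    q_tokens = set(re.findall(r"\w+", query.lower()))
    d_tokens = set(re.findall(r"\w+", document.lower()))
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def score_candidates(query: str, candidates: List[str]) -> Tuple[List[Tuple[float, str]], int]:
    forward_passes = 0
    scored = []
    for doc in candidates:
        # Nothing can be cached across queries: the input includes the query.
        scored.append((toy_forward_pass(build_pair_input(query, doc)), doc))
        forward_passes += 1
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored, forward_passes
```

Because the query is part of every input, changing the query means re-running every pair, which is the cost the shortlist is there to contain.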

Side-by-side comparison

Property                    | Bi-encoder                       | Cross-encoder
----------------------------|----------------------------------|------------------------------------
Input                       | Query and doc encoded separately | Query and doc encoded together
Output                      | A vector per text                | One relevance score per pair
Precompute corpus?          | Yes: embed once, reuse           | No: must score at query time
Speed                       | Very fast (vector lookup)        | Slow (one model call per candidate)
Accuracy                    | Good for recall                  | Excellent for precision
Scales to millions of docs? | Yes                              | No: only a shortlist
Typical role                | First-stage retrieval            | Second-stage reranking
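The scaling row can be made concrete with a back-of-the-envelope calculation. The figures below are assumptions chosen for illustration, not measurements: a hypothetical 10 ms per forward pass, a corpus of one million documents, and a shortlist of 50. The sketch also assumes pairs are scored one after another, ignoring batching and hardware parallelism.

```python
def rerank_seconds(num_candidates: int, seconds_per_pass: float) -> float:
    # One forward pass per (query, document) pair, run sequentially.
    return num_candidates * seconds_per_pass


ASSUMED_SECONDS_PER_PASS = 0.01  # hypothetical 10 ms per pair; measure your own model

whole_corpus = rerank_seconds(1_000_000, ASSUMED_SECONDS_PER_PASS)
shortlist = rerank_seconds(50, ASSUMED_SECONDS_PER_PASS)
```

With these assumed numbers, scoring the full corpus takes about 10,000 seconds (close to three hours) per query, while scoring the shortlist takes about half a second.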

Why you use both

They’re complementary, not competing. The bi-encoder’s job is recall: cheaply pull a few dozen candidates that are likely to include the answer. The cross-encoder’s job is precision: carefully reorder that shortlist so the most relevant candidates come first.

Bi-encoder finds the haystack’s promising corner. Cross-encoder finds the needle in it.

Trying to use only one is usually a mistake: a cross-encoder alone can’t scan a million documents in time, and a bi-encoder alone leaves quality on the table. The standard answer is the two-stage pipeline — retrieve with the bi-encoder, rerank with the cross-encoder.
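The two-stage pipeline can be sketched as one small function. This is a minimal outline under stated assumptions: `retrieve` and `pair_score` are hypothetical callables supplied by the caller (for example a vector-index lookup and a cross-encoder call), and the defaults for `shortlist_size` and `top_k` are illustrative, not recommendations.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    pair_score: Callable[[str, str], float],
    shortlist_size: int = 50,
    top_k: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1 (recall): cheap lookup over the whole corpus.
    shortlist = retrieve(query, shortlist_size)
    # Stage 2 (precision): one expensive call per shortlisted candidate.
    scored = [(pair_score(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Keeping the two stages behind plain callables means either one can be swapped out without touching the other, and `shortlist_size` becomes the single knob that trades latency against how much the reranker gets to see.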

A concrete example

Take the query “Does the free plan include API access?” and two candidates:

A bi-encoder may rank A highly — it’s densely on-topic about “plans” and “pricing”, sharing lots of vocabulary with the query. But B actually answers the question. A cross-encoder, reading the query and each candidate together, can tell that B resolves the specific ask and push it to the top. That gap is the whole reason rerankers exist.

Feel the difference yourself

Our demo runs a cross-encoder in your browser. Paste a tricky query and watch which passage it promotes.

Open the live demo →
