Cross-encoder vs bi-encoder

Most modern retrieval systems are built from two architectures. A bi-encoder turns each text into a vector independently, which makes it fast and scalable and a good fit for first-stage search. A cross-encoder reads the query and document together and outputs a relevance score; it is slower, but usually more accurate on the pairs it scores, which is what a reranker needs.

How a bi-encoder works

A bi-encoder (also called a dual encoder) passes the query and each document through the same model separately, producing one fixed-length vector per text. Relevance is then just the cosine similarity (or dot product) between two vectors.

query ───▶ [encoder] ───▶ q⃗  ┐
                              ├─▶ cosine(q⃗, d⃗) = score
doc   ───▶ [encoder] ───▶ d⃗  ┘

The crucial property: document vectors don’t depend on the query. You can embed your whole corpus once, store the vectors in an index, and at query time only embed the query and look up nearest neighbours. That’s what makes vector search fast enough for millions of documents. The downside is that the query and document never “see” each other: each text is compressed into a single vector before any comparison happens, so the score reflects overall topical similarity and can miss whether a document actually answers the query.
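The embed-once, search-many pattern described above can be sketched in a few lines. This is a minimal illustration, not a real retrieval stack: `embed` is a toy hashed bag-of-words function standing in for a trained encoder, and `DIM`, `corpus`, `index` and `search` are names invented for the example.

```python
import math
import re
import zlib

DIM = 256  # toy vector size, chosen arbitrarily for this sketch


def embed(text):
    """Toy stand-in for a trained encoder: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        bucket = zlib.crc32(token.encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


corpus = [
    "Our pricing has free, pro and enterprise plans, billed monthly.",
    "API access is available on every plan, including the free tier.",
    "Reset your password from the account settings page.",
]

# Offline step: embed every document once, with no query involved.
index = [embed(doc) for doc in corpus]


def search(query, k=2):
    """Online step: embed only the query, then compare against stored vectors."""
    q = embed(query)
    scored = sorted(((cosine(q, d), i) for i, d in enumerate(index)), reverse=True)
    return [(corpus[i], score) for score, i in scored[:k]]
```

The scan in `search` is brute force for clarity. At corpus scale, the stored vectors would normally sit in an approximate nearest-neighbour index so that the lookup does not touch every document.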

How a cross-encoder works

A cross-encoder concatenates the query and document into one sequence (in BERT-style models, [CLS] query [SEP] document [SEP]) and runs the pair through the transformer together. Self-attention lets every query token interact with every document token, and the model outputs a single relevance score.

query + doc ───▶ [encoder, joint self-attention] ───▶ relevance score

This is typically more accurate because the model can attend to how specific words in the query relate to specific words in the document, rather than comparing two independently computed summaries. The catch: the score depends on the specific pair, so no document-side work can be precomputed. Every (query, document) combination is a fresh forward pass, which is why you run a cross-encoder only on a shortlist, never on the whole corpus. Scoring that shortlist is reranking.
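To make the cost model concrete, here is a rough sketch of pairwise scoring. `ToyCrossEncoder` is a hypothetical stand-in, not a real model: its word-overlap score is only a placeholder for a learned scoring head, and `build_pair_input`, `forward_passes` and `rerank` are names made up for this example. What it does show is the joint input layout and the fact that the number of model calls equals the number of candidates.

```python
def build_pair_input(query, document):
    # BERT-style layout: one sequence containing both texts.
    return f"[CLS] {query} [SEP] {document} [SEP]"


class ToyCrossEncoder:
    """Hypothetical stand-in for a trained cross-encoder model."""

    def __init__(self):
        self.forward_passes = 0

    def score(self, query, document):
        # One call corresponds to one forward pass over the joint sequence.
        self.forward_passes += 1
        sequence = build_pair_input(query, document)
        tokens = [t.strip(".,?!").lower() for t in sequence.split()]
        sep = tokens.index("[sep]")
        query_tokens = set(tokens[1:sep])
        doc_tokens = set(tokens[sep + 1:-1])
        # Placeholder for the learned scoring head: the share of query words
        # that also appear in the document. A real model learns this function.
        return len(query_tokens & doc_tokens) / max(len(query_tokens), 1)


def rerank(model, query, candidates):
    # Nothing can be cached per document: every pair is scored from scratch.
    scored = [(model.score(query, doc), doc) for doc in candidates]
    return sorted(scored, reverse=True)
```

Calling `rerank` with a new query on the same candidates repeats all the work, which is the practical meaning of "you can’t precompute anything".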

Side-by-side comparison

Property	Bi-encoder	Cross-encoder
Input	Query and doc encoded separately	Query and doc encoded together
Output	A vector per text	One relevance score per pair
Precompute corpus?	Yes — embed once, reuse	No — must score at query time
Speed	Very fast (vector lookup)	Slow (one model call per candidate)
Accuracy	Good for recall	Excellent for precision
Scales to millions of docs?	Yes	No — only a shortlist
Typical role	First-stage retrieval	Second-stage reranking

Why you use both

They’re complementary, not competing. The bi-encoder’s job is recall: cheaply pull a few dozen candidates that probably contain the answer. The cross-encoder’s job is precision: carefully reorder that shortlist so the best candidates end up at the top.

The bi-encoder finds the promising corner of the haystack. The cross-encoder finds the needle in it.

Using only one is usually a mistake: a cross-encoder alone needs one forward pass per document, which is too slow for a million-document corpus, and a bi-encoder alone returns results in an order that a reranker could often improve. The standard answer is the two-stage pipeline: retrieve with the bi-encoder, rerank with the cross-encoder.
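The two-stage pipeline can be sketched as a single function that takes the two scorers as arguments. This is an illustrative outline under stated assumptions: `retrieve_then_rerank`, `shortlist_size` and `top_k` are names chosen for the example, and the two scorers below are trivial word-overlap stand-ins rather than real models.

```python
def retrieve_then_rerank(query, corpus, cheap_score, careful_score,
                         shortlist_size=50, top_k=5):
    """Two-stage search: cheap scoring over everything, careful scoring over a shortlist."""
    # Stage 1 (recall): rank the whole corpus with the cheap scorer.
    # A real system would query a vector index here instead of scanning.
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc),
                       reverse=True)[:shortlist_size]
    # Stage 2 (precision): rescore only the shortlist with the expensive scorer.
    reranked = sorted(shortlist, key=lambda doc: careful_score(query, doc),
                      reverse=True)
    return reranked[:top_k]


# Hypothetical stand-ins so the sketch runs without any model.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))


calls = {"careful": 0}  # counts how often the expensive scorer runs


def counted_careful_score(query, doc):
    calls["careful"] += 1
    return word_overlap(query, doc) / (len(doc.split()) or 1)


corpus = [f"note {i} about topic {i % 7}" for i in range(1000)]
results = retrieve_then_rerank("topic 3", corpus, word_overlap,
                               counted_careful_score,
                               shortlist_size=20, top_k=5)
```

The expensive scorer runs `shortlist_size` times per query regardless of corpus size, so `shortlist_size` is the knob that trades latency against the chance that the right document was cut before reranking.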

A concrete example

Take the query “Does the free plan include API access?” and two candidates:

A: “Our pricing has free, pro and enterprise plans, billed monthly.”
B: “API access is available on every plan, including the free tier.”

A bi-encoder may rank A highly: it is densely on-topic, with “free” and “plans” in a passage about pricing, so its vector can land close to the query’s. But only B actually answers the question. A cross-encoder, reading the query and each candidate together, can tell that B resolves the specific ask and push it to the top. That gap is the reason rerankers exist.
