What is a reranker?
A reranker is a model that takes a query and a list of candidate documents and reorders them by how relevant each one actually is to the query. It almost always runs as a second stage: something fast retrieves a broad set of candidates, then the reranker carefully re-scores the top of that list.
Why ranking order is a problem
Modern search and retrieval-augmented generation (RAG) usually start with a vector database. You embed your documents once, embed the query at request time, and fetch the nearest neighbours by cosine similarity. This is fast and scales to millions of documents — but the ordering it produces is only roughly right.
The reason is structural. To stay fast, the retriever embeds the query and each document independently, with no chance for the two texts to interact. A passage that merely shares vocabulary with the query can score as highly as one that genuinely answers it. So the correct document is often in the top 50 — just not at position 1, 2 or 3, which is exactly where it needs to be when you only feed a few passages to an LLM.
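To make the first-stage lookup concrete, here is a minimal sketch of nearest-neighbour retrieval by cosine similarity. It assumes the embeddings already exist; the three-dimensional vectors, the document ids, and the `retrieve` helper are made up for illustration and are not taken from any particular vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k):
    """Return the ids of the k documents whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings" (hypothetical; real ones have hundreds of dimensions).
# Each document vector is computed without ever seeing the query.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

top = retrieve(query_vec, doc_vecs, k=2)
print(top)
```

Note that the only point of contact between query and document is the final dot product, which is why this stage is fast and also why its ordering is coarse.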
Retrieval is good at recall (“is the answer somewhere in the shortlist?”). It’s weaker at precision at the top of the list (“is the answer at the very top?”). Reranking targets that gap.
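One way to put numbers on that distinction is sketched below, using recall@k for "is it in the shortlist?" and reciprocal rank as a stand-in for "is it at the top?". The choice of reciprocal rank is an assumption of this example, and the ranked list is invented.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

# Hypothetical retriever output: the one relevant document sits at position 4.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d4"}

recall_top5 = recall_at_k(ranked, relevant, k=5)  # the answer is in the shortlist
recall_top3 = recall_at_k(ranked, relevant, k=3)  # but it misses the top 3
rr = reciprocal_rank(ranked, relevant)            # first hit at position 4

print(recall_top5, recall_top3, rr)
```

A reranker leaves recall over the full shortlist unchanged, since it only reorders what was retrieved. What it can change is the position of the relevant document, and therefore recall at a smaller k and the reciprocal rank.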
The two-stage retrieval pattern
Rerankers exist because of a speed-versus-accuracy trade-off. The accurate way to compare a query and a document is to run them through a model together, but doing that for every document in your corpus on every query would be prohibitively slow. So we split the work:
Stage 1 — Retrieve (fast, approximate)
vector search / BM25 over the whole corpus → top 50–100 candidates
Stage 2 — Rerank (slow per item, but only on the shortlist)
cross-encoder scores each (query, candidate) pair → reorder → keep top 3–10
The retriever casts a wide net cheaply; the reranker applies an expensive, accurate model only to the handful of candidates that survived. You get most of the quality of running the big model everywhere, at a tiny fraction of the cost.
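The two stages above can be sketched as a small function that takes the cheap and expensive scorers as parameters. Everything here is illustrative: `two_stage_search`, `word_overlap`, and `stand_in_reranker` are hypothetical names, and the "expensive" scorer is a trivial placeholder rather than a real cross-encoder. The call counters are there only to show how rarely the second stage runs.

```python
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    shortlist_size: int = 50,
    final_size: int = 3,
) -> List[Tuple[float, str]]:
    # Stage 1: score every document with the cheap function, keep a shortlist.
    by_cheap = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)
    shortlist = by_cheap[:shortlist_size]
    # Stage 2: run the expensive scorer on the shortlist only, then reorder.
    rescored = [(expensive_score(query, doc), doc) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:final_size]

calls = {"cheap": 0, "expensive": 0}

def word_overlap(query: str, doc: str) -> float:
    """Cheap stand-in for vector search or BM25: count of shared words."""
    calls["cheap"] += 1
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def stand_in_reranker(query: str, doc: str) -> float:
    """Placeholder for a cross-encoder: fraction of query words found in the doc."""
    calls["expensive"] += 1
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

target = "how to reset a router password"
corpus = [f"filler document number {i}" for i in range(1000)] + [target]

results = two_stage_search(
    "reset router password", corpus, word_overlap, stand_in_reranker,
    shortlist_size=10, final_size=3,
)
print(results[0], calls)
```

In this toy run the cheap scorer touches all 1,001 documents while the expensive one runs on only 10, which is the cost asymmetry the pattern relies on.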
How a rerank model scores relevance
Most rerankers are cross-encoders. The query and a candidate document are concatenated into a single input and passed through a transformer, which outputs one number: a relevance score. Because every token of the query can attend to every token of the document, the model can judge things a similarity score can’t — negation, specificity, whether the passage truly answers the question rather than just mentioning the topic.
You run this once per candidate, then sort by score. The output is typically turned into a 0–1 value (via a sigmoid) so it’s easy to threshold or display.
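As a rough sketch of that last step, the snippet below takes one raw score per candidate, squashes each through a sigmoid, sorts, and applies a threshold. The raw scores are made-up numbers standing in for a cross-encoder's output on each (query, candidate) pair, and the `rerank` helper and the 0.5 threshold are assumptions of this example.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw score to the 0-1 range, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def rerank(candidates, raw_scores, threshold=0.5):
    """Pair each candidate with sigmoid(raw score), sort descending, drop low scorers."""
    scored = [(sigmoid(s), doc) for doc, s in zip(candidates, raw_scores)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), doc) for score, doc in scored if score >= threshold]

candidates = ["passage A", "passage B", "passage C"]
raw_scores = [-2.0, 3.0, 0.5]  # hypothetical model outputs, one per candidate

ranked = rerank(candidates, raw_scores)
print(ranked)
```

Because the sigmoid is monotonic, it never changes the ordering. It only makes the scores easier to threshold or display.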
The key contrast: a bi-encoder embeds query and document separately and compares vectors — fast, but coarse. A cross-encoder reads them together and scores the pair — slow per item, but far more accurate. Rerankers are the cross-encoders you apply to a short list.
When you should (and shouldn’t) rerank
Reranking pays off when:
- You feed retrieved context to an LLM and want the most relevant passages first (classic RAG).
- Your top-k looks “mostly right but mis-ordered”, or the right answer is in the top 50 but not the top 5.
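- You can afford the extra latency of a second model call (often tens of milliseconds for a small model and a short shortlist, more for larger models or longer lists) in exchange for a meaningful jump in answer quality.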
- You want to send fewer passages to the LLM (cheaper prompts) without losing the good one.
It’s less useful when:
- Your retriever already returns near-perfect ordering for your domain.
- You’re extremely latency-bound and can’t add a second model call.
- Your candidate set is tiny (rerank 3 items and there’s little to reorder).
Try it yourself
The fastest way to build intuition is to watch a reranker work. Our in-browser demo loads a real cross-encoder and scores your own query against passages you paste — entirely on your device, no API key, no cost. Drop in a couple of off-topic lines and watch them sink.