How to add reranking to your RAG pipeline
The fix for mediocre RAG answers is rarely a bigger LLM; it's usually a better retrieval order. A reranker sits between your vector search and your LLM call, re-scores the candidates, and moves the most relevant passages to the top so they are the ones that land in the prompt.
The ordering problem in RAG
A typical RAG system embeds your documents and stores the vectors. At query time it fetches the k nearest vectors to the query embedding and stuffs those chunks into the LLM prompt. The problem: the query and each chunk are embedded independently, so cosine similarity between them is only a coarse relevance signal. The correct chunk might be in the top 20 results but sit at position 14, outside the 5 you actually send to the model.
This is not the "lost in the middle" problem, where a model underuses relevant text buried in the middle of a long prompt. Here the right chunk never reaches the prompt, because it was never near the top of the retrieval order in the first place. Reranking fixes that by applying a more expensive, more accurate relevance model to the shortlist the retriever already found.
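To check whether you actually have this ordering problem, you can measure where the known-correct chunk lands in the retrieved list. The sketch below assumes you have a small labelled set of queries with a "gold" chunk id each; the helper names rank_of and hit_rate_at_k and the chunk ids are made up for illustration.

```python
from typing import Optional


def rank_of(gold_id: str, ranked_ids: list[str]) -> Optional[int]:
    """Return the 1-based position of gold_id in ranked_ids, or None if absent."""
    try:
        return ranked_ids.index(gold_id) + 1
    except ValueError:
        return None


def hit_rate_at_k(results: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of queries whose gold chunk appears in the top k results."""
    if not results:
        return 0.0
    hits = sum(1 for gold_id, ranked_ids in results if gold_id in ranked_ids[:k])
    return hits / len(results)


# Hypothetical retrieval output: (gold chunk id, chunk ids in retrieved order)
ranked = [f"chunk-{i}" for i in range(1, 21)]
results = [("chunk-14", ranked), ("chunk-2", ranked), ("chunk-99", ranked)]

print(rank_of("chunk-14", ranked))    # position of the gold chunk
print(hit_rate_at_k(results, k=5))    # share of queries answered from the top 5
print(hit_rate_at_k(results, k=20))   # share of queries answered from the top 20
```

A large gap between the hit rate at 20 and the hit rate at 5 suggests the retriever is finding the right chunks but ordering them poorly, which is the case a reranker can help with.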
The retrieve-rerank-generate pattern
               User query
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│  Stage 1: Retrieve                                │
│  Embed query → vector search / BM25               │
│  → top 50–100 candidate chunks (fast, recall)     │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│  Stage 2: Rerank                                  │
│  Score each (query, chunk) pair with cross-encoder│
│  Sort by score → keep top 5–10 (slow, precise)    │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│  Stage 3: Generate                                │
│  Feed top-k reranked chunks + query to LLM        │
│  → grounded answer                                │
└───────────────────────────────────────────────────┘
The retriever handles scale (millions of documents at millisecond speed). The reranker handles quality (precise ordering of a few dozen candidates). The LLM handles synthesis. Each stage does only what it's good at.
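As a rough sketch of how the three stages compose, the snippet below wires them together as plain callables so each one can be swapped independently. Everything here is illustrative: answer_query and the toy_* functions are hypothetical names, and the toy functions are word-overlap stand-ins, not a real vector store, cross-encoder, or LLM.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]
Reranker = Callable[[str, list[str], int], list[str]]
Generator = Callable[[str, list[str]], str]


def answer_query(
    query: str,
    retrieve: Retriever,
    rerank: Reranker,
    generate: Generator,
    retrieval_k: int = 50,
    rerank_top_n: int = 5,
) -> str:
    """Run retrieve -> rerank -> generate and return the final answer."""
    candidates = retrieve(query, retrieval_k)          # wide and fast
    best = rerank(query, candidates, rerank_top_n)     # narrow and precise
    return generate(query, best)                       # synthesis


# --- Toy stand-ins so the sketch runs without any external service ---
CORPUS = [
    "Rerankers score each query and chunk pair jointly.",
    "Vector search returns the nearest embeddings.",
    "Bananas are rich in potassium.",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    # Stand-in for vector search: returns the first k chunks in stored order.
    return CORPUS[:k]


def toy_rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    # Stand-in for a cross-encoder: ranks by number of shared lowercase words.
    query_words = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:top_n]


def toy_generate(query: str, chunks: list[str]) -> str:
    # Stand-in for an LLM call: echoes the context it was given.
    return f"Q: {query} | context: {' '.join(chunks)}"


answer = answer_query(
    "how does vector search work",
    toy_retrieve,
    toy_rerank,
    toy_generate,
    retrieval_k=3,
    rerank_top_n=1,
)
print(answer)
```

Keeping the stages behind simple function signatures means you can replace the toy reranker with a local model or a hosted API without touching the rest of the pipeline.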
Code walkthrough
Here's a Python example using a local bge-reranker. The pipeline lines at the end assume you already have a vector store, an LLM client, and a prompt builder. In production you can swap rerank for a call to Cohere, Jina, or Voyage if you prefer hosted APIs.
With a local cross-encoder (sentence-transformers)
from sentence_transformers import CrossEncoder

# Load once at startup and reuse across requests
reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Return top_n candidates reranked by relevance to query."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score alone so ties never fall back to comparing chunk text
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# --- In your RAG pipeline ---
# vector_db, llm, and build_prompt are placeholders for your own components
raw_chunks = vector_db.search(query, top_k=50)           # retrieve wide
best_chunks = rerank(query, raw_chunks, top_n=5)         # rerank tight
answer = llm.complete(build_prompt(query, best_chunks))  # generate
With the Cohere hosted API
import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    result = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # Results come back sorted by relevance; map each index back to its chunk
    return [candidates[r.index] for r in result.results]
With Jina Reranker API
import requests

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": "Bearer YOUR_KEY"},
        json={
            "model": "jina-reranker-v2-base-multilingual",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=10,  # fail fast instead of hanging the request
    )
    resp.raise_for_status()  # surface HTTP errors before parsing the body
    indices = [r["index"] for r in resp.json()["results"]]
    return [candidates[i] for i in indices]
Choosing top-k values
You have two k values to tune: how many to retrieve and how many to keep after reranking.
| Parameter | Typical range | Notes |
|---|---|---|
| retrieval_k | 20–100 | More = better recall, slower reranker. 50 is a common default. Don't go below 20 or you may miss the right chunk entirely. |
| rerank_top_n | 3–10 | Fewer = cheaper prompt, but higher risk of excluding a useful chunk. Start at 5; tune based on your context window and answer quality. |
Rule of thumb: retrieve at least 5× what you plan to keep. If you want 5 final chunks, retrieve at least 25, and closer to 50 if latency allows. The reranker can only fix order; it cannot conjure chunks that were never retrieved.
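The rule of thumb and the ranges in the table can be folded into one small helper. This is only a sketch: choose_retrieval_k is a hypothetical name, and the default multiplier, floor, and ceiling simply mirror the numbers above, so adjust them to your own corpus.

```python
def choose_retrieval_k(
    rerank_top_n: int,
    multiplier: int = 5,
    floor: int = 20,
    ceiling: int = 100,
) -> int:
    """Pick retrieval_k as multiplier x rerank_top_n, clamped to [floor, ceiling]."""
    if rerank_top_n < 1:
        raise ValueError("rerank_top_n must be at least 1")
    return max(floor, min(ceiling, rerank_top_n * multiplier))


for top_n in (3, 5, 10, 30):
    print(top_n, "->", choose_retrieval_k(top_n))
```

The floor keeps a very small rerank_top_n from starving the reranker of candidates, and the ceiling keeps a large one from blowing up reranking latency.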
Latency trade-offs
Reranking adds a model call to your pipeline. The cost depends on the approach:
| Approach | P50 latency (50 docs) | Cost |
|---|---|---|
| Cohere / Jina / Voyage API | 80–200 ms | Per-call pricing (~$0.0002–0.002 / 1k chunks) |
| bge-reranker on CPU (small) | 200–600 ms | Your infra cost; free per-call |
| bge-reranker on GPU | 15–60 ms | GPU cost; free per-call |
| Local tiny model (e.g. jina-reranker-v1-tiny-en) | 30–120 ms (CPU) | Your infra cost; free per-call |
For most RAG applications, an extra 100–300 ms of pipeline latency is fine and the quality gain is worth it. If your SLA is very tight, host the reranker on a GPU, use a tiny model, or cap retrieval_k at 20–30 instead of 50.
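The figures in the table are ballpark numbers, so it's worth timing your own setup before committing. Here is a minimal timing sketch, assuming your reranker has the same rerank(query, candidates, top_n) signature as the examples above; measure_latency_ms and fake_rerank are hypothetical names, and fake_rerank is a stand-in with no model behind it.

```python
import math
import statistics
import time
from typing import Callable


def measure_latency_ms(
    rerank_fn: Callable[[str, list[str], int], list[str]],
    query: str,
    candidates: list[str],
    top_n: int = 5,
    runs: int = 20,
    warmup: int = 2,
) -> dict[str, float]:
    """Time rerank_fn over several runs and return P50 and P95 in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        rerank_fn(query, candidates, top_n)  # warm caches and lazy model loading
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates, top_n)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)  # nearest-rank percentile
    return {"p50": statistics.median(samples), "p95": samples[p95_index]}


def fake_rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    # Stand-in used only to exercise the timer; swap in your real reranker.
    return sorted(candidates, key=len)[:top_n]


stats = measure_latency_ms(fake_rerank, "reset password", [f"chunk {i}" for i in range(50)])
print(f"p50={stats['p50']:.3f} ms  p95={stats['p95']:.3f} ms")
```

Run it with the same number of candidates and the same chunk lengths you expect in production, since cross-encoder cost grows with both.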
Cache aggressively: if the same query recurs (e.g. in a customer support bot), cache the reranked results by (query, corpus version) hash. The reranker becomes effectively free for repeat queries.
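One way to sketch that cache is a thin wrapper keyed on a hash of the corpus version and the query. RerankCache and its details are assumptions for illustration: it lowercases and collapses whitespace in the query before hashing, includes top_n in the key, keeps results in an unbounded in-process dict, and assumes the same query against the same corpus version yields the same candidates.

```python
import hashlib
from typing import Callable


class RerankCache:
    """Cache reranked results by (corpus version, top_n, normalized query)."""

    def __init__(
        self,
        rerank_fn: Callable[[str, list[str], int], list[str]],
        corpus_version: str,
    ) -> None:
        self._rerank_fn = rerank_fn
        self._corpus_version = corpus_version
        self._store: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, query: str, top_n: int) -> str:
        normalized = " ".join(query.lower().split())
        raw = f"{self._corpus_version}\x1f{top_n}\x1f{normalized}"
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def rerank(self, query: str, candidates: list[str], top_n: int = 5) -> list[str]:
        key = self._key(query, top_n)
        if key in self._store:
            self.hits += 1
            return list(self._store[key])  # copy so callers can't mutate the cache
        self.misses += 1
        result = self._rerank_fn(query, candidates, top_n)
        self._store[key] = list(result)
        return result


calls = []


def fake_rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    # Stand-in reranker that records each call it receives.
    calls.append(query)
    return candidates[:top_n]


cache = RerankCache(fake_rerank, corpus_version="v1")
first = cache.rerank("Reset my password", ["a", "b", "c"], top_n=2)
second = cache.rerank("reset  my password", ["a", "b", "c"], top_n=2)
print(first, second, cache.hits, cache.misses)
```

Bumping corpus_version whenever you re-index invalidates every old entry at once. In a multi-process deployment you would likely back this with a shared store and an eviction policy instead of a local dict.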
Which reranker to pick
The short version:
- Want to self-host for free: bge-reranker-v2-m3. Strong, widely deployed, and multilingual, so it also covers English-only corpora.
- Want a hosted API, best multilingual quality: Cohere Rerank v3.5.
- Want open weights + hosted API + tiny browser-runnable model: Jina Reranker v2.
- Optimising for retrieval-specific quality: Voyage Rerank 2.
See the full model comparison →
See reranking in action
Paste your own query and candidates. A cross-encoder scores them in your browser — zero API cost.
Open the live demo →