How to add reranking to your RAG pipeline

Practical · ~10 min read

The fix for mediocre RAG answers is rarely a bigger LLM; more often it is a better retrieval order. A reranker sits between your vector search and your LLM call, re-scores the candidates, and passes only the highest-scoring passages into the prompt.

The ordering problem in RAG

A typical RAG system embeds your documents and stores the vectors. At query time it fetches the k nearest vectors to the query embedding and stuffs those chunks into the LLM prompt. The problem: cosine similarity between independently computed embeddings is a coarse relevance signal. The correct chunk might be in the top 20 results but sit at position 14, outside the 5 you actually send to the model.

This is the "lost in the middle" problem in reverse: the right answer was never at the top in the first place. Reranking fixes it by applying a more expensive, more accurate relevance model to the shortlist the retriever already found.
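To make the ordering problem concrete, here is a toy sketch. The chunks and query are invented, and overlap_score is only a word-overlap stand-in for a real cross-encoder; it illustrates the mechanism and is not a recommended scoring method.

```python
def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)


query = "reset admin password"

# Hypothetical retriever output, already sorted by embedding similarity.
# The chunk that answers the query sits at position 14 (index 13).
retrieved = [f"general note about accounts number {i}" for i in range(1, 21)]
retrieved[13] = "to reset the admin password open settings"

# Without reranking, only the first 5 chunks reach the prompt.
top5_before = retrieved[:5]

# With reranking, every candidate is re-scored against the query first.
reranked = sorted(retrieved, key=lambda chunk: overlap_score(query, chunk), reverse=True)
top5_after = reranked[:5]

print(retrieved[13] in top5_before)  # False
print(retrieved[13] in top5_after)   # True
```

The retriever did find the right chunk; it just ranked it too low to survive the cut. Re-scoring the shortlist is what moves it into the prompt.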

The retrieve-rerank-generate pattern

User query
    │
    ▼
┌───────────────────────────────────────────────────┐
│  Stage 1 — Retrieve                               │
│  Embed query → vector search / BM25               │
│  → top 50–100 candidate chunks  (fast, recall)    │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│  Stage 2 — Rerank                                 │
│  Score each (query, chunk) pair with cross-encoder│
│  Sort by score → keep top 5–10  (slow, precise)   │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────┐
│  Stage 3 — Generate                               │
│  Feed top-k reranked chunks + query to LLM        │
│  → grounded answer                                │
└───────────────────────────────────────────────────┘

The retriever handles scale (millions of documents at millisecond speed). The reranker handles quality (precise ordering of a few dozen candidates). The LLM handles synthesis. Each stage does only what it's good at.
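One way to keep the three stages separate in code is to pass each one in as a plain function. The sketch below is an assumption about how you might wire this up: the names answer_query, Retriever, Reranker, and Generator are illustrative and do not come from any library.

```python
from typing import Callable

# Hypothetical type aliases for the three stages; the names are illustrative.
Retriever = Callable[[str, int], list[str]]            # (query, k) -> candidate chunks
Reranker = Callable[[str, list[str], int], list[str]]  # (query, candidates, top_n) -> best chunks
Generator = Callable[[str, list[str]], str]            # (query, chunks) -> answer


def answer_query(
    query: str,
    retrieve: Retriever,
    rerank: Reranker,
    generate: Generator,
    retrieval_k: int = 50,
    rerank_top_n: int = 5,
) -> str:
    """Run retrieve -> rerank -> generate and return the generated answer."""
    candidates = retrieve(query, retrieval_k)       # Stage 1: wide and fast
    best = rerank(query, candidates, rerank_top_n)  # Stage 2: narrow and precise
    return generate(query, best)                    # Stage 3: synthesis
```

Because each stage is just a callable, you can replace the reranker (local model, hosted API, or a no-op for A/B testing) without touching retrieval or generation.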

Code walkthrough

Here's a Python example using a local bge-reranker. In production you can swap the body of rerank for a call to Cohere, Jina, or Voyage if you prefer a hosted API; the hosted variants below keep the same function signature.

With a local cross-encoder (sentence-transformers)

from sentence_transformers import CrossEncoder

# Load once at startup — reuse across requests
reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Return top_n candidates reranked by relevance to query."""
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
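    # Sort on the score only, so ties never fall through to comparing the documents
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)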
    return [doc for _, doc in ranked[:top_n]]

# --- In your RAG pipeline ---
# vector_db, llm, and build_prompt are placeholders for your own retrieval and generation code
raw_chunks = vector_db.search(query, top_k=50)   # retrieve wide
best_chunks = rerank(query, raw_chunks, top_n=5)  # rerank tight
answer = llm.complete(build_prompt(query, best_chunks))  # generate

With the Cohere hosted API

import cohere

co = cohere.Client("YOUR_API_KEY")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    result = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [candidates[r.index] for r in result.results]

With Jina Reranker API

import requests

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
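    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": "Bearer YOUR_KEY"},
        json={
            "model": "jina-reranker-v2-base-multilingual",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=30,  # fail instead of hanging the request indefinitely
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    # Each result carries the index of the document in the original list
    indices = [r["index"] for r in resp.json()["results"]]
    return [candidates[i] for i in indices]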

Choosing top-k values

You have two k values to tune: how many to retrieve and how many to keep after reranking.

| Parameter | Typical range | Notes |
| --- | --- | --- |
| retrieval_k | 20–100 | More = better recall, slower reranker. 50 is a common default. Don't go below 20 or you may miss the right chunk entirely. |
| rerank_top_n | 3–10 | Fewer = cheaper prompt, but higher risk of excluding a useful chunk. Start at 5; tune based on your context window and answer quality. |

Rule of thumb: retrieve at least 5× what you plan to keep. If you want 5 final chunks, retrieve at least 25–50. The reranker can only fix order, not conjure chunks that weren't retrieved at all.
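The rule of thumb can be written as a small helper. This is a sketch: choose_retrieval_k is a hypothetical name, and the floor of 20 and ceiling of 100 are simply the typical range given above, not hard limits.

```python
def choose_retrieval_k(
    rerank_top_n: int,
    multiplier: int = 5,
    floor: int = 20,
    ceiling: int = 100,
) -> int:
    """Pick a retrieval_k that is at least multiplier x rerank_top_n, clamped to [floor, ceiling]."""
    if rerank_top_n < 1:
        raise ValueError("rerank_top_n must be at least 1")
    return max(floor, min(ceiling, rerank_top_n * multiplier))


print(choose_retrieval_k(5))   # 25
print(choose_retrieval_k(3))   # 20 (15 raised to the floor)
print(choose_retrieval_k(10))  # 50
```

Treat the result as a starting point and raise it if evaluation shows the right chunk is often missing from the candidates.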

Latency trade-offs

Reranking adds a model call to your pipeline. The cost depends on the approach:

| Approach | P50 latency (50 docs) | Cost |
| --- | --- | --- |
| Cohere / Jina / Voyage API | 80–200 ms | Per-call pricing (~$0.0002–0.002 / 1k chunks) |
| bge-reranker on CPU (small) | 200–600 ms | Your infra cost; free per-call |
| bge-reranker on GPU | 15–60 ms | GPU cost; free per-call |
| Local tiny model (e.g. jina-tiny) | 30–120 ms (CPU) | Free |

For most RAG applications, 100–300 ms total pipeline latency is fine and the quality gain is worth it. If your SLA is very tight, host the reranker on a GPU, use a tiny model, or cap retrieval_k at 20–30 instead of 50.
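Latency depends heavily on your hardware and chunk lengths, so it is worth measuring on your own data before relying on published numbers. Below is a minimal timing sketch; p50_latency_ms is a hypothetical helper, and it assumes your reranker has the rerank(query, candidates, top_n) signature used in this article.

```python
import statistics
import time
from typing import Callable


def p50_latency_ms(
    rerank_fn: Callable[[str, list[str], int], list[str]],
    query: str,
    candidates: list[str],
    top_n: int = 5,
    runs: int = 20,
) -> float:
    """Call rerank_fn `runs` times and return the median wall-clock latency in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates, top_n)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings_ms)
```

Run it with a realistic candidate count (for example 50 chunks of typical length) and compare the result against your latency budget before deciding between a hosted API, a GPU, or a smaller retrieval_k.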

Cache aggressively: if the same query recurs (e.g. in a customer support bot), cache the reranked results by (query, corpus version) hash. The reranker becomes effectively free for repeat queries.
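A minimal in-memory version of that cache might look like the sketch below. RerankCache is a hypothetical class, not part of any library. It assumes that the same query against the same corpus version always retrieves the same candidates, and it adds top_n to the key (an addition beyond the text above) so that differently sized requests do not share an entry.

```python
import hashlib
import json
from typing import Callable


class RerankCache:
    """In-memory cache of reranked results keyed by (query, corpus version, top_n)."""

    def __init__(self, rerank_fn: Callable[[str, list[str], int], list[str]]) -> None:
        self._rerank_fn = rerank_fn
        self._store: dict[str, list[str]] = {}

    @staticmethod
    def _key(query: str, corpus_version: str, top_n: int) -> str:
        # JSON-encode the parts so that different combinations cannot collide
        payload = json.dumps([query, corpus_version, top_n])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def rerank(
        self, query: str, candidates: list[str], corpus_version: str, top_n: int = 5
    ) -> list[str]:
        key = self._key(query, corpus_version, top_n)
        if key not in self._store:
            # Cache miss: pay for the reranker once and remember the result
            self._store[key] = self._rerank_fn(query, candidates, top_n)
        # Return a copy so callers cannot mutate the cached list
        return list(self._store[key])
```

Bump corpus_version whenever you re-index, so stale rankings are never served. For multiple processes you would likely replace the dict with a shared store, which is outside the scope of this sketch.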

Which reranker to pick

The short version:

See the full model comparison →

See reranking in action

Paste your own query and candidates. A cross-encoder scores them in your browser — zero API cost.

Open the live demo →
