
RAG Reranking

Add reranking with cross-encoders to improve RAG retrieval

Works with OpenClaude

You are an AI/ML engineer implementing retrieval-augmented generation (RAG) systems. The user wants to add reranking with cross-encoders to improve retrieval quality by rescoring initially retrieved documents.

What to check first

  • Verify you have a working retriever pipeline that returns initial candidate documents (check your vector store query returns at least 10-20 results)
  • Run pip list | grep sentence to confirm sentence-transformers is installed; if not, install with pip install sentence-transformers
  • Confirm your RAG pipeline has a retrieval step that returns documents with scores before the reranking step
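The dependency check above can also be done programmatically. A minimal sketch using only the standard library (no assumptions beyond the package name):

```python
import importlib.util

def has_sentence_transformers() -> bool:
    """Return True if the sentence-transformers package is importable."""
    return importlib.util.find_spec("sentence_transformers") is not None

if not has_sentence_transformers():
    print("Run: pip install sentence-transformers")
```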

Steps

  1. Import the CrossEncoder class from sentence_transformers and initialize with a cross-encoder model like cross-encoder/ms-marco-MiniLM-L-6-v2 (faster) or cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 (multilingual)
  2. Prepare your retrieval function to return top-k candidates (typically 20-50 documents) from your vector store before reranking
  3. Create a list of (query, document) pairs by pairing your search query with each retrieved document's text (cross-encoders expect the query first)
  4. Pass the query-document pairs to the CrossEncoder's predict method with convert_to_numpy=True to get reranking scores
  5. Sort the documents by cross-encoder scores in descending order and select the top reranked results (typically top 3-5)
  6. Replace the original document scores with cross-encoder scores in your final output
  7. Integrate the reranking step into your RAG pipeline between retrieval and LLM prompt generation
  8. Measure improvement using metrics like NDCG or MRR on your evaluation dataset before and after reranking

Code

from sentence_transformers import CrossEncoder
from typing import List, Tuple
import numpy as np

class RAGReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """Initialize cross-encoder for reranking."""
        self.reranker = CrossEncoder(model_name, max_length=512)
    
    def rerank(
        self,
        query: str,
        documents: List[dict],
        top_k: int = 5
    ) -> List[dict]:
        """
        Rerank documents using cross-encoder.
        
        Args:
            query: Search query string
            documents: List of dicts with 'content' and optional 'score', 'source' keys
            top_k: Number of top documents to return after reranking
            
        Returns:
            Reranked documents sorted by cross-encoder score
        """
        if not documents:
            return []
        
        # Prepare query-document pairs for cross-encoder
        doc_texts = [doc.get("content", "") for doc in documents]
        query_doc_pairs = [[query, doc_text] for doc_text in doc_texts]

        # Score each (query, document) pair with the cross-encoder
        scores = self.reranker.predict(query_doc_pairs, convert_to_numpy=True)

        # Replace original scores with cross-encoder scores and sort descending
        for doc, score in zip(documents, scores):
            doc["score"] = float(score)

        reranked = sorted(documents, key=lambda d: d["score"], reverse=True)
        return reranked[:top_k]
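Step 8's metrics need no extra dependencies. A minimal MRR (mean reciprocal rank) sketch — the document IDs and relevance sets below are hypothetical placeholders for your evaluation data:

```python
from typing import List, Set

def mean_reciprocal_rank(ranked_ids: List[List[str]], relevant: List[Set[str]]) -> float:
    """MRR: average over queries of 1/rank of the first relevant document."""
    total = 0.0
    for ids, rel in zip(ranked_ids, relevant):
        for rank, doc_id in enumerate(ids, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_ids) if ranked_ids else 0.0

# Compare the same queries before and after reranking:
before = mean_reciprocal_rank([["d3", "d1", "d2"]], [{"d1"}])  # first relevant at rank 2
after = mean_reciprocal_rank([["d1", "d3", "d2"]], [{"d1"}])   # first relevant at rank 1
print(before, after)  # 0.5 1.0
```

A reranker that helps should raise MRR on the post-rerank ordering; run both measurements on the same query set.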

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data — PII can end up in your logs without your realizing it
  • Treating the API as deterministic — same prompt, different output. Test on multiple runs
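The first pitfall can be handled with a small retry wrapper. A sketch of the generic shape only — `call` stands in for your actual API call, and `RateLimitError` is a placeholder for whatever 429 exception your SDK raises:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimitError(Exception):
    """Stand-in for your SDK's 429 error type."""

def with_backoff(call: Callable[[], T], max_retries: int = 5, base_delay: float = 1.0) -> T:
    """Retry `call` on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids thundering-herd retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")
```

In production, also honor a `Retry-After` header when the API provides one.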

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output — perceived latency drops by 70%
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
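The single-config advice from the pitfalls and the fallback advice above combine naturally. A hedged sketch — the model names are placeholders, not a recommendation:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """One place to define the model stack; swap models by editing this only."""
    primary: str = "claude-sonnet-4-5"  # placeholder names; use your deployed models
    fallback: str = "backup-model"
    timeout_s: float = 30.0

def pick_model(config: ModelConfig, primary_healthy: bool) -> str:
    """Return the model to call, falling back when the primary is unavailable."""
    return config.primary if primary_healthy else config.fallback
```

Every call site reads from `ModelConfig`, so a provider outage becomes a one-line change (or a health-check flag) instead of a code hunt.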

Quick Info

Difficulty: advanced
Version: 1.0.0
Author: Claude Skills Hub
Tags: rag, reranking, cross-encoders

Install command:

curl -o ~/.claude/skills/rag-reranking.md https://clskills.in/skills/ai-ml/rag-reranking.md
