RAG Hybrid Search

Combine dense and sparse retrieval for hybrid RAG search

Works with OpenClaude

You are an ML engineer implementing production RAG systems. The user wants to combine dense vector retrieval with sparse (BM25) retrieval to improve search quality and recall in retrieval-augmented generation pipelines.

What to check first

  • Verify a vector database client is installed: pip list | grep -iE "(pinecone|weaviate|milvus|qdrant)"
  • Check that you have a sparse search library (the pattern accepts both rank-bm25 and rank_bm25, since pip may list either spelling): pip list | grep -iE "(elasticsearch|whoosh|rank[-_]bm25)"
  • Confirm your embedding library imports cleanly: python -c "from sentence_transformers import SentenceTransformer"
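
The same checks can be run from Python instead of the shell. This is a minimal sketch: the import names listed are assumptions about which client packages you installed, and `check_dependencies` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Import names are assumptions; adjust them to the clients you actually use.
VECTOR_DB_CLIENTS = ["pinecone", "weaviate", "pymilvus", "qdrant_client"]
SPARSE_LIBRARIES = ["elasticsearch", "whoosh", "rank_bm25"]
EMBEDDING_LIBRARIES = ["sentence_transformers"]


def check_dependencies(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    groups = [
        ("vector database clients", VECTOR_DB_CLIENTS),
        ("sparse search libraries", SPARSE_LIBRARIES),
        ("embedding libraries", EMBEDDING_LIBRARIES),
    ]
    for label, names in groups:
        found = [name for name, ok in check_dependencies(names).items() if ok]
        print(f"{label}: {', '.join(found) if found else 'none found'}")
```

Only one entry per group needs to be present; the code in this skill itself uses rank_bm25 and sentence_transformers.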

Steps

  1. Install the dependencies for BM25 retrieval and embeddings: pip install rank-bm25 sentence-transformers numpy
  2. Load your embedding model (e.g., all-MiniLM-L6-v2) to generate dense vectors for documents
  3. Tokenize your documents and build a BM25 index over them for sparse retrieval
  4. Create a normalization function that scales dense similarity scores (cosine similarity lies in [-1, 1]) and sparse BM25 scores (unbounded) to a comparable range such as 0-1
  5. Implement a hybrid ranking function that computes dense_score * dense_weight + sparse_score * sparse_weight on the normalized scores
  6. For each query, compute both the dense embedding similarity and the BM25 score of every document; the two computations are independent and can run in parallel
  7. Merge the two result sets by document ID, using the hybrid score as the final ranking metric
  8. Return the top-k documents sorted by hybrid score, highest first
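
Steps 4 and 5 can be illustrated on a handful of made-up scores. This sketch assumes min-max scaling as the normalization (the steps do not name a specific method) and uses hypothetical helper names `min_max` and `hybrid_scores`; the numbers are illustrative, not measured results.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(dense, sparse, dense_weight=0.6, sparse_weight=0.4):
    """Weighted sum of normalized dense and sparse scores, one per document."""
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse must score the same documents")
    d, s = min_max(dense), min_max(sparse)
    return [dense_weight * x + sparse_weight * y for x, y in zip(d, s)]


# Illustrative values: cosine similarities and raw BM25 scores for 3 documents.
dense = [0.82, 0.40, 0.61]
sparse = [1.5, 7.5, 0.0]
combined = hybrid_scores(dense, sparse)
print(combined)
```

With these inputs, document 0 ends up first with a combined score of about 0.68, ahead of document 1 (about 0.40), even though document 1 has the highest raw BM25 score. Without normalization, the BM25 values would dominate the sum simply because they are on a larger scale.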

Code

import numpy as np
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
from typing import List, Tuple, Dict

class HybridRAGSearch:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", 
                 dense_weight: float = 0.6, sparse_weight: float = 0.4):
        self.embedding_model = SentenceTransformer(model_name)
        self.dense_weight = dense_weight
        self.sparse_weight = sparse_weight
        self.bm25 = None
        self.documents = []
        self.dense_embeddings = None
        
    def index_documents(self, documents: List[str]):
        """Build both dense and sparse indices from documents."""
        self.documents = documents
        
        # Dense indexing: compute embeddings for all documents
        print("Computing dense embeddings...")
        self.dense_embeddings = self.embedding_model.encode(
            documents, convert_to_numpy=True
        )
        
        # Sparse indexing: tokenize and build BM25 corpus
        print("Building BM25 index...")
        tokenized_corpus = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized_corpus)
        
    def _normalize_scores(self, scores:

Note: this example was truncated in the source. See the GitHub repo for the latest full version.
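
The listing stops before the query-time logic, and the original continuation is not reproduced here. As a hedged sketch of what steps 4 through 8 could look like (not the author's code), the function below takes a query embedding, the document embeddings, and the BM25 scores for the same query, then returns the top-k document indices with their hybrid scores. It assumes cosine similarity for the dense side and min-max normalization for both sides; `hybrid_top_k` and `min_max` are hypothetical names. With the class above, the inputs would plausibly come from `self.embedding_model.encode(query, convert_to_numpy=True)`, `self.dense_embeddings`, and `self.bm25.get_scores(query.lower().split())`.

```python
from typing import List, Tuple

import numpy as np


def min_max(scores) -> np.ndarray:
    """Rescale to [0, 1]; constant or empty input maps to zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min() if scores.size else 0.0
    if span == 0.0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def hybrid_top_k(query_vec, doc_vecs, sparse_scores, k: int = 5,
                 dense_weight: float = 0.6,
                 sparse_weight: float = 0.4) -> List[Tuple[int, float]]:
    """Return (document index, hybrid score) pairs, best first."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Step 6: cosine similarity between the query and every document.
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    dense_scores = doc_vecs @ query_vec / np.where(norms == 0.0, 1.0, norms)

    # Steps 4-5: bring both score lists to [0, 1], then take the weighted sum.
    # Both arrays are aligned by document index, which serves as the document ID.
    combined = (dense_weight * min_max(dense_scores)
                + sparse_weight * min_max(sparse_scores))

    # Steps 7-8: sort by hybrid score, highest first, and keep the top k.
    order = np.argsort(-combined)[:k]
    return [(int(i), float(combined[i])) for i in order]
```

The returned indices refer to positions in the indexed document list, so `documents[i]` recovers the text for each hit.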

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses that contain sensitive data, which lets PII end up in your logs unnoticed
  • Treating the API as deterministic — same prompt, different output. Test on multiple runs
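
The first pitfall, retrying on rate limits, can be sketched without tying it to a specific SDK. The example assumes the API call is wrapped in a callable and that a 429 surfaces as an exception. `RateLimitError` is a placeholder class defined for this example, not an import from any client library, and `call_with_backoff` is a hypothetical helper.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for whatever your client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter keeps many workers from retrying in sync.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so tests can substitute a fake and run instantly; production code can leave it at the default.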

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output so users see the first tokens while the rest is still generating, which lowers perceived latency
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
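
The fallback bullet, together with the earlier advice to keep model names in one place, can be sketched as a single config object plus a thin wrapper. Everything here is illustrative: `ModelConfig`, `generate_with_fallback`, and the `call_model` callable are hypothetical, and the model identifiers in the usage lines are placeholders rather than real model names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelConfig:
    """Single source of truth for model names and the request timeout."""
    primary: str
    fallback: str
    timeout_seconds: float = 30.0


def generate_with_fallback(prompt: str, config: ModelConfig,
                           call_model: Callable[[str, str, float], str]) -> str:
    """Try the primary model; if it raises, retry once with the fallback."""
    try:
        return call_model(config.primary, prompt, config.timeout_seconds)
    except Exception:
        # Broad catch for brevity; real code should catch the client's
        # specific availability and timeout errors instead.
        return call_model(config.fallback, prompt, config.timeout_seconds)


# Usage with a stand-in client function (placeholder model names).
def fake_call(model: str, prompt: str, timeout: float) -> str:
    if model == "primary-model":
        raise TimeoutError("primary unavailable")
    return f"[{model}] answer to: {prompt}"


config = ModelConfig(primary="primary-model", fallback="backup-model")
answer = generate_with_fallback("What is hybrid search?", config, fake_call)
```

Swapping either model then means editing one `ModelConfig` instance rather than searching the codebase for hardcoded names.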

Quick Info

Difficulty: advanced
Version: 1.0.0
Author: Claude Skills Hub
Tags: rag, hybrid-search, retrieval

Install command:

curl -o ~/.claude/skills/rag-hybrid-search.md https://clskills.in/skills/ai-ml/rag-hybrid-search.md
