
RAG Pipeline


Build Retrieval-Augmented Generation pipeline


You are an AI/ML engineer building production-grade Retrieval-Augmented Generation (RAG) systems. The user wants to construct a complete RAG pipeline that retrieves relevant documents and augments language model responses with retrieved context.

What to check first

  • Verify you have a vector database installed: pip list | grep -E "pinecone|weaviate|qdrant|chromadb"
  • Confirm LLM access: Check API keys for OpenAI, Anthropic, or local model endpoints
  • Check embedding model availability: python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

Steps

  1. Initialize a vector store by choosing a backend (Pinecone, Weaviate, Qdrant, or ChromaDB) and create a client connection with authentication credentials
  2. Load and chunk your documents using a text splitter with overlap (e.g., RecursiveCharacterTextSplitter with chunk_size=1000, chunk_overlap=200)
  3. Generate embeddings for each chunk using a sentence transformer model (sentence-transformers/all-MiniLM-L6-v2 or OpenAI's text-embedding-3-small)
  4. Store embeddings and metadata in your vector database with the document text and source information
  5. Create a retriever that performs similarity search to find top-k relevant chunks given a query (set k=3-5 for context window efficiency)
  6. Build a prompt template that formats retrieved documents as context before the user's question
  7. Chain the retriever with an LLM using RetrievalQA or custom prompt + LLM call, ensuring retrieved documents populate the context
  8. Implement a re-ranking layer (optional but recommended) using cross-encoder models to reorder retrieved results by relevance
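Step 2's overlapping chunking is worth understanding independently of any framework. Here is a minimal sketch of the idea behind it (`chunk_text` is a hypothetical helper written for illustration, not LangChain's `RecursiveCharacterTextSplitter`, which additionally splits on separator boundaries):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one, so sentences
    cut at a boundary still appear whole in at least one chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = chunk_size - chunk_overlap  # advance by this much per chunk
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # final chunk already covers the end of the text
    return chunks
```

With `chunk_size=1000, chunk_overlap=200`, a 2,500-character document yields three chunks, and the last 200 characters of each chunk reappear at the start of the next.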

Code

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import Pinecone, Chroma
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI  # ChatOpenAI lives in chat_models, not llms
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import pinecone
import os

# Load documents
loader = DirectoryLoader("./documents", glob="*.txt", loader_cls=TextLoader)
raw_documents = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
documents = text_splitter.split_documents(raw_documents)

# Initialize embeddings

Note: this example was truncated in the source. See the GitHub repo for the latest full version.
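While the framework code above is incomplete, the retrieval core of steps 5 and 6 (similarity search plus prompt assembly) can be sketched without any framework at all. This is an illustrative toy, not the LangChain implementation: `retrieve` and `build_prompt` are hypothetical names, and a real system would use a vector database's indexed search rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """index is a list of (embedding, text) pairs.
    Return the k texts whose embeddings are most similar to the query."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(question, contexts):
    """Format retrieved chunks as context ahead of the user's question (step 6)."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting prompt string is what gets passed to the LLM in step 7; vector stores like Chroma or Pinecone replace the `retrieve` scan with an approximate nearest-neighbor index.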

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data — PII can end up in your logs without your realizing it
  • Treating the API as deterministic — same prompt, different output. Test on multiple runs
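The first pitfall, handling 429s with exponential backoff, looks roughly like this. It's a minimal sketch: `RateLimitError` stands in for whatever exception your SDK raises (e.g. the provider client's rate-limit error), and `call_with_backoff` is a hypothetical wrapper.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 exception."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying on rate limits with exponential backoff plus jitter.
    Delay doubles each attempt: base, 2*base, 4*base, ..."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Jitter spreads retries out so many workers hitting the same limit don't all retry in lockstep.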

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output — perceived latency drops by 70%
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
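The single-config and fallback points above combine naturally: keep model names in one place and try the fallback when the primary fails. A minimal sketch, with illustrative model names and a hypothetical `generate` wrapper (a real version would catch the SDK's specific exceptions rather than bare `Exception`):

```python
# Single source of truth for model selection (names are illustrative).
MODEL_CONFIG = {
    "primary": "claude-sonnet-4",
    "fallback": "gpt-4o-mini",
    "timeout_s": 30,
}

def generate(prompt, clients):
    """clients maps model name -> callable that takes a prompt.
    Try the primary model; on any failure, fall back. Swapping models
    is a one-line config change, not a code change."""
    for name in (MODEL_CONFIG["primary"], MODEL_CONFIG["fallback"]):
        try:
            return clients[name](prompt)
        except Exception:
            continue  # sketch only; catch the SDK's error types in production
    raise RuntimeError("all configured models failed")
```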

Quick Info

  • Difficulty: advanced
  • Version: 1.0.0
  • Author: Claude Skills Hub
  • Tags: ai, rag, retrieval

Install command:

curl -o ~/.claude/skills/rag-pipeline.md https://claude-skills-hub.vercel.app/skills/ai-ml/rag-pipeline.md
