
RAG Chunking


Implement smart document chunking strategies for RAG

Works with OpenClaude

You are a machine learning engineer specializing in retrieval-augmented generation (RAG) systems. The user wants to implement smart document chunking strategies that balance semantic coherence, token limits, and retrieval performance.

What to check first

  • Verify the tokenizer and token limit for your model (e.g., tiktoken.encoding_for_model("gpt-3.5-turbo") returns the matching tokenizer)
  • Check your embedding model's max sequence length (typically 512-8192 tokens depending on model)
  • Confirm your target chunk overlap percentage and desired chunk size in tokens, not characters
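A quick way to see why the last check matters: character counts and token counts diverge. The helper below is a hypothetical illustration using the rough rule of thumb of ~4 characters per token for English text; use the model's real tokenizer for any hard limit.

```python
# Rough heuristic only: ~4 characters per token for English text.
# Always use the model's real tokenizer (tiktoken) to enforce limits.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

chunk = "word " * 200                      # 1,000 characters
print(len(chunk), approx_tokens(chunk))    # 1000 characters, ~250 tokens
```

A "1,000-character chunk" is therefore roughly a 250-token chunk, which is why chunk size and overlap should be specified in tokens.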

Steps

  1. Choose a chunking strategy: recursive character splitting for general text, semantic splitting for domain documents, or sliding window for time-series
  2. Install required dependencies: pip install langchain tiktoken openai for token counting and semantic chunking
  3. Initialize a tokenizer matching your embedding model: encoding = tiktoken.encoding_for_model("text-embedding-3-small")
  4. Calculate actual token count for each chunk using len(encoding.encode(chunk)) instead of character length
  5. Implement recursive splitting that starts with larger separators (paragraphs "\n\n", lines "\n", sentences ". ", words " ") to preserve semantic boundaries
  6. Apply overlap strategy: use 20-30% token overlap to maintain context across chunk boundaries
  7. Validate that chunks don't exceed the embedding model's limit and contain enough context (at least 50-100 tokens)
  8. Test chunk quality by checking retrieval precision: chunks should answer single, complete questions
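Step 6's overlap can be sketched at the token level. `sliding_window` is a hypothetical helper, not part of any library; it operates on an already-tokenized list and assumes `size > overlap`:

```python
from typing import List

def sliding_window(tokens: List[int], size: int, overlap: int) -> List[List[int]]:
    """Fixed-size windows where neighboring windows share `overlap` tokens."""
    step = size - overlap  # assumes size > overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

windows = sliding_window(list(range(14)), size=10, overlap=2)
# windows == [[0..9], [8..13]]: 2 of 10 tokens (20%) shared across the boundary
```

The same carry-forward idea appears in token terms inside the merge step of the chunker below.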

Code

import tiktoken
from typing import List

class SmartChunker:
    def __init__(self, model_name: str = "text-embedding-3-small", 
                 target_chunk_tokens: int = 512, 
                 overlap_ratio: float = 0.2,
                 min_chunk_tokens: int = 50):
        self.encoding = tiktoken.encoding_for_model(model_name)
        self.target_chunk_tokens = target_chunk_tokens
        self.overlap_tokens = int(target_chunk_tokens * overlap_ratio)
        self.min_chunk_tokens = min_chunk_tokens
        self.separators = ["\n\n", "\n", ". ", " ", ""]
    
    def count_tokens(self, text: str) -> int:
        """Count tokens in text using model's tokenizer."""
        return len(self.encoding.encode(text))
    
    def recursive_split(self, text: str, separators: List[str] = None) -> List[str]:
        """Recursively split text using progressively smaller separators."""
        if separators is None:
            separators = self.separators

        # Use the first separator that appears in the text; the empty
        # string is the character-level split of last resort.
        separator, rest = separators[-1], []
        for i, sep in enumerate(separators):
            if sep == "" or sep in text:
                separator, rest = sep, separators[i + 1:]
                break

        pieces = text.split(separator) if separator else list(text)

        splits: List[str] = []
        for piece in pieces:
            if not piece.strip():
                continue
            if self.count_tokens(piece) <= self.target_chunk_tokens:
                splits.append(piece)
            elif rest:
                # Still too large: recurse with the finer separators.
                splits.extend(self.recursive_split(piece, rest))
            else:
                splits.append(piece)
        return self._merge_splits(splits, separator)

    def _merge_splits(self, splits: List[str], separator: str) -> List[str]:
        """Greedily merge splits toward the target size with token overlap."""
        chunks: List[str] = []
        current: List[str] = []
        current_tokens = 0
        for piece in splits:
            piece_tokens = self.count_tokens(piece)
            if current and current_tokens + piece_tokens > self.target_chunk_tokens:
                chunks.append(separator.join(current))
                # Retain a tail of pieces as overlap for the next chunk.
                while current and current_tokens > self.overlap_tokens:
                    current_tokens -= self.count_tokens(current.pop(0))
            current.append(piece)
            current_tokens += piece_tokens
        if current:
            tail = separator.join(current)
            if self.count_tokens(tail) >= self.min_chunk_tokens or not chunks:
                chunks.append(tail)
            else:
                # Fold an undersized tail into the previous chunk.
                chunks[-1] = chunks[-1] + separator + tail
        return chunks
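Step 7's validation can be expressed as a simple filter. `validate_chunks` is a hypothetical helper; the whitespace tokenizer in the example is only a stand-in so the sketch runs without tiktoken — in practice pass `SmartChunker.count_tokens`:

```python
from typing import Callable, List

def validate_chunks(chunks: List[str],
                    count_tokens: Callable[[str], int],
                    min_tokens: int = 50,
                    max_tokens: int = 512) -> List[str]:
    """Drop fragments below the minimum context size and anything
    that would exceed the embedding model's sequence limit."""
    return [c for c in chunks if min_tokens <= count_tokens(c) <= max_tokens]

# Whitespace count stands in for the real tokenizer in this sketch.
kept = validate_chunks(["a b c", "a"], lambda t: len(t.split()),
                       min_tokens=2, max_tokens=10)
# kept == ["a b c"]
```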

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data — PII can end up in your logs without your realizing it
  • Treating the API as deterministic — same prompt, different output. Test on multiple runs
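The first pitfall is usually solved with exponential backoff plus jitter. This is a generic sketch, not the Anthropic SDK's own retry API: `call` is any zero-argument function wrapping your request, and the bare `Exception` should be narrowed to your client's rate-limit/server-error types in real code.

```python
import random
import time

def with_backoff(call, max_retries: int = 5, cap_s: float = 30.0):
    """Retry `call` with exponential backoff and jitter.
    Narrow `Exception` to your SDK's RateLimitError in real code."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            # 1s, 2s, 4s, ... plus jitter, capped at cap_s
            time.sleep(min(2 ** attempt + random.random(), cap_s))
```

Jitter matters under load: without it, every failed worker retries at the same instant and re-triggers the rate limit.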

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill
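The retrieval-precision idea from step 8 can be smoke-tested without an embedding model. Keyword overlap below is a crude stand-in for vector similarity, and the chunk texts are made up for illustration:

```python
from typing import List

def top_chunk(question: str, chunks: List[str]) -> str:
    """Return the chunk with the highest naive keyword overlap
    (a stand-in for embedding similarity in this sketch)."""
    q = set(question.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().replace(".", "").split())))

chunks = ["Paris is the capital of France", "Tokyo is the capital of Japan"]
print(top_chunk("what is the capital of japan", chunks))  # Tokyo is the capital of Japan
```

If a test question's answer straddles two chunks, the chunk size or overlap is too small for that document.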

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output — tokens start arriving immediately, so perceived latency drops sharply
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
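The fallback bullet (and the "hardcoded in 50 places" pitfall) amount to routing every call through one config. The model identifiers below are placeholders, not recommendations:

```python
# Single source of truth for model settings. IDs are placeholders —
# substitute the actual primary and backup models you deploy.
MODEL_CONFIG = {
    "primary": "claude-sonnet-latest",   # placeholder id
    "fallback": "backup-model-id",       # placeholder id
    "timeout_s": 30,
    "daily_spend_cap_usd": 100,
}

def pick_model(primary_healthy: bool) -> str:
    """Route to the backup model when the primary is unavailable."""
    return MODEL_CONFIG["primary" if primary_healthy else "fallback"]
```

Swapping models then becomes a one-line config change instead of a codebase-wide search.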

Quick Info

Difficulty: advanced
Version: 1.0.0
Author: Claude Skills Hub
Tags: rag, chunking, ai

Install command:

curl -o ~/.claude/skills/rag-chunking.md https://clskills.in/skills/ai-ml/rag-chunking.md
