
RAG Multimodal


Build multimodal RAG with images, tables, and PDFs


You are an AI/ML engineer building production multimodal RAG systems that handle images, tables, and PDFs simultaneously. The user wants to construct a complete multimodal RAG pipeline that ingests diverse content types, chunks them intelligently, embeds them with vision-aware models, and retrieves relevant context across modalities.

What to check first

  • Verify you have pdf2image, pytesseract, and pillow installed: pip list | grep -E "(pdf2image|pytesseract|pillow)"
  • Confirm Tesseract OCR is system-installed: tesseract --version (or brew install tesseract on macOS)
  • Confirm access to an embedding model for each modality (e.g., Hugging Face's CLIP for images and text, or BAAI's Visualized BGE); note that OpenAI's text-embedding-3-large embeds text only
  • Verify vector database installation: pip list | grep -E "(chromadb|weaviate|qdrant)"
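The checks above can be scripted as a small preflight. This is a minimal sketch: the package list mirrors the checklist, and you should adjust it to whichever vector DB you actually chose.

```python
# Preflight check: verify the multimodal RAG dependencies are importable
# and that the system-level Tesseract binary is on PATH.
import importlib.util
import shutil

required = ["pdf2image", "pytesseract", "PIL", "chromadb"]
missing = [m for m in required if importlib.util.find_spec(m) is None]
print("missing python packages:", missing or "none")

# pytesseract only wraps the Tesseract CLI; OCR fails without the binary.
print("tesseract binary:", shutil.which("tesseract") or "NOT FOUND")
```

Run this before the pipeline so a missing OCR binary surfaces immediately instead of mid-ingestion.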

Steps

  1. Install multimodal dependencies: pip install pdf2image pytesseract pillow langchain chromadb openai transformers torch torchvision pillow-heif
  2. Extract text from PDFs using pdf2image + OCR for scanned content; extract tables using pytesseract with layout analysis or camelot-py
  3. Detect and isolate images from PDFs and documents using bounding box detection or PIL image operations
  4. Create separate chunk types: TextChunk, ImageChunk, TableChunk with metadata tags (source, page, type, bbox)
  5. Load a multimodal embedding model (e.g., sentence-transformers/clip-ViT-B-32, which embeds images and text into a shared vector space)
  6. Generate embeddings for text chunks directly; for images, encode raw image bytes or paths; for tables, encode both visual representation and OCR text as hybrid embeddings
  7. Store chunks in vector DB with modality metadata: {content, embedding, type, source, page_num, modality}
  8. Build retriever that searches across all modalities and re-ranks by relevance using a multimodal scorer
  9. Implement prompt template that handles mixed modality context (text snippets + base64 images + table markdown)
  10. Test end-to-end with a PDF containing text, tables, and images; verify correct chunk retrieval and LLM context window management
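The chunk types from step 4 can be sketched as dataclasses. The class and field names here (TextChunk, ImageChunk, TableChunk, bbox, and so on) are illustrative, not a fixed schema; the point is that every chunk carries the same metadata tags regardless of modality.

```python
# Minimal chunk schema for step 4: one base class holding the shared
# metadata, and one subclass per modality.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class BaseChunk:
    source: str                       # originating file path
    page: int                         # 1-based page number
    modality: str                     # "text" | "image" | "table"
    bbox: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1)

@dataclass
class TextChunk(BaseChunk):
    text: str = ""

@dataclass
class ImageChunk(BaseChunk):
    image_path: str = ""              # extracted image saved to disk

@dataclass
class TableChunk(BaseChunk):
    markdown: str = ""                # table rendered as markdown for the prompt
    ocr_text: str = ""                # raw OCR text, used for hybrid embedding

chunk = TextChunk(source="report.pdf", page=3, modality="text",
                  text="Q2 revenue grew 12%.")
```

Keeping `bbox` on the base class means image and table chunks can be traced back to their exact location on the page during debugging.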

Code

import os
import base64
from io import BytesIO
from pathlib import Path
from typing import List, Dict, Any
import chromadb
from chromadb.config import Settings
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document
import pypdf
from openai import OpenAI

Note: this example was truncated in the source. See the GitHub repo for the latest full version.
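Since the original example is truncated, here is a hedged sketch of steps 5-8: shaping chunks into records, bulk-inserting them into a Chroma collection, and querying with an optional modality filter. The `to_record` helper and the embedding inputs are illustrative; a real pipeline would produce the embeddings with a multimodal model such as SentenceTransformer("clip-ViT-B-32"), which encodes both PIL images and text into the same vector space.

```python
def to_record(chunk_id, embedding, content, modality, source, page):
    """Shape one chunk into the record layout from step 7."""
    return {
        "id": chunk_id,
        "embedding": embedding,
        "document": content,
        "metadata": {"modality": modality, "source": source, "page_num": page},
    }

def store_chunks(collection, records):
    """Bulk-insert records into a Chroma collection (step 7)."""
    collection.add(
        ids=[r["id"] for r in records],
        embeddings=[r["embedding"] for r in records],
        documents=[r["document"] for r in records],
        metadatas=[r["metadata"] for r in records],
    )

def retrieve(collection, query_embedding, k=5, modality=None):
    """Cross-modality search (step 8); optionally filter to one modality."""
    where = {"modality": modality} if modality else None
    return collection.query(query_embeddings=[query_embedding],
                            n_results=k, where=where)
```

Because images and text share one embedding space under CLIP, a single text query embedding retrieves both text and image chunks; the `where` filter is only needed when you want one modality.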

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data — PII can end up in your logs without you realizing it
  • Treating the API as deterministic — the same prompt can yield different outputs, so test across multiple runs
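The first pitfall above can be handled with a small retry wrapper. This is a generic sketch: `call` stands in for any API invocation that raises an exception carrying a `status_code` attribute (as the Anthropic and OpenAI SDK errors do); names and defaults are illustrative.

```python
# Exponential backoff with jitter for 429 (rate limit) responses.
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            # Only retry rate limits, and only while retries remain.
            if status != 429 or attempt == max_retries - 1:
                raise
            # Sleep base * 2^attempt plus jitter, then retry.
            time.sleep(base_delay * (2 ** attempt) + random.random())
```

Jitter matters under load: without it, many workers that were rate-limited together all retry at the same instant and get limited again.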

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output — perceived latency drops by 70%
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
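The single-config and fallback advice above can be combined in one small pattern. This is a sketch under stated assumptions: the model names are placeholders, and `clients` is a hypothetical mapping from model name to a callable that performs the request.

```python
# All model names live in one dict (one place to swap models), and the
# caller tries each configured model in order, re-raising the last error.
MODEL_CONFIG = {
    "primary": "claude-primary",    # placeholder model id
    "fallback": "backup-model",     # placeholder model id
    "timeout_s": 30,
}

def complete_with_fallback(clients, prompt, config=MODEL_CONFIG):
    """Try the primary model, then the fallback, re-raising the last error."""
    last_exc = None
    for key in ("primary", "fallback"):
        model = config[key]
        try:
            return clients[model](prompt)  # clients: model name -> callable
        except Exception as exc:
            last_exc = exc
    raise last_exc
```

Because callers only ever read `MODEL_CONFIG`, swapping providers is a one-line config change rather than a hunt through the codebase.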

Quick Info

Difficulty: advanced
Version: 1.0.0
Author: Claude Skills Hub
Tags: rag, multimodal, ai

Install command:

curl -o ~/.claude/skills/rag-multimodal.md https://clskills.in/skills/ai-ml/rag-multimodal.md
