Fine-Tune Data Prep

Prepare data for model fine-tuning

Works with OpenClaude

You are a machine learning engineer preparing datasets for model fine-tuning. The user wants to prepare and validate data in the correct format for fine-tuning LLMs or other models.

What to check first

  • Run pip list | grep -E "datasets|transformers|jsonlines" to verify you have the required libraries installed
  • Check your raw data file format (CSV, JSON, JSONL, or plain text) and file size with ls -lh your_data.csv
  • Confirm the model's input/output requirements (token limits, format expectations) in the target model's documentation
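The first two checks can also be run from Python instead of the shell. The following is a minimal sketch using only the standard library; the helper names `check_packages` and `describe_file` are illustrative and not part of the skill.

```python
from importlib import metadata
from pathlib import Path

# Package names taken from the pip check above
REQUIRED = ("datasets", "transformers", "jsonlines")


def check_packages(names=REQUIRED):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def describe_file(path):
    """Report the format (guessed from the extension) and size of a raw data file."""
    p = Path(path)
    return {
        "format": p.suffix.lstrip(".").lower() or "unknown",
        "size_bytes": p.stat().st_size,
    }


if __name__ == "__main__":
    for name, version in check_packages().items():
        print(f"{name}: {version or 'MISSING'}")
```

Calling `describe_file("your_data.csv")` on an existing file gives the same information as `ls -lh`, in a form that a script can act on.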

Steps

  1. Load your raw data using the datasets library with load_dataset() and inspect a split with its .info attribute (a property, not a method) and .select(range(5))
  2. Create train/validation splits using .train_test_split(test_size=0.2, seed=42) to maintain reproducibility
  3. Define a preprocessing function that tokenizes text and formats input-output pairs as {"input_ids": [...], "attention_mask": [...], "labels": [...]}
  4. Apply preprocessing with .map() across the dataset, setting batched=True and batch_size=1000 for efficiency
  5. Handle token length constraints by truncating with truncation=True and max_length=2048, and pad with padding="max_length" (the older pad_to_max_length=True argument is deprecated)
  6. Remove rows where labels are all -100 (ignored tokens) using .filter() to eliminate invalid training examples
  7. Save the processed dataset in JSONL format using .to_json() with an output path such as "train.jsonl" (JSON Lines is its default output), or use .save_to_disk() to keep the Hugging Face Arrow format
  8. Validate the output by spot-checking 3-5 examples to confirm proper tokenization and label alignment
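Steps 6 and 8 can be sketched as two small functions. This assumes a causal-LM layout in which `labels` is a copy of `input_ids` with masked positions set to -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The names `has_trainable_labels` and `check_alignment` are illustrative.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def has_trainable_labels(example):
    """True if at least one label is not masked, so the row contributes to the loss."""
    return any(label != IGNORE_INDEX for label in example["labels"])


def check_alignment(example):
    """Return a list of problems found in one tokenized example (empty if none)."""
    ids = example["input_ids"]
    mask = example["attention_mask"]
    labels = example["labels"]

    if not (len(ids) == len(mask) == len(labels)):
        return ["length mismatch"]

    problems = []
    for pos, (token, label) in enumerate(zip(ids, labels)):
        # Under the stated assumption, an unmasked label must equal its input token
        if label != IGNORE_INDEX and label != token:
            problems.append(f"label differs from input at position {pos}")
            break
    if not has_trainable_labels(example):
        problems.append("no trainable labels")
    return problems
```

A predicate like this can be passed directly to the filter step, for example `tokenized_dataset.filter(has_trainable_labels)`, and `check_alignment` can be run on the handful of examples chosen for the spot check.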

Code

from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
import json

# Load raw data
dataset = load_dataset('csv', data_files={'train': 'train_data.csv'})
print(dataset['train'].info)  # .info is a property, not a method

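# Define preprocessing function
# The "-hf" repository holds the Transformers-format files (gated; requires access approval)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Llama tokenizers ship without a pad token; reuse EOS so padding does not raise
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token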

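def preprocess_function(examples):
    """Format prompt/completion pairs for causal-LM fine-tuning."""
    batch = {"input_ids": [], "attention_mask": [], "labels": []}

    for prompt, completion in zip(examples['prompt'], examples['completion']):
        # Tokenize the two parts separately so the prompt length is known
        prompt_ids = tokenizer(
            f"Input: {prompt}\nOutput: ",
            max_length=1024,
            truncation=True
        )['input_ids']
        completion_ids = tokenizer(
            completion,
            max_length=511,  # leave room for the EOS token
            truncation=True,
            add_special_tokens=False
        )['input_ids'] + [tokenizer.eos_token_id]

        # A causal LM needs labels with the same length as input_ids;
        # prompt positions are set to -100 so loss covers only the completion
        input_ids = prompt_ids + completion_ids
        batch['input_ids'].append(input_ids)
        batch['attention_mask'].append([1] * len(input_ids))
        batch['labels'].append([-100] * len(prompt_ids) + completion_ids)

    # No padding here; pad per batch at training time with a data collator
    return batch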

# Apply preprocessing
tokenized_dataset = dataset.map(

Note: this example was truncated in the source. See the GitHub repo for the latest full version.
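The listing stops before the save step. The sketch below is not the missing remainder of the original code. It is a library-free illustration of the JSONL output described in step 7, written with the standard library so the saved file can be read back and compared. The helper names and the file name in the usage note are placeholders.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def roundtrip_ok(records, path):
    """Save records, reload them, and confirm nothing changed on disk."""
    records = list(records)
    write_jsonl(records, path)
    return read_jsonl(path) == records
```

For example, `roundtrip_ok(rows, "train.jsonl")` returning `True` confirms that every row survived serialization unchanged before the file is handed to a training job.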

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data, which lets PII end up in your logs without you realizing it
  • Treating the API as deterministic when the same prompt can produce different outputs; test across multiple runs
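The rate-limit pitfall can be illustrated with a generic retry wrapper. This is a minimal sketch of exponential backoff with jitter: `RateLimitError` is a hypothetical stand-in for whatever exception your API client raises on a 429 response, and the delay values are arbitrary defaults rather than recommendations from any provider.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries; let the caller handle it
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so that the retry schedule can be tested without actually waiting.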

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill
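For the token-usage check, a short summary of per-example token counts makes it easy to compare against an estimate. This is a sketch under the assumption that you already have a list of token counts (for instance, the length of each example's `input_ids`). The function name and the 2048 limit are illustrative.

```python
import statistics


def summarize_lengths(token_counts, max_length=2048):
    """Summarize token counts and flag examples that exceed max_length."""
    counts = list(token_counts)
    if not counts:
        raise ValueError("token_counts is empty")
    return {
        "examples": len(counts),
        "total_tokens": sum(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "over_limit": sum(1 for c in counts if c > max_length),
    }
```

A non-zero `over_limit` means some examples will be truncated, and `total_tokens` is the figure to compare with the cost or compute estimate.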

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for the static parts of your prompts, which can substantially reduce input-token costs on cache hits
  • Stream responses for any user-facing output so users see text as it is generated, which lowers perceived latency
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
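The fallback advice is easiest to follow when model names and timeouts live in one config object. The sketch below assumes a caller-supplied `call_model(model, prompt, timeout)` function. That signature and both model names are placeholders, not a real client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    primary: str = "primary-model-name"    # placeholder, set to your main model
    fallback: str = "fallback-model-name"  # placeholder, set to your backup model
    timeout_seconds: float = 30.0


def complete_with_fallback(prompt, call_model, config=ModelConfig()):
    """Try the primary model; on any failure, retry once with the fallback."""
    try:
        return call_model(config.primary, prompt, config.timeout_seconds)
    except Exception:
        # Broad catch keeps the sketch short; real code should catch only
        # the client's outage and timeout errors
        return call_model(config.fallback, prompt, config.timeout_seconds)
```

Because every call reads from `ModelConfig`, swapping either model is a one-line change.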

Quick Info

Difficulty: advanced
Version: 1.0.0
Author: Claude Skills Hub
Tags: ai, fine-tuning, data

Install command:

curl -o ~/.claude/skills/fine-tune-data-prep.md https://claude-skills-hub.vercel.app/skills/ai-ml/fine-tune-data-prep.md
