
AI Rate Limiter

Implement rate limiting for AI API calls

You are a backend engineer implementing rate limiting for AI API calls. The user wants to build a robust system that controls request frequency, prevents cost overruns, and handles multiple rate limit strategies (token-based, time-window, adaptive).

What to check first

  • Confirm your AI service provider's rate limit headers (check their API docs for x-ratelimit-remaining, x-ratelimit-reset)
  • Run npm list to verify you have redis or node-cache installed for distributed state
  • Review your application's request logging to understand current API call patterns and peak usage
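The first check above can be turned into a small helper. A minimal sketch, assuming your provider uses the common x-ratelimit-remaining / x-ratelimit-reset header names (not all do, and the function name and return shape here are illustrative):

```javascript
// Parse rate-limit headers from a fetch() Response. Header names vary by
// provider; these two are common but must be confirmed against your API docs.
function parseRateLimitHeaders(headers) {
  const remaining = headers.get('x-ratelimit-remaining');
  const reset = headers.get('x-ratelimit-reset');
  return {
    remaining: remaining !== null ? Number(remaining) : null,
    // Often a Unix timestamp or a seconds-until-reset value; check your docs.
    resetAt: reset !== null ? Number(reset) : null,
  };
}
```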

Steps

  1. Install a rate limiting library: npm install bottleneck or use Redis for distributed rate limiting with npm install redis
  2. Choose your strategy: token bucket (smooth burst handling), sliding window (precise time-based), or leaky bucket (predictable output)
  3. Set hard limits per API key and soft limits per user—store these in environment variables or a database
  4. Implement cost tracking alongside rate limiting to correlate requests with actual token consumption from the AI API response
  5. Add exponential backoff with jitter when hitting rate limits—retry after base_delay * (2 ^ attempt) + random(0, jitter_ms)
  6. Create middleware that extracts the x-ratelimit-reset header from the AI API response and synchronizes your local limiter
  7. Log all rate limit events (hits, resets, violations) with timestamp and API key hash for debugging cost spikes
  8. Set up alerting when 80% of monthly quota is consumed to prevent surprise overages
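Step 5's formula can be sketched directly. The base delay, jitter, and cap below are illustrative defaults, and the 429 check assumes your HTTP client attaches a status property to thrown errors:

```javascript
// Exponential backoff with jitter: base_delay * (2 ^ attempt) + random jitter,
// capped so late retries don't wait forever.
function backoffDelay(attempt, baseDelayMs = 500, jitterMs = 250, maxDelayMs = 30000) {
  return Math.min(baseDelayMs * 2 ** attempt + Math.random() * jitterMs, maxDelayMs);
}

// Retry only on 429s; rethrow anything else immediately.
async function withRetries(fn, maxAttempts = 5, delayFn = backoffDelay) {
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (err?.status !== 429 || attempt === maxAttempts - 1) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayFn(attempt)));
    }
  }
}
```

The delayFn parameter keeps the wait strategy swappable, which also makes the retry loop easy to test without real sleeps.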

Code

const Bottleneck = require('bottleneck');
const Redis = require('redis');

class AIRateLimiter {
  constructor(options = {}) {
    this.limiters = new Map(); // Per-key limiters
    this.costTracker = new Map(); // Track token costs
    this.redis = options.redis || null;
    
    // Global limits
    this.globalConfig = {
      requestsPerMinute: options.rpm || 60,
      requestsPerDay: options.rpd || 10000,
      costLimitPerDay: options.costLimit || 50, // dollars
      costPerRequest: options.costPerRequest || 0.002,
    };

    this.alerts = {
      quotaWarning: 0.8,
      burstDetection: options.burstThreshold || 10,
    };
  }

  // Get or create a limiter for an API key
  getLimiter(apiKey) {
    if (!this.limiters.has(apiKey)) {
      const limiter = new Bottleneck({
        minTime: (60 * 1000) / this.globalConfig.requestsPerMinute,
        maxConcurrent: 1,
        reservoir: this.globalConfig.requestsPerMinute,
        reservoirRefreshAmount: this.globalConfig.requestsPerMinute,
        reservoirRefreshInterval: 60 * 1000, // refill the reservoir once per minute
      });
      this.limiters.set(apiKey, limiter);
    }
    return this.limiters.get(apiKey);
  }
}

Note: the cost-tracking and alerting methods are omitted here. See the GitHub repo for the latest full version.
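For readers who want to see what Bottleneck's reservoir options do under the hood, here is a dependency-free token-bucket sketch (the class and parameter names are illustrative, not part of Bottleneck's API):

```javascript
// Token bucket: up to `capacity` tokens, refilled continuously at a fixed
// rate. A request proceeds only if a whole token is available.
class TokenBucket {
  constructor(capacity, refillPerSecond) {
    this.capacity = capacity;
    this.tokens = capacity;
    this.refillPerSecond = refillPerSecond;
    this.lastRefill = Date.now();
  }

  tryRemove() {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```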

Common Pitfalls

  • Forgetting to handle rate limits — Anthropic returns 429 errors that need exponential backoff
  • Hardcoding the model name in 50 places — use a single config so you can swap models in one place
  • Not setting a timeout on API calls — a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data — PII can end up in your logs before you realize it
  • Treating the API as deterministic — the same prompt can produce different outputs, so test across multiple runs
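The timeout pitfall can be handled generically by racing any API call against a deadline. A sketch, not tied to a particular client library:

```javascript
// Race any promise against a deadline so a hanging AI API call cannot lock
// a worker indefinitely. The timer is always cleared, win or lose.
function withTimeout(promise, timeoutMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```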

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill
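The last check pairs naturally with the 80% quota alert from step 8. A minimal sketch, with an illustrative function name and return values:

```javascript
// Classify current spend against a daily limit; warn at 80% by default.
function quotaStatus(spentUsd, dailyLimitUsd, warnFraction = 0.8) {
  const fraction = spentUsd / dailyLimitUsd;
  if (fraction >= 1) return 'exceeded';
  if (fraction >= warnFraction) return 'warning';
  return 'ok';
}
```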

Production Considerations

  • Set a daily spend cap on your Anthropic console — prevents runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts — can cut costs by 50-90%
  • Stream responses for any user-facing output — users see tokens as they arrive, so perceived latency drops sharply
  • Have a fallback model ready — if Claude is down, you should be able to swap to a backup with one config change
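The fallback bullet (and the "hardcoding the model name" pitfall) both reduce to keeping model choice in one config object. The model names and environment variable names below are placeholders:

```javascript
// Single source of truth for model selection; swapping to a backup is a
// one-line (or one-env-var) change.
const modelConfig = {
  primary: process.env.AI_MODEL || 'primary-model-placeholder',
  fallback: process.env.AI_FALLBACK_MODEL || 'backup-model-placeholder',
};

function pickModel(primaryHealthy) {
  return primaryHealthy ? modelConfig.primary : modelConfig.fallback;
}
```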

Quick Info

Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: ai, rate-limiting, cost

Install command:

curl -o ~/.claude/skills/ai-rate-limiter.md https://claude-skills-hub.vercel.app/skills/ai-ml/ai-rate-limiter.md
