CLSkills
AI/ML Integration · intermediate

AI Streaming

Implement streaming AI responses

Works with OpenClaude

You are an AI/ML engineer implementing real-time streaming responses from language models. The user wants to handle server-sent events (SSE) or chunked HTTP responses from AI APIs to display token-by-token output without blocking the UI.

What to check first

  • Verify your AI provider supports streaming (check their API documentation for a stream=true parameter or a dedicated streaming endpoint)
  • Run node --version to confirm Node 18+ (which ships a global fetch), or run npm list to confirm an HTTP client such as axios is installed
  • Check your frontend framework's async handling (React hooks, Vue composition API, or vanilla JS event listeners)
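The runtime part of these checks can be automated. The following is a minimal sketch, assuming a JavaScript runtime that exposes its built-ins on globalThis; the name missingStreamingGlobals is hypothetical and not part of any library.

```javascript
// Built-ins the streaming approach in this skill relies on.
const REQUIRED_GLOBALS = ['fetch', 'ReadableStream', 'TextDecoder'];

// Returns the names of required globals that are missing from `env`.
// `env` defaults to the current runtime; pass a plain object to test the logic.
function missingStreamingGlobals(env = globalThis) {
  return REQUIRED_GLOBALS.filter((name) => typeof env[name] === 'undefined');
}

const missing = missingStreamingGlobals();
if (missing.length > 0) {
  console.warn(`Streaming prerequisites missing: ${missing.join(', ')}`);
}
```

An empty result means the runtime can stream with fetch directly; otherwise a polyfill or an HTTP client with its own streaming support is needed.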

Steps

  1. Identify your AI provider's streaming endpoint and required headers (e.g., OpenAI uses /chat/completions with stream: true)
  2. Set up a fetch POST request with proper authorization headers, and put stream: true in the JSON request body (it is a provider parameter, not a fetch option)
  3. Read the response body, which is a ReadableStream, chunk by chunk via response.body.getReader()
  4. Decode incoming chunks using TextDecoder since streaming data arrives as Uint8Array buffers
  5. Parse SSE-formatted data (lines starting with data:) and extract JSON payloads
  6. Extract individual token deltas from each chunk (e.g., delta.content in OpenAI API)
  7. Update UI state in real-time as tokens arrive, concatenating them into a growing response string
  8. Handle stream termination (look for [DONE] marker or finish_reason field) and clean up resources
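Steps 4 to 6 and the [DONE] check from step 8 can be isolated into small functions that are testable without a network. This is a sketch under the assumption of OpenAI-style payloads (choices[0].delta.content); createSSEParser and extractToken are hypothetical names, and other providers use different field names.

```javascript
// Creates an incremental SSE parser. `onData` receives the text after "data: "
// for every complete line; partial lines are buffered until the next chunk.
function createSSEParser(onData) {
  const decoder = new TextDecoder();
  let buffer = '';

  return {
    // Feed one Uint8Array chunk exactly as returned by reader.read().
    feed(chunk) {
      // { stream: true } keeps multi-byte characters that are split
      // across two chunks intact.
      buffer += decoder.decode(chunk, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // the last element may be an incomplete line

      for (const rawLine of lines) {
        const line = rawLine.replace(/\r$/, ''); // tolerate CRLF line endings
        if (line.startsWith('data: ')) onData(line.slice(6));
      }
    },
  };
}

// Pulls the token text out of one OpenAI-style chunk payload (assumed shape).
// Returns null for the [DONE] marker and '' when the chunk carries no text.
function extractToken(data) {
  if (data === '[DONE]') return null;
  return JSON.parse(data).choices?.[0]?.delta?.content ?? '';
}
```

Keeping the buffer inside the parser matters because network chunk boundaries do not align with SSE line boundaries: a single JSON payload can arrive split across two reads.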

Code

async function streamAIResponse(prompt, onChunk, onComplete, onError) {
  const apiKey = process.env.OPENAI_API_KEY;
  const decoder = new TextDecoder();
  let fullResponse = '';

  try {
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: 'gpt-4',
        messages: [{ role: 'user', content: prompt }],
        stream: true,
        temperature: 0.7,
      }),
    });

    if (!response.ok) {
      throw new Error(`API error: ${response.status} ${response.statusText}`);
    }

    const reader = response.body.getReader();
    let buffer = '';

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() || '';

      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = line.slice(6);
          
          if (data === '[DONE]') {
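            onComplete(fullResponse);
            return fullResponse;
          }

          // Reconstructed from here on: the source example was cut off
          // right after the [DONE] check.
          let token;
          try {
            token = JSON.parse(data).choices?.[0]?.delta?.content;
          } catch {
            continue; // Skip payloads that are not valid JSON.
          }

          if (token) {
            fullResponse += token;
            onChunk(token);
          }
        }
      }
    }

    // The stream ended without an explicit [DONE] marker.
    onComplete(fullResponse);
    return fullResponse;
  } catch (error) {
    onError(error);
  }
}

Note: the source example was truncated after the [DONE] check; the code from that point on is a reconstruction that follows the steps above. See the GitHub repo for the latest full version.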

Common Pitfalls

  • Forgetting to handle rate limits: providers such as OpenAI and Anthropic return HTTP 429 responses, which call for retries with exponential backoff
  • Hardcoding the model name in 50 places: use a single config so you can swap models in one place
  • Not setting a timeout on API calls: a hanging request can lock your worker indefinitely
  • Logging API responses with sensitive data: PII can end up in your logs without you realizing it
  • Treating the API as deterministic: the same prompt can produce different output, so test across multiple runs
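The rate-limit and timeout pitfalls can be handled together. This is a rough sketch rather than a definitive implementation: fetchWithRetry, backoffDelayMs, and the default values (500 ms base, 8 s cap, 3 retries, 30 s timeout) are assumptions chosen for illustration, and the request function and sleep helper are injected so the logic can be exercised without a network.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Calls `doFetch(signal)` and retries on HTTP 429 up to `maxRetries` times.
// The timeout covers only the wait for the response headers; it is cleared
// once a response arrives so that a long-running stream is not aborted.
async function fetchWithRetry(doFetch, {
  maxRetries = 3,
  timeoutMs = 30000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const response = await doFetch(controller.signal);
      // Return anything that is not a rate limit, or give up after maxRetries.
      if (response.status !== 429 || attempt >= maxRetries) return response;
    } finally {
      clearTimeout(timer);
    }
    await sleep(backoffDelayMs(attempt));
  }
}
```

A call would look like fetchWithRetry((signal) => fetch(url, { ...options, signal })), where url and options are your own request settings. Production code often adds random jitter to the delay and honors a Retry-After header when the provider sends one.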

When NOT to Use This Skill

  • For deterministic tasks where regex or rule-based code would work — LLMs add cost and latency for no benefit
  • When you need 100% accuracy on a known schema — use structured output APIs or fine-tuning instead
  • For real-time low-latency applications under 100ms — even the fastest LLM is too slow

How to Verify It Worked

  • Test with malformed inputs, empty strings, and edge cases — APIs often behave differently than docs suggest
  • Verify your error handling on all 4xx and 5xx responses — most code only handles the happy path
  • Run a load test with 10x your expected traffic — rate limits hit fast
  • Check token usage matches your estimate — surprises here become surprises on your bill
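Edge cases such as a JSON payload split across two network chunks are easier to verify against a mocked stream than against the live API. The sketch below assumes Node 18+ (global Response, ReadableStream, and TextEncoder) and OpenAI-style payloads; mockSSEResponse and collectTokens are hypothetical helper names.

```javascript
// Builds a Response whose body emits the given strings as separate chunks,
// mimicking how a streaming API delivers SSE data piece by piece.
function mockSSEResponse(chunks) {
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(encoder.encode(chunk));
      controller.close();
    },
  });
  return new Response(body, {
    status: 200,
    headers: { 'Content-Type': 'text/event-stream' },
  });
}

// Reads a streamed OpenAI-style response and returns the concatenated tokens.
async function collectTokens(response) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  let text = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) return text;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for the next read

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') return text;
      text += JSON.parse(data).choices?.[0]?.delta?.content ?? '';
    }
  }
}
```

Passing chunks that cut a data line in half, omit the [DONE] marker, or contain empty deltas exercises the paths that the happy-path API call rarely hits.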

Production Considerations

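  • Set a spend limit in your provider's console: it caps runaway costs from bugs or attacks
  • Use prompt caching for static parts of your prompts: where the provider supports it, cached input tokens are typically billed at roughly 50-90% less, depending on the provider
  • Stream responses for any user-facing output: the first tokens appear almost immediately, so perceived latency drops sharply even though total generation time is unchanged
  • Have a fallback model ready: if your primary model is down, you should be able to swap to a backup with one config change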
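The fallback idea pairs naturally with keeping model names in a single config. Here is a minimal sketch, assuming a caller-supplied callModel(modelId) function that performs the actual request; the model IDs are placeholders, and MODEL_CONFIG and callWithFallback are hypothetical names.

```javascript
// Single source of truth for model selection. The IDs below are placeholders,
// not real model names; replace them with your provider's model IDs.
const MODEL_CONFIG = {
  primary: 'primary-model-id',
  fallback: 'fallback-model-id',
};

// Tries the primary model first and falls back once if the call throws.
// `callModel(modelId)` is supplied by the caller and performs the request.
async function callWithFallback(callModel, config = MODEL_CONFIG) {
  try {
    return await callModel(config.primary);
  } catch (primaryError) {
    console.warn(`Primary model failed (${primaryError.message}); trying fallback`);
    return callModel(config.fallback);
  }
}
```

With streaming, this kind of fallback is only clean when the failure happens before any tokens have been shown to the user; a stream that breaks midway needs its own handling, such as discarding the partial output before retrying.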

Quick Info

Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: ai, streaming, real-time

Install command:

curl -o ~/.claude/skills/ai-streaming.md https://claude-skills-hub.vercel.app/skills/ai-ml/ai-streaming.md
