How to stream LLM responses in Next.js with the Vercel AI SDK

Stream LLM responses token-by-token in Next.js using AI SDK v6. Covers route handler, client hook, and the two Vercel timeout traps most tutorials skip.

Waiting four seconds for a blank textarea to suddenly fill with text is one of the worst UX patterns in AI apps. Streaming fixes it. You can ship a working Next.js chat with token-by-token output in about 50 lines using Vercel AI SDK v6 — and the tricky parts are not the code, they are the two Vercel function timeout limits that catch most tutorials off guard.

Who this is for

Next.js developers adding an LLM chat interface to an existing app, or starting a new one from scratch. You need basic TypeScript familiarity. You do not need prior experience with streams, ReadableStream, or EventSource.

Why use the SDK instead of rolling your own

The manual path — ReadableStream on the server, EventSource on the client — is not difficult, but it is tedious. You end up writing a framing protocol (line-delimited JSON or SSE), a parser on the client, reconnect logic, error signaling over the stream, and status management in React state. The AI SDK handles all of that. It also gives you provider portability: swapping OpenAI for Groq or Anthropic is one import change.
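
To make that tedium concrete, here is a minimal sketch of only the client-side parsing piece of a hand-rolled stream. It assumes plain SSE framing with LF line endings, where each event is one or more data: lines followed by a blank line. The name parseSseBuffer and its return shape are hypothetical, and this is not the protocol the AI SDK uses internally.

```typescript
// Hypothetical helper: split a text buffer into complete SSE events.
// Returns the decoded data payloads plus any trailing partial event,
// which the caller must prepend to the next network chunk.
function parseSseBuffer(buffer: string): { events: string[]; rest: string } {
  const events: string[] = [];
  // Events are separated by a blank line (assumes LF-only line endings).
  const blocks = buffer.split('\n\n');
  // The last block may be incomplete, so hold it back for the next call.
  const rest = blocks.pop() ?? '';

  for (const block of blocks) {
    const dataLines = block
      .split('\n')
      .filter(line => line.startsWith('data:'))
      // Drop the field name and the single optional space after the colon.
      .map(line => line.slice(5).replace(/^ /, ''));
    if (dataLines.length > 0) {
      // Multiple data lines within one event are joined with newlines.
      events.push(dataLines.join('\n'));
    }
  }
  return { events, rest };
}
```

Even this small piece has to carry partial events across chunk boundaries, and it still leaves out reconnects, error signaling, and React state.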

The v5/v6 API (released July 2025, current as of June 2026) broke backward compatibility with v4 in three specific ways that matter here:

v4                        | v5/v6
useChat from ai           | useChat from @ai-sdk/react
handleSubmit + input      | sendMessage({ text })
message.content (string)  | message.parts[].text (array)

Every tutorial indexed before mid-2025 uses v4 shapes. If you follow one of those in untyped or loosely typed code, nothing flags the mismatch at message.content, and the UI silently renders nothing.

Streaming setup

npm install ai @ai-sdk/openai @ai-sdk/react

Add your key to .env.local:

OPENAI_API_KEY=sk-...

That is the entire dependency surface. No separate SSE library, no extra webpack config.

Step 1: Route handler

Create app/api/chat/route.ts:

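import { openai } from '@ai-sdk/openai';
import { convertToModelMessages, streamText, type UIMessage } from 'ai';

// Route segment config: cap this function's execution time at 30 seconds.
export const maxDuration = 30;

export async function POST(req: Request) {
  // The client hook posts the full conversation as UIMessage[].
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: openai('gpt-4o-mini'),
    system: 'You are a helpful assistant.',
    // await is a no-op where the conversion is synchronous and needed where it returns a promise.
    messages: await convertToModelMessages(messages),
  });

  // Stream the result back in the format useChat expects.
  return result.toUIMessageStreamResponse();
}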

Three things worth understanding here.

UIMessage[] is the SDK’s wire format for messages — it carries parts, metadata, and role in a shape the client hook already knows how to produce and consume. You pass it in, you pass it to convertToModelMessages, done.

convertToModelMessages translates UIMessage[] into the SDK's provider-neutral ModelMessage[] format. Each provider package then maps that format to its own API shape (OpenAI chat messages, Anthropic messages, etc.), so switching providers does not require you to touch this call.

toUIMessageStreamResponse() wraps the stream in the SDK’s custom SSE protocol and sets the correct Content-Type and Cache-Control headers. The client hook on the other side knows how to parse this protocol — you do not write a parser.
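
To make the message conversion described above more concrete, here is a deliberately simplified sketch of the idea: keep the role, join the text parts, and drop UI-only fields. The Sketch* types and toSketchModelMessages are hypothetical stand-ins, not the SDK's definitions, and the real function also has to deal with non-text parts that this version simply ignores.

```typescript
// Simplified local stand-ins for the SDK types (assumptions, not the real definitions).
type SketchPart =
  | { type: 'text'; text: string }
  | { type: 'other' }; // placeholder for any non-text part

interface SketchUIMessage {
  id: string;
  role: 'system' | 'user' | 'assistant';
  parts: SketchPart[];
}

interface SketchModelMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Hypothetical converter: keep the role, concatenate text parts, drop the UI-only id.
function toSketchModelMessages(messages: SketchUIMessage[]): SketchModelMessage[] {
  return messages.map(message => ({
    role: message.role,
    content: message.parts
      .filter((part): part is { type: 'text'; text: string } => part.type === 'text')
      .map(part => part.text)
      .join(''),
  }));
}
```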

The maxDuration = 30 export is a Next.js route segment config that caps this function’s maximum execution time. Vercel’s fluid-compute default is 300 seconds — more on plan-specific ceilings in the “What can go wrong” section.

Step 2: Client component

Create or replace app/page.tsx:

'use client';
import { useChat } from '@ai-sdk/react';
import { DefaultChatTransport } from 'ai';
import { useState } from 'react';

export default function ChatPage() {
  const { messages, sendMessage, status } = useChat({
    transport: new DefaultChatTransport({ api: '/api/chat' }),
  });
  const [input, setInput] = useState('');

  return (
    <div>
      {messages.map(msg => (
        <div key={msg.id}>
          <strong>{msg.role === 'user' ? 'You' : 'AI'}:</strong>{' '}
          {msg.parts.map((p, i) =>
            p.type === 'text' ? <span key={i}>{p.text}</span> : null
          )}
        </div>
      ))}
      <form
        onSubmit={e => {
          e.preventDefault();
          if (!input.trim()) return; // ignore empty submissions
          sendMessage({ text: input });
          setInput('');
        }}
      >
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          disabled={status !== 'ready'}
        />
        <button type="submit" disabled={status !== 'ready'}>
          Send
        </button>
      </form>
    </div>
  );
}

The three v5/v6 differences from above each show up here.

Import path: useChat comes from @ai-sdk/react, not ai. If you use the wrong import, the hook exists (there’s a re-export stub) but sendMessage won’t be defined and you’ll get a runtime error.

sendMessage: replaces the v4 handleSubmit + input pattern. You call it with { text: input } and it handles the HTTP request, streaming parse, and state update. There is no separate controlled input to wire through the hook.

message.parts: each message carries an array of parts with a type discriminator. Text content is p.type === 'text' with the text in p.text. If you render message.content instead, AI messages come back undefined at runtime: in untyped or loosely typed code nothing warns you, and the blank output looks like a streaming glitch.
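
If you are migrating a codebase that still has v4-shaped messages in storage or in tests, a tolerant accessor can help during the transition. The sketch below assumes a loose message shape of my own (LooseMessage) rather than the SDK's types, and messageText is a hypothetical helper name.

```typescript
// Loose shape covering both generations of message (an assumption for this sketch).
interface LooseMessage {
  content?: string; // v4-style
  parts?: Array<{ type: string; text?: string }>; // v5/v6-style
}

// Hypothetical helper: prefer parts, fall back to content, never return undefined.
function messageText(message: LooseMessage): string {
  if (Array.isArray(message.parts)) {
    return message.parts
      // Only parts explicitly typed 'text' count as visible text.
      .filter(part => part.type === 'text' && typeof part.text === 'string')
      .map(part => part.text as string)
      .join('');
  }
  return typeof message.content === 'string' ? message.content : '';
}
```

Returning an empty string instead of undefined keeps a shape mismatch from rendering as a silently blank bubble with no clue in the data.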

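The status field cycles through 'ready', 'submitted', 'streaming', and back to 'ready'. Disabling the input during non-'ready' states prevents double-sends. A failed request sets status to 'error' instead, so with this check the form stays disabled after a failure unless you handle that state as well.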
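
One way to keep the status checks in a single place is a pair of small pure helpers, sketched below. SketchStatus, canSend, and statusLabel are hypothetical names; the sketch assumes the hook reports 'error' alongside the three states above, and the label strings are placeholders.

```typescript
// The three states described above, plus 'error' for failed requests.
type SketchStatus = 'ready' | 'submitted' | 'streaming' | 'error';

// Hypothetical helper: allow sending when idle or after a failure,
// so an error does not leave the form permanently disabled.
function canSend(status: SketchStatus): boolean {
  return status === 'ready' || status === 'error';
}

// Hypothetical helper: map each status to a short hint for the user.
function statusLabel(status: SketchStatus): string {
  switch (status) {
    case 'submitted':
      return 'Waiting for the first token...';
    case 'streaming':
      return 'Responding...';
    case 'error':
      return 'Something went wrong. Try again.';
    case 'ready':
      return '';
  }
}
```

In the component, disabled={!canSend(status)} would replace the inline comparison on both the input and the button.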

TTFT: Groq vs OpenAI

First-token latency is the number that determines whether streaming feels snappy or sluggish. The time from send to first visible character is called TTFT (time to first token).

Groq runs on custom LPU hardware optimized for inference throughput. In practice this tends to produce noticeably lower TTFT than the general-purpose GPU infrastructure that serves GPT-4o. The tradeoff is model capability: GPT-4o outperforms Llama 3.1 8B on reasoning and instruction-following. For a chat interface where perceived responsiveness matters more than reasoning depth, Groq is worth benchmarking against your own workload. TTFT varies by prompt length, model load, and region, so measure in your target environment rather than relying on published averages.
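
A rough way to take that measurement is to time any async stream of text chunks, as in the sketch below. measureTtft is a hypothetical helper; it assumes a runtime with a global performance object (modern Node.js or a browser), and the injectable clock exists only to make the function easy to test.

```typescript
// Measure time to the first non-empty chunk and total time for an async stream of text.
async function measureTtft(
  stream: AsyncIterable<string>,
  now: () => number = () => performance.now(),
): Promise<{ ttftMs: number | null; totalMs: number; chunks: number }> {
  const start = now();
  let ttftMs: number | null = null;
  let chunks = 0;

  for await (const chunk of stream) {
    // Record the first moment any visible text arrives.
    if (ttftMs === null && chunk.length > 0) {
      ttftMs = now() - start;
    }
    chunks += 1;
  }
  // ttftMs stays null if the stream never produced text.
  return { ttftMs, totalMs: now() - start, chunks };
}
```

On the server you could pass it the textStream of a streamText result. That captures provider latency only, not the network hop to the browser, so treat it as a lower bound on what users see.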

To switch, install @ai-sdk/groq, add GROQ_API_KEY to .env.local, and replace the provider import and model name. The route handler body stays identical:

import { groq } from '@ai-sdk/groq';
// ...
model: groq('llama-3.1-8b-instant'),

What can go wrong

Vercel Hobby timeout (300 seconds)

As of Vercel’s 2025 fluid-compute rollout, Hobby plan serverless functions both default to and max out at 300 seconds, up from the legacy 10-second default. That covers almost every LLM response. The maxDuration export in your route is respected up to that ceiling.

One caveat: fluid compute is enabled per-project in the Vercel dashboard, not automatically applied to all existing projects. Check Settings → Functions → Fluid Compute. Without it, the legacy limit applies — and the old fix (Edge runtime, which has its own 300-second limit) remains valid for those projects. For a pure API-forwarding route like this chat handler, Edge runtime works fine; it only breaks if you use Node.js built-ins (fs, crypto, http, etc.).
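
For projects still on the legacy limit, the Edge fallback mentioned above is a route segment config change. The fragment below is a sketch of the additions to app/api/chat/route.ts, assuming the handler body is unchanged and uses no Node.js built-ins.

```typescript
// app/api/chat/route.ts (additions only; the POST handler stays as written)

// Opt this route into the Edge runtime instead of the default Node.js runtime.
export const runtime = 'edge';

// Still request the duration you need; the platform applies its own ceiling.
export const maxDuration = 30;
```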

Vercel Pro timeout (800 seconds)

Pro accounts can set maxDuration up to 800 seconds with fluid compute enabled (as of the 2025 rollout). That covers any realistic LLM workload. If you hit a timeout on Pro, the bottleneck is almost certainly the model or upstream API, not Vercel.

Cloudflare Workers bundle size

If you’re deploying to Cloudflare Workers (an alternative to Vercel), watch your bundle size. The free tier caps at 3 MB compressed; the ai package adds roughly 300 KB min+gzip after tree-shaking, leaving room for app code but worth watching if you stack multiple heavy dependencies. If you hit the limit:

  • Upgrade to Cloudflare Workers Paid for the 10 MB limit, or
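  • Keep React client code out of the Workers bundle by importing only server functions such as streamText in the Worker entry. Before relying on a server-only subpath such as import { streamText } from 'ai/server', confirm that your installed SDK version exports it.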

The bundle error from wrangler appears at deploy time, not at runtime.

Conclusion

The full implementation is two files and about 50 lines. The AI SDK v6 handles the streaming protocol, the SSE framing, and the React state management. Your job is connecting streamText on the server to useChat on the client and avoiding the three v4 API shapes that no longer exist. Vercel Hobby’s 300-second fluid-compute limit is sufficient for most LLM chat. If fluid compute is not yet enabled on your project, enable it or fall back to the Edge runtime; if you need more than 300 seconds, Pro raises the ceiling to 800.
