SSE vs WebSockets: Choosing the Right Transport
Most LLM streaming use cases only require one-way communication from server to client: the server streams tokens as they are generated. Two transports handle this, with meaningfully different tradeoffs:
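- Server-Sent Events (SSE): HTTP-based, one-directional (server to client), with built-in reconnection when consumed through the browser's EventSource API, which all modern browsers support natively. Because it is ordinary HTTP, it passes through proxies and load balancers without protocol-upgrade support, although response buffering may need to be disabled. The right choice for the large majority of LLM streaming use cases.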
- WebSockets: Full-duplex, persistent bidirectional connection, lower overhead for high-frequency messages, requires special proxy configuration (sticky sessions or WebSocket-aware load balancer). The right choice when the client also needs to stream data to the server — voice audio, real-time collaborative editing, or tool approval workflows.
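To make the SSE side of this comparison concrete, it helps to see how little there is to the wire format: each event is a few `field: value` text lines terminated by a blank line. The sketch below is a minimal illustration, not a library API; `SseEvent`, `formatSseEvent`, and the `stream_error` event name are hypothetical names chosen for this example.

```typescript
// Fields defined by the SSE wire format; all but `data` are optional.
interface SseEvent {
  data: string;
  event?: string; // event type; clients see "message" when omitted
  id?: string;    // sent back as Last-Event-ID when EventSource reconnects
  retry?: number; // reconnection delay hint, in milliseconds
}

// Hypothetical helper: serialize one event into its wire representation.
function formatSseEvent({ data, event, id, retry }: SseEvent): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  if (retry !== undefined) lines.push(`retry: ${retry}`);
  // A payload containing newlines is sent as one `data:` line per line.
  for (const part of data.split("\n")) lines.push(`data: ${part}`);
  // A blank line terminates the event.
  return lines.join("\n") + "\n\n";
}

const tokenFrame = formatSseEvent({
  data: JSON.stringify({ token: "Hello" }),
  id: "1",
});
const errorFrame = formatSseEvent({
  event: "stream_error", // illustrative name for a typed error event
  data: JSON.stringify({ message: "upstream timeout" }),
});
```

The `id` and `retry` fields are what drive EventSource's built-in reconnection. A client that reads the stream with `fetch` instead does not get that behavior and has to implement its own retry logic.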
SSE Implementation: Server Side
An SSE endpoint is a long-lived HTTP response with Content-Type: text/event-stream. Each token from the LLM is written to the response as it arrives:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import asyncio
import json

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

# llm_client is assumed to be an async client, configured elsewhere, whose
# astream() yields OpenAI-style chunks.
async def token_stream(prompt: str):
    async for chunk in llm_client.astream(prompt):
        token = chunk.choices[0].delta.content or ""
        if token:
            # SSE format: "data: <payload>\n\n"
            yield f"data: {json.dumps({'token': token})}\n\n"
            await asyncio.sleep(0)  # yield control to event loop
    yield "data: [DONE]\n\n"

@app.post("/stream")
async def stream_response(request: PromptRequest):
    return StreamingResponse(
        token_stream(request.prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable Nginx buffering
        },
    )

SSE Implementation: Client Side
async function streamCompletion(prompt: string, onToken: (t: string) => void) {
  const response = await fetch("/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    // Happy path: assumes every read() ends on a line boundary
    for (const line of chunk.split("\n")) {
      if (!line.startsWith("data: ")) continue;
      const data = line.slice(6);
      if (data === "[DONE]") return;
      const { token } = JSON.parse(data);
      onToken(token); // update UI
    }
  }
}

The Production Edge Cases
The happy path implementation above works in development. These are the edge cases that appear under production load:
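- Proxy and load balancer buffering: Intermediaries can hold a response back before forwarding it. Nginx buffers proxied responses by default, and other intermediaries such as AWS ALB or Cloudflare may also buffer depending on their configuration. A buffered SSE response delivers all tokens at once at the end, which defeats the purpose of streaming. Set X-Accel-Buffering: no for Nginx, and configure your load balancer and CDN to pass through streaming responses.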
- Client disconnection mid-stream: The user closes the tab while the model is generating. Without handling this, the server continues generating and billing for tokens nobody will read. Detect client disconnection and cancel the upstream LLM request.
- Partial JSON in chunks: The streaming transport may split an event across multiple read() calls. Always buffer and split on newlines, not on read boundaries.
- Error mid-stream: The LLM API returns an error after streaming has started. You cannot change the HTTP status code once the response body has started. Communicate errors through the SSE event stream itself using a typed error event.
- Token rate limiting: Displaying tokens character-by-character at 100+ tokens/second can cause excessive DOM re-renders. Batch tokens into 50–100ms display intervals for smooth rendering without UI jank.
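Two of the client-side items in this list (events split across reads, and render batching) can be sketched together. The following is a minimal illustration under stated assumptions, not a definitive implementation: `createSseLineParser`, `createTokenBatcher`, and `StreamMessage` are hypothetical names, the server is assumed to send `\n` line endings with the `{ token }` / `{ error }` / `[DONE]` payloads used in this article, and the 50 ms default is taken from the range suggested above.

```typescript
type StreamMessage = { token?: string; error?: string };

// Hypothetical helper: reassembles complete "data:" lines across read() boundaries.
function createSseLineParser(onData: (payload: string) => void) {
  let buffer = "";
  return {
    feed(chunk: string): void {
      buffer += chunk;
      const lines = buffer.split("\n");
      // The last element is an incomplete line (or ""); keep it for the next feed.
      buffer = lines.pop() ?? "";
      for (const line of lines) {
        if (line.startsWith("data: ")) onData(line.slice(6));
      }
    },
  };
}

// Hypothetical helper: coalesces tokens and flushes them at most once per interval.
function createTokenBatcher(onFlush: (text: string) => void, intervalMs = 50) {
  let pending = "";
  let timer: ReturnType<typeof setTimeout> | undefined;
  const flush = (): void => {
    if (timer !== undefined) {
      clearTimeout(timer);
      timer = undefined;
    }
    if (pending) {
      const text = pending;
      pending = "";
      onFlush(text);
    }
  };
  return {
    push(token: string): void {
      pending += token;
      if (timer === undefined) timer = setTimeout(flush, intervalMs);
    },
    flush, // call when the stream ends so the tail is delivered immediately
  };
}

// Wiring: parsed payloads go to the batcher, and the batcher updates the UI.
const rendered: string[] = [];
const batcher = createTokenBatcher((text) => rendered.push(text));
const parser = createSseLineParser((payload) => {
  if (payload === "[DONE]") return batcher.flush();
  const message: StreamMessage = JSON.parse(payload);
  if (message.error) throw new Error(message.error);
  if (message.token) batcher.push(message.token);
});

// One event split across two reads, then a second event and the sentinel.
parser.feed('data: {"tok');
parser.feed('en": "Hel"}\n\ndata: {"token": "lo"}\n\n');
parser.feed("data: [DONE]\n\n");
```

In a fetch-based client, the decoded text from each `read()` would be passed to `parser.feed()` instead of being split directly, and the `onFlush` callback would perform the state update. Flushing on `[DONE]` avoids a visible delay on the final tokens.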
# Detect client disconnection and cancel LLM request
from starlette.requests import Request

@app.post("/stream")
async def stream_response(request: Request, body: PromptRequest):
    async def generate():
        try:
            async for chunk in llm_client.astream(body.prompt):
                # Stop consuming the upstream stream once the client is gone
                if await request.is_disconnected():
                    return
                token = chunk.choices[0].delta.content or ""
                if token:
                    yield f"data: {json.dumps({'token': token})}\n\n"
        except Exception as e:
            # Surface errors through the stream
            yield f"data: {json.dumps({'error': str(e)})}\n\n"
        # Sent after the try block rather than in `finally`: an async generator
        # must not yield while it is being closed or cancelled.
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

Handling Streaming in React
The simplest React pattern accumulates tokens into state. The key detail is that token updates must be appended, not replaced, and the component must handle the stream lifecycle correctly:
import { useEffect, useState } from "react";

function ChatMessage({ prompt }: { prompt: string }) {
  const [content, setContent] = useState("");
  const [streaming, setStreaming] = useState(true);
  useEffect(() => {
    let cancelled = false;
    // Start from a clean slate whenever the prompt changes
    setContent("");
    setStreaming(true);
    streamCompletion(prompt, (token) => {
      if (!cancelled) {
        setContent((prev) => prev + token);
      }
    })
      .catch((err) => console.error("Stream failed:", err))
      .finally(() => {
        if (!cancelled) setStreaming(false);
      });
    // Ignore late tokens after unmount or a prompt change; this flag does not
    // abort the underlying fetch
    return () => { cancelled = true; };
  }, [prompt]);
  return (
    <div>
      {content}
      {streaming && <span className="animate-pulse">▋</span>}
    </div>
  );
}

Streaming with Next.js App Router
Next.js App Router supports streaming natively via React Suspense. For LLM streaming specifically, a Route Handler with a ReadableStream response is the cleanest pattern:
// app/api/chat/route.ts
import { OpenAI } from "openai";

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const client = new OpenAI();
  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const token = chunk.choices[0]?.delta?.content ?? "";
        if (token) {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ token })}\n\n`)
          );
        }
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });
  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
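
The Route Handler above keeps reading from the model even if the browser goes away, and an upstream failure surfaces only as a broken stream. One possible refinement, sketched here under assumptions, is a `ReadableStream` that uses `pull()` and `cancel()` instead of `start()`. The helper name `toSseStream` and its `onCancel` parameter are hypothetical, and whether `cancel()` fires on a client disconnect depends on the runtime propagating it to the response stream.

```typescript
// Hypothetical helper: wraps any async token source in an SSE-formatted byte
// stream and stops pulling from the source when the consumer cancels.
function toSseStream(
  tokens: AsyncIterable<string>,
  onCancel?: () => void
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();
  const frame = (payload: string) => encoder.encode(`data: ${payload}\n\n`);
  let cancelled = false;

  return new ReadableStream<Uint8Array>({
    // pull() runs only when the consumer wants more data (backpressure).
    async pull(controller) {
      try {
        // Loop so that empty tokens are skipped without stalling the stream.
        for (;;) {
          const { done, value } = await iterator.next();
          if (cancelled) return;
          if (done) {
            controller.enqueue(frame("[DONE]"));
            controller.close();
            return;
          }
          if (value) {
            controller.enqueue(frame(JSON.stringify({ token: value })));
            return;
          }
        }
      } catch (err) {
        if (cancelled) return;
        // The status code has already been sent, so report the failure in-band.
        controller.enqueue(frame(JSON.stringify({ error: String(err) })));
        controller.enqueue(frame("[DONE]"));
        controller.close();
      }
    },
    // cancel() runs when the consumer abandons the stream.
    async cancel() {
      cancelled = true;
      onCancel?.();
      await iterator.return?.(); // lets the source run its cleanup (finally blocks)
    },
  });
}
```

In the handler, the `for await` loop would move into a small async generator that yields `chunk.choices[0]?.delta?.content ?? ""`, and the response body would become `toSseStream(thatGenerator)`. The `onCancel` callback is the place to abort the upstream request, using whatever abort mechanism the SDK in use provides.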