API Development for AI · Day 3 of 5 ~45 minutes

Day 3: Async Endpoints and Streaming AI Responses

Synchronous AI endpoints block while waiting for Claude to respond. Async endpoints handle many requests concurrently. Streaming lets users see output as it generates instead of waiting for the full response.

What You'll Build

Two versions of the same AI endpoint: a standard async endpoint and a streaming endpoint that sends tokens to the client as they arrive — exactly how ChatGPT works.

Section 1 · 10 min

Sync vs Async in FastAPI

When a synchronous endpoint calls Claude's API, the server thread waits — doing nothing — until Claude responds. With 100 concurrent users, that means 100 threads sitting idle waiting on Claude. It doesn't scale.

Async endpoints use Python's async/await to free the thread while waiting. The server handles other requests while Claude processes yours, then resumes your request when the response arrives.

async_example.py
from fastapi import FastAPI
import anthropic

app = FastAPI()

# Sync: blocks a thread while waiting for Claude
@app.post("/sync-summarize")
def sync_summarize(text: str):
    client = anthropic.Anthropic()
    # ... call Claude synchronously

# Async: frees the thread while waiting for Claude
@app.post("/async-summarize")
async def async_summarize(text: str):
    client = anthropic.AsyncAnthropic()
    # ... await Claude asynchronously
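To see why freeing the thread matters, here's a minimal simulation with no API calls at all: asyncio.sleep stands in for a slow Claude request (the function name and timings are made up for illustration). Five concurrent "requests" finish in roughly the time of one, because each waits while the others run.

```python
import asyncio
import time

async def fake_claude_call(prompt: str) -> str:
    # Stand-in for a slow API call: it waits instead of computing.
    await asyncio.sleep(0.2)
    return f"summary of: {prompt}"

async def main() -> None:
    start = time.perf_counter()
    # Five concurrent "requests": while one waits, the others run.
    results = await asyncio.gather(
        *(fake_claude_call(f"doc {i}") for i in range(5))
    )
    elapsed = time.perf_counter() - start
    print(f"{len(results)} results in {elapsed:.2f}s")  # ~0.2s, not ~1.0s

asyncio.run(main())
```

Run sequentially, the same five calls would take about a second; gathered concurrently, they take about as long as the slowest single call. That's the property async endpoints exploit.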
Section 2 · 15 min

Build an Async AI Endpoint

The Anthropic SDK has an AsyncAnthropic client for async code. The pattern is nearly identical to the sync version — just add async/await:

main.py
from fastapi import FastAPI
from pydantic import BaseModel
import anthropic

app = FastAPI()
client = anthropic.AsyncAnthropic()

class SummarizeReq(BaseModel):
    text: str

@app.post("/summarize")
async def summarize(req: SummarizeReq):
    msg = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": req.text}]
    )
    return {"summary": msg.content[0].text}
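In production you'll also want a timeout, so a hung Claude call doesn't hold the request open forever. Here's a stdlib-only sketch (the helper name and timings are invented; no FastAPI or Anthropic calls involved). In the endpoint above you would wrap the await of client.messages.create the same way and translate the timeout into an HTTP 504.

```python
import asyncio

async def with_timeout(awaitable, seconds: float):
    # Cancel the awaitable if it takes longer than `seconds`.
    # In a FastAPI endpoint, catch asyncio.TimeoutError and
    # raise HTTPException(status_code=504) instead of printing.
    return await asyncio.wait_for(awaitable, timeout=seconds)

async def fast_call() -> str:
    await asyncio.sleep(0.05)
    return "ok"

async def slow_call() -> str:
    await asyncio.sleep(10)
    return "too late"

async def demo() -> None:
    print(await with_timeout(fast_call(), 1.0))   # prints "ok"
    try:
        await with_timeout(slow_call(), 0.1)
    except asyncio.TimeoutError:
        print("timed out, would return 504")

asyncio.run(demo())
```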
Section 3 · 20 min

Streaming Responses

Streaming sends tokens to the client as they're generated instead of waiting for the full response. This is why chat interfaces like ChatGPT feel fast: text starts appearing immediately while the model is still generating.

streaming.py
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

def generate_stream(text: str):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": text}]
    ) as stream:
        for text_chunk in stream.text_stream:
            yield text_chunk

@app.post("/stream")
def stream_response(text: str):
    return StreamingResponse(
        generate_stream(text),
        media_type="text/plain"
    )

Test the stream endpoint with curl: curl -N -X POST "http://localhost:8000/stream?text=Explain+quantum+computing". Note that text goes in the query string here: FastAPI treats a bare str parameter as a query parameter even on a POST, so sending it as form data with -d would fail validation. The -N flag disables curl's output buffering, so you'll see tokens arrive one by one.

What You Learned Today

  • Why async endpoints handle concurrent requests better than synchronous ones
  • How the AsyncAnthropic client works: nearly identical syntax to the sync client, just add async/await
  • How StreamingResponse works for real-time token delivery
  • When to use streaming vs. non-streaming — user-facing chat UIs vs. background processing
Your Challenge

Go Further on Your Own

  • Build a streaming endpoint that returns Server-Sent Events (SSE) format instead of plain text — this is what React frontends expect for real-time AI responses
  • Add a timeout to your async endpoint: if Claude takes more than 30 seconds to respond, cancel the request and return a 504 error
  • Build a simple load test: use Python's asyncio to send 20 concurrent requests to your async endpoint and measure how it performs
Day 3 Complete

Nice work. Keep going.

Day 4 is ready when you are.
