In This Article
- The Claude Model Family: Opus 4, Sonnet 4, Haiku 4.5
- Getting Started: API Key and Python SDK Setup
- The Messages API: System Prompts and Turns
- Tool Use and Function Calling: Building AI Agents
- Vision: Analyzing Images with Claude
- Long Context: The 1 Million Token Window
- Streaming Responses
- Message Batches for Cost Reduction
- Rate Limits and Pricing Tiers
- Claude for Enterprise: AWS Bedrock and Google Vertex AI
- Building a Real App: Customer Support Bot Walkthrough
- Frequently Asked Questions
Key Takeaways
- Which Claude model should I use for my application in 2026? The right Claude model depends on your use case and budget. Claude Opus 4 delivers the highest reasoning quality for complex tasks like legal analysis, research synthesis, and multi-step agentic workflows; Sonnet 4 is the best all-around choice for most production apps; Haiku 4.5 covers high-volume, latency-sensitive work.
- How much does the Claude API cost in 2026? Anthropic prices Claude on a per-token basis with separate rates for input and output tokens.
- Can I use Claude on AWS or Google Cloud instead of the Anthropic API directly? Yes. Claude models are available through both Amazon Bedrock and Google Cloud Vertex AI, giving enterprise teams the option to run Claude inference within their existing cloud infrastructure, billing agreements, and access controls.
- What is tool use in the Claude API and how does it work? Tool use (also called function calling) is the mechanism that lets Claude interact with external systems — databases, APIs, calculators, code interpreters — and incorporate the results into its responses.
Anthropic's Claude has matured from a research project into a production-ready API powering thousands of applications — from customer support automation to medical research summarization to full agentic software development pipelines. In 2026, the Claude API offers capabilities that would have seemed implausible two years ago: a one-million-token context window, native vision, structured tool use, and sub-second response times on the Haiku tier.
This guide is written for developers who want to build with Claude seriously — not just run a "Hello, World" in a notebook. We will cover the model family, the core API patterns, and the more advanced features (tool use, vision, batching, streaming) that separate production applications from toy demos. We will also walk through building a real customer support bot from scratch.
Whether you are building your first LLM-powered feature or migrating from another provider, everything you need is here.
The Claude Model Family: Opus 4, Sonnet 4, Haiku 4.5
Claude's 2026 model family has three tiers: Opus 4 for maximum reasoning on complex multi-step tasks, Sonnet 4 as the production default balancing intelligence and cost, and Haiku 4.5 for high-volume low-latency workloads. Most teams start with Sonnet 4 and only upgrade specific tasks to Opus where quality gaps are measurable.
As of 2026, the Claude model family has three tiers. Each is built for a different point on the cost-speed-intelligence tradeoff curve. Choosing the wrong model is one of the most common mistakes developers make when building with Claude — it either blows up their cost structure or produces responses that are too slow or not smart enough for the task.
- Opus 4 (Maximum Intelligence): Best for complex reasoning, research synthesis, and multi-step agentic workflows where quality is the only metric that matters.
- Sonnet 4 (Best All-Around): The right choice for most production apps. Strong intelligence, fast inference, and reasonable cost — the default for customer-facing products.
- Haiku 4.5 (Speed and Scale): Designed for high-volume workloads where latency and cost dominate. Classification, lightweight summarization, preprocessing pipelines.
Claude Opus 4: When to Use It
Opus 4 is Anthropic's most capable model. It delivers the strongest performance on complex multi-step reasoning, nuanced writing, legal and financial document analysis, and tasks that require synthesizing contradictory information into a coherent answer. It is also the model best suited for "agentic" use cases — tasks where Claude needs to plan a multi-step workflow, use tools in sequence, and recover gracefully from errors along the way.
The tradeoff is cost. Opus 4 is significantly more expensive per token than Sonnet 4 or Haiku 4.5. For most interactive applications, the latency is also higher. Use Opus when the quality of the output has direct business value that justifies the premium — high-stakes analysis, document review, or any task where a subpar response would require human rework.
Claude Sonnet 4: The Production Default
Sonnet 4 is the model most teams should reach for first. It sits in the middle of the cost-intelligence curve and does so extremely well — the gap between Sonnet 4 and Opus 4 on everyday tasks is often imperceptible to end users, while the cost difference is substantial. Sonnet 4 is fast enough for real-time chat, smart enough for most coding and analysis tasks, and cheap enough to run at scale without budgetary anxiety.
For new projects, the recommended approach is to prototype with Sonnet 4, run your quality evaluation suite, and only upgrade to Opus 4 for the specific tasks where Sonnet 4 demonstrably falls short.
Claude Haiku 4.5: Speed at Scale
Haiku 4.5 is the fastest and most affordable model in the family. It is designed for workloads where you are making tens of thousands of API calls per day and latency below 500ms is a product requirement. Common use cases include classification and intent detection, extracting structured data from large document corpora, generating short responses in embedded UI components, and preprocessing inputs before routing to a more powerful model.
"The model selection decision is really a product decision in disguise. What does your user notice? What does your P&L notice? Those two questions almost always point you to Sonnet."
Getting Started: API Key and Python SDK Setup
To start using the Claude API, create an account at console.anthropic.com, generate an API key, install the official Python SDK with pip install anthropic, and set your key as the ANTHROPIC_API_KEY environment variable — never hardcode it. Your first working API call is four lines of Python.
Before you write a single line of code, you need an Anthropic API key. Create an account at console.anthropic.com, navigate to API Keys, and generate a key. Store it as an environment variable — never hard-code it in your source files.
# Install the Anthropic Python SDK
pip install anthropic
# Set your API key as an environment variable
export ANTHROPIC_API_KEY="sk-ant-..."
The official Anthropic Python SDK handles authentication, retries, and error handling out of the box. Once installed, your first API call is four lines:
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain gradient descent in two sentences."}
    ]
)
print(message.content[0].text)
SDK vs. Direct HTTP
You can call the Claude API directly over HTTP if you prefer, but the official SDK is the recommended approach for Python projects. It handles automatic retries on rate-limit errors, proper streaming buffer management, structured error types, and keeps up with API version changes. The Node.js SDK (@anthropic-ai/sdk) provides the same patterns for TypeScript and JavaScript projects.
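If you call the API over raw HTTP, or want retry behavior around a whole workflow rather than a single request, a simple backoff wrapper covers the same ground the SDK's built-in retries do. This is a generic sketch, not part of the SDK; the function name and parameters are illustrative:

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential delay: base, 2x base, 4x base... plus random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

With the Python SDK you would typically pass a narrower exception tuple, such as `retry_on=(anthropic.RateLimitError,)`, rather than retrying on everything.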
The Messages API: System Prompts and Turns
Every Claude API interaction goes through the Messages endpoint. A request requires three parameters: model, max_tokens, and a messages array of role/content objects. The optional system parameter defines Claude's persona and constraints — this is where you make Claude behave like your application rather than a generic assistant.
The Messages API is the core of Claude. Every interaction — whether a single question or a multi-turn conversation — goes through this endpoint. Understanding its structure is the foundation for everything else.
A Messages API request has three required parameters: model (which Claude variant to use), max_tokens (a hard ceiling on output length), and messages (the conversation history as an array of role/content objects). The optional system parameter sets Claude's persona, instructions, and constraints — this is where you define what your application is and how Claude should behave.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="""You are a senior Python engineer at a fintech startup.
You write clean, production-ready code with type hints and docstrings.
You never use deprecated libraries. You always explain your reasoning.""",
    messages=[
        {
            "role": "user",
            "content": "How should I structure async database calls in FastAPI?"
        },
        {
            "role": "assistant",
            "content": "Use SQLAlchemy's async session with asyncpg..."
        },
        {
            "role": "user",
            "content": "Show me the dependency injection pattern."
        }
    ]
)
print(response.content[0].text)
The conversation history is stateless — you pass the full context on every request. This means your application is responsible for storing and managing the message history. For production applications, store conversation turns in a database and reconstruct the messages array on each request. A sliding window strategy (keeping only the last N turns) helps manage costs for long conversations.
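A minimal version of that sliding window might look like the following. The helper name is ours, and it deliberately ignores the harder case of a tool-use exchange split across the cut:

```python
def sliding_window(history, max_messages=20):
    """Keep only the most recent messages, preserving a valid turn order."""
    trimmed = history[-max_messages:]
    # The messages array must begin with a user turn, so drop any
    # leading assistant messages left stranded by the cut
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Apply it when reconstructing the messages array from your database, before calling `client.messages.create`.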
System Prompt Best Practices
- Be specific about role and constraints. "You are a customer support agent for Acme Corp" outperforms "You are a helpful assistant" every time.
- Define what Claude should NOT do. Negative constraints are just as important as positive instructions.
- Include output format instructions. If you need JSON, say so explicitly and show an example schema.
- Keep the system prompt stable. It is cached across requests with the same prompt, which reduces latency and cost.
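To opt into that caching explicitly, the Anthropic API accepts `cache_control` markers on system content blocks. The tiny helper below (our own naming) structures the system prompt as a cacheable block; check the current prompt-caching documentation for the exact field names before relying on it:

```python
def cached_system(prompt_text):
    """Mark the system prompt as cacheable so repeated requests can reuse it."""
    return [{
        "type": "text",
        "text": prompt_text,
        "cache_control": {"type": "ephemeral"},
    }]
```

You would then pass `system=cached_system(SYSTEM_PROMPT)` instead of a plain string in `client.messages.create`.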
Tool Use and Function Calling: Building AI Agents
Tool use lets Claude call external functions — databases, REST APIs, calculators, search engines — and incorporate results into its responses. You pass tool definitions alongside your message; when Claude decides a tool helps, it returns a tool_use block with structured arguments. Your code runs the function, returns a tool_result, and Claude uses it in its final answer.
Tool use is the feature that elevates Claude from a chat interface to a genuine AI agent. It lets Claude interact with external systems — databases, REST APIs, calculators, code interpreters, search engines, or any custom function you define — and incorporate the results into its responses.
The mechanism is straightforward. You pass a list of tool definitions alongside your message. Each definition includes a name, a plain-English description Claude uses to decide when to invoke the tool, and a JSON Schema describing the tool's input parameters. When Claude determines a tool would help, it returns a tool_use content block instead of plain text. Your code executes the function, returns the result in a tool_result message, and Claude uses that result in its final answer.
import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Retrieves the current status of a customer order by order ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The unique order identifier, e.g. ORD-12345"
                }
            },
            "required": ["order_id"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{
        "role": "user",
        "content": "What's the status of my order ORD-98712?"
    }]
)

# Check if Claude wants to call a tool
if response.stop_reason == "tool_use":
    # Claude may emit a text block before the tool_use block, so search for it
    tool_block = next(b for b in response.content if b.type == "tool_use")
    order_id = tool_block.input["order_id"]

    # Execute the actual function (your own database lookup)
    result = lookup_order_in_database(order_id)

    # Return the result to Claude for the final response
    final_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the status of my order ORD-98712?"},
            {"role": "assistant", "content": response.content},
            {
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_block.id,
                    "content": json.dumps(result)
                }]
            }
        ]
    )
For more complex agents, Claude can call multiple tools in sequence — looking up data, running calculations, then querying a second system before composing a final answer. This multi-step tool use loop is the foundation of any serious Claude-based agent, from code execution environments to autonomous research assistants.
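That multi-step loop can be sketched generically. Everything here is illustrative: the `run_tool_loop` name, the `dispatch` dictionary mapping tool names to plain Python callables, and the `max_steps` guard are our own conventions, not part of the SDK.

```python
import json

def run_tool_loop(client, model, tools, dispatch, messages, max_steps=8):
    """Drive a multi-step tool-use conversation until Claude answers in text."""
    for _ in range(max_steps):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if response.stop_reason != "tool_use":
            return response  # Claude produced a final text answer

        # Record Claude's turn, then execute every tool it requested
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(dispatch[block.name](**block.input)),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})

    raise RuntimeError("Tool loop exceeded max_steps without a final answer")
```

The `max_steps` cap matters in production: an agent that keeps requesting tools should fail loudly rather than loop forever.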
Vision: Analyzing Images with Claude
Claude's vision capability accepts images directly in the messages array — passed as a URL or base64-encoded string — alongside text. Claude can describe images, extract data from charts, read printed and handwritten text, and reason about visual content across JPEG, PNG, GIF, and WebP formats on Opus 4, Sonnet 4, and Haiku 4.5.
Claude's vision capability lets you pass images directly in the messages array, alongside text. Claude can describe images, extract data from charts and tables, read printed and handwritten text, compare multiple images, and reason about visual content in the same way it reasons about text.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://example.com/chart.png"
                }
            },
            {
                "type": "text",
                "text": "Extract all data points from this bar chart as a JSON array."
            }
        ]
    }]
)
You can also pass images as base64-encoded strings, which is necessary for images that are not publicly accessible via URL. The supported formats are JPEG, PNG, GIF, and WebP. Claude Sonnet 4 and Opus 4 both support vision natively — Haiku 4.5 also supports images for lightweight classification and OCR workloads.
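For the base64 path, a small helper (our own, not part of the SDK) can turn a local file into an image content block of the shape the Messages API expects:

```python
import base64
import mimetypes

def image_block(path):
    """Build a base64 image content block from a local file."""
    media_type = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dict drops into the `content` list exactly where the URL-sourced image block sat in the example above.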
High-Value Vision Use Cases
- Document digitization: Extract structured data from scanned invoices, receipts, and forms
- Chart understanding: Parse data from screenshots of dashboards and reports
- UI analysis: Describe interface states for accessibility tools or automated testing
- Quality inspection: Flag visual defects in manufacturing images
- Medical imaging: Assist radiologists with preliminary scan descriptions (with appropriate disclaimers)
Long Context: The 1 Million Token Window
Claude Opus 4 and Sonnet 4 support a one million token context window — roughly 750,000 words or 10 full-length novels — enabling whole-document Q&A, cross-file code review, and multi-document synthesis without chunking or vector search. For retrieval tasks over the largest corpora, combining Claude's long context with a retrieval layer remains the most reliable production architecture.
Claude's context window in 2026 extends to one million tokens — the equivalent of roughly 750,000 words, or about ten full-length novels. This is not a marketing number. It is a genuinely transformative capability for a class of problems that was simply unsolvable with earlier context limits.
What does a million-token context window actually unlock? Consider these real production use cases: loading an entire legal contract corpus for compliance review; passing a full codebase for architectural analysis; ingesting a year of financial transcripts for competitive intelligence; or maintaining a complete, untruncated conversation history for a complex customer support case.
Long Context Patterns That Work
- Whole-document Q&A: Load a 300-page report and ask specific questions — no chunking or vector search required
- Cross-file code review: Pass an entire Python package and ask Claude to find architectural issues or security vulnerabilities
- Multi-document synthesis: Feed Claude 20 research papers and ask it to identify areas of consensus and disagreement
- Full conversation memory: Maintain complete interaction history for high-stakes support cases without losing context
One important caveat: while Claude can technically process one million tokens, performance on tasks requiring recall of information buried in the middle of very long contexts can degrade compared to information at the beginning or end. For mission-critical retrieval tasks over enormous corpora, a hybrid approach — combining Claude's long context with a retrieval layer — is still often the most reliable architecture.
Streaming Responses
Streaming sends partial response chunks to your client as Claude generates them — users see output begin appearing within milliseconds instead of waiting for the full response. Use the SDK's stream() context manager for automatic buffer management and cleanup; for web apps, forward the SSE stream directly to the browser rather than buffering server-side.
By default, the Messages API returns a complete response after Claude finishes generating. For interactive applications — chat interfaces, code editors, real-time assistants — waiting for the full response before showing anything to the user creates a poor experience. Streaming sends partial response chunks to your client as they are generated, so users see output begin appearing within milliseconds.
import anthropic

client = anthropic.Anthropic()

# Use the stream() context manager for automatic cleanup
with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Write a Python function to parse ISO 8601 dates."
    }]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Access the final complete message after streaming
final_message = stream.get_final_message()
The SDK's stream() context manager is the recommended approach — it handles event parsing, error recovery, and cleanup automatically. For web applications serving streaming to browser clients, you will typically want to forward the SSE stream directly to your frontend via a server-side endpoint, rather than buffering the full response server-side and then sending it.
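One lightweight way to do that forwarding is to re-encode the text chunks as SSE lines yourself. The generator below is framework-agnostic and illustrative; with FastAPI, for example, you might wrap it in a `StreamingResponse` with `media_type="text/event-stream"`:

```python
import json

def sse_events(text_stream):
    """Convert an iterable of text chunks into Server-Sent Events lines."""
    for chunk in text_stream:
        yield f"data: {json.dumps({'text': chunk})}\n\n"
    # Conventional sentinel so the browser knows the stream is complete
    yield "data: [DONE]\n\n"
```

In an endpoint you would pass `stream.text_stream` straight into `sse_events`, so bytes reach the browser as Claude generates them.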
Message Batches for Cost Reduction
The Message Batches API delivers a 50% cost reduction compared to real-time API calls by processing requests asynchronously, typically within 24 hours. For document classification, bulk summarization, and overnight enrichment jobs — any workload that does not need an immediate response — batching paired with Haiku 4.5 is the most cost-efficient pattern available in 2026.
The Message Batches API processes large volumes of requests asynchronously and delivers a 50% cost reduction compared to real-time API calls. If you have workloads that do not need an immediate response — document classification, bulk summarization, overnight data enrichment — batching is the most impactful cost optimization available.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"review-{i}",
            "params": {
                "model": "claude-haiku-4-5",
                "max_tokens": 64,
                "messages": [{
                    "role": "user",
                    "content": f"Classify sentiment (positive/negative/neutral): {review}"
                }]
            }
        }
        for i, review in enumerate(customer_reviews)
    ]
)

print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")
Batches are typically processed within 24 hours. You poll the batch endpoint to check status, then retrieve results when processing completes. Each result is keyed by the custom_id you provided, making it straightforward to match results back to your original records. For classification or enrichment jobs running on thousands of documents, batching combined with Haiku 4.5 is the most cost-efficient pattern available.
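The polling step can be wrapped in a small helper. This is a sketch under the assumption that `batches.retrieve` returns an object whose `processing_status` becomes `"ended"` when results are ready; verify the status values against the SDK's batch interface before using it:

```python
import time

def wait_for_batch(client, batch_id, interval=60.0, timeout=24 * 3600):
    """Poll a message batch until processing ends, then return it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            return batch
        time.sleep(interval)
    raise TimeoutError(f"Batch {batch_id} did not finish within {timeout}s")
```

Once the batch has ended, fetch the per-request results (keyed by your `custom_id`) and join them back to your source records.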
Rate Limits and Pricing Tiers
Anthropic rate limits scale automatically with your usage tier — new accounts start with lower requests-per-minute and tokens-per-minute caps that increase as your spend grows. For most teams building a prototype, default limits are not a constraint. High-volume production workloads should contact Anthropic directly. Haiku 4.5 is the lowest-cost tier; Opus 4 is premium; Sonnet 4 sits in the middle.
Anthropic's rate limits scale with your usage tier. New accounts start at lower limits on requests per minute (RPM) and tokens per minute (TPM). As you increase usage and spend, your limits scale automatically. For most teams building an initial prototype, the default limits are not a constraint. For high-volume production workloads, contact Anthropic directly to discuss your requirements.
| Model | Context Window | Best For | Cost Tier |
|---|---|---|---|
| Opus 4 | 1M tokens | Complex reasoning, agents, research | Premium |
| Sonnet 4 | 1M tokens | Production apps, chat, coding | Mid-tier |
| Haiku 4.5 | 200K tokens | Classification, summarization, preprocessing | Lowest |
Cost Optimization Strategies
- Use the right model for the task. Routing classification tasks to Haiku while sending complex reasoning to Sonnet can cut costs by 70%+ without sacrificing quality.
- Enable prompt caching. System prompts and repeated context are cached, reducing input token costs on high-frequency requests with stable context.
- Use batching for async workloads. The 50% cost reduction from the Batches API is significant at scale.
- Set max_tokens carefully. You are billed for output tokens generated, not the max_tokens ceiling — but setting it too low will truncate responses on complex tasks.
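The first strategy, routing by task type, can be as simple as a lookup table. The task labels below are hypothetical, and the model identifiers follow the ones used in this article's examples (the Opus id is a placeholder):

```python
# Hypothetical task labels mapped to model ids used elsewhere in this article
ROUTES = {
    "classify": "claude-haiku-4-5",
    "extract": "claude-haiku-4-5",
    "chat": "claude-sonnet-4-5",
    "review": "claude-sonnet-4-5",
    "deep_analysis": "claude-opus-4",  # placeholder id
}

def pick_model(task: str, default: str = "claude-sonnet-4-5") -> str:
    """Route a task label to the cheapest model that handles it well."""
    return ROUTES.get(task, default)
```

Even this crude router pays off: every call it sends to Haiku instead of Sonnet is a call billed at the lowest tier.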
Claude for Enterprise: AWS Bedrock and Google Vertex AI
Enterprise teams that need data residency, IAM-based access controls, or consolidated cloud billing can run Claude through AWS Bedrock or Google Cloud Vertex AI. Both use the same Messages API parameter structure as the direct Anthropic API — switching providers requires minimal code changes. Bedrock uses the AnthropicBedrock client with standard AWS IAM credentials; Vertex uses AnthropicVertex pointed at your GCP project.
Enterprise teams often prefer — or are required — to run AI inference within their existing cloud infrastructure rather than sending data to a third-party API. Both AWS Bedrock and Google Cloud Vertex AI offer Claude models as managed services, letting you use Claude under your existing cloud billing agreements with the data residency and access controls your security team requires.
Claude on AWS Bedrock
AWS Bedrock gives you access to Claude through the standard boto3 client. No Anthropic API key is required — authentication uses your standard IAM credentials. This is the right path if your infrastructure lives on AWS and your team already manages permissions through IAM roles.
import anthropic

# Uses IAM credentials automatically from your AWS environment
client = anthropic.AnthropicBedrock(
    aws_region="us-east-1"
)

message = client.messages.create(
    model="anthropic.claude-sonnet-4-5-20251101-v1:0",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Summarize the key risks in this contract."
    }]
)
Claude on Google Vertex AI
Google Vertex AI access uses the AnthropicVertex client, pointed at your GCP project and region. The same Messages API structure applies — switching between Anthropic direct, Bedrock, and Vertex requires minimal code changes, typically just swapping the client constructor and model identifier.
from anthropic import AnthropicVertex

client = AnthropicVertex(
    region="us-east5",
    project_id="your-gcp-project-id"
)

message = client.messages.create(
    model="claude-sonnet-4-5@20251101",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Analyze this financial statement."
    }]
)
Building a Real App: Customer Support Bot Walkthrough
This walkthrough builds a production customer support bot combining three Claude API features: a system prompt defining the bot's persona, a tool for order lookup via function calling, and a streaming response loop managing multi-turn conversation state. This is the pattern most teams deploy in their first week of a Claude integration.
Let us put all of these pieces together and build something real: a customer support bot that looks up order status, handles multi-turn conversations, and streams its responses to the frontend. This is the kind of application that teams build in week one of a Claude integration.
The architecture has three components: a system prompt that defines the bot's persona and behavior, a tool definition for order lookup, and a streaming response loop that handles multi-turn state. Here is the core implementation:
import anthropic
import json

SYSTEM_PROMPT = """You are Maya, a customer support specialist for ShopDirect.
You are warm, concise, and solution-oriented. You never make up
information — if you don't know something, you say so and offer
to escalate to a human agent. You always address customers by
name when you know it."""

ORDER_TOOL = {
    "name": "get_order_status",
    "description": "Look up a customer's order status and tracking info.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID"},
            "customer_email": {"type": "string", "description": "Customer email"}
        },
        "required": ["order_id"]
    }
}

class SupportBot:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.history = []

    def chat(self, user_message: str) -> str:
        self.history.append({
            "role": "user",
            "content": user_message
        })

        # First pass: check if Claude wants to use a tool
        response = self.client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            system=SYSTEM_PROMPT,
            tools=[ORDER_TOOL],
            messages=self.history
        )

        if response.stop_reason == "tool_use":
            self._handle_tool_call(response)
            # Re-run with the tool result to get the final answer
            response = self.client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=2048,
                system=SYSTEM_PROMPT,
                tools=[ORDER_TOOL],
                messages=self.history
            )

        # Join all text blocks rather than assuming the first block is text
        assistant_text = "".join(
            b.text for b in response.content if b.type == "text"
        )
        self.history.append({
            "role": "assistant",
            "content": assistant_text
        })
        return assistant_text

    def _handle_tool_call(self, response):
        tool_block = next(
            b for b in response.content
            if b.type == "tool_use"
        )
        result = fetch_order_from_db(tool_block.input)
        self.history.append({
            "role": "assistant",
            "content": response.content
        })
        self.history.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_block.id,
                "content": json.dumps(result)
            }]
        })
This is a production-ready pattern. The SupportBot class manages conversation history, handles the tool use loop, and can be dropped into any web framework. Add streaming to the final response pass, wire the fetch_order_from_db function to your actual order management system, and you have a deployable customer support bot.
What to Build Next
- Add a create_refund_request tool to let the bot initiate refunds without human intervention
- Add a search_help_center tool that performs semantic search over your documentation
- Implement a conversation summarization step that compresses long histories to save tokens
- Add a fallback escalation path that hands off to a human agent when Claude's confidence is low
The bottom line: The Claude API is one of the most capable and developer-friendly LLM APIs available in 2026. Start with Sonnet 4 as your production default, use tool use to connect Claude to your real data and systems, enable streaming for interactive UIs, and route bulk async workloads through the Batches API for a 50% cost reduction. Most teams are shipping meaningful AI features within a week of getting their first API key.
Frequently Asked Questions
Which Claude model should I use for my application?
The right model depends on your use case and budget. Claude Opus 4 delivers the highest reasoning quality for complex tasks like legal analysis, research synthesis, and multi-step agentic workflows — use it when quality is non-negotiable and cost is secondary. Claude Sonnet 4 is the best all-around choice for most production applications: it balances strong intelligence with fast inference and reasonable cost per token, making it ideal for customer support bots, coding assistants, and document analysis pipelines. Claude Haiku 4.5 is the right choice when you need maximum speed and minimum cost — classification tasks, lightweight summarization, and high-volume preprocessing where sub-second latency matters.
How much does the Claude API cost in 2026?
Anthropic prices Claude on a per-token basis with separate rates for input and output tokens. Haiku 4.5 is the most affordable tier. Sonnet 4 sits in the mid-tier at a moderate premium over Haiku. Opus 4 is the premium tier priced for workloads where response quality is the primary constraint. The Message Batches API delivers a 50% cost reduction for non-real-time asynchronous workloads. For current published prices, check anthropic.com/pricing — rates are updated periodically as the model family evolves.
Can I use Claude on AWS or Google Cloud?
Yes. Claude models are available through both Amazon Bedrock and Google Cloud Vertex AI. AWS Bedrock access uses the AnthropicBedrock client — no Anthropic API key required, just standard AWS IAM credentials. Google Vertex AI access works through the AnthropicVertex client pointed at your GCP project and region. Both integrations support the same Messages API parameter structure as the direct Anthropic API, so switching between providers requires minimal code changes.
What is tool use in the Claude API?
Tool use (also called function calling) lets Claude interact with external systems — databases, REST APIs, calculators, code interpreters — and incorporate the results into its responses. You pass a list of tool definitions with each API call. When Claude decides a tool would help, it returns a tool_use content block with structured arguments instead of plain text. Your code executes the actual function, passes the result back in a tool_result message, and Claude incorporates that result into its final answer. This cycle is the foundation of all Claude-based AI agents.