How to Build an AI Agent with Python in 2026

Our Take: AI agents are the single most important skill shift in software engineering right now. If you can build a tool-using agent loop, you can automate work that used to require entire teams. This guide gives you the real architecture — not toy demos, but the same patterns running in production at companies processing millions of agent tasks per month. Start with the 40-line version, understand every line, then scale up.

In This Article

  1. What Is an AI Agent (And What It Isn't)
  2. The Agent Architecture: Observe-Think-Act
  3. Build a Minimal Agent in 40 Lines
  4. Adding Tools Your Agent Can Call
  5. Giving Your Agent Memory
  6. Planning and Multi-Step Reasoning
  7. LangChain vs. Direct SDK — When to Use What
  8. Production Deployment Patterns
  9. Cost Management and Optimization
  10. 5 Mistakes That Kill Agent Reliability

What Is an AI Agent (And What It Isn't)

An AI agent is not a chatbot with extra steps. A chatbot takes a message and returns a message. An agent takes a goal and executes a plan — calling APIs, reading files, querying databases, writing code, checking its own work, and looping until the job is done.

The difference is the loop. A chatbot is one pass: input → output. An agent runs a cycle:

1. Observe — Read the current state. What data do I have? What did my last action return?
2. Think — Given my goal and what I've observed, what should I do next?
3. Act — Call a tool, run code, make an API request, or return the final result.
4. Repeat — Feed the action result back in and loop until done.

In 2026, the best agents run on models like Claude Opus, GPT-4o, and Gemini 2.0 Ultra — models that support structured tool use natively. The model doesn't just generate text that looks like a function call; it returns a structured JSON object that your code can parse and execute deterministically.
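
To make "structured tool use" concrete, here is the rough shape of a tool call as it appears in the response content. The field names follow the Anthropic Messages API used later in this guide; the ID value is hypothetical, and other providers use slightly different shapes.

```python
# Illustrative shape of a structured tool call returned by a
# tool-use capable model (field names per the Anthropic Messages API).
tool_call = {
    "type": "tool_use",
    "id": "toolu_01A",            # hypothetical tool-call ID
    "name": "get_weather",
    "input": {"city": "Denver"},  # already-parsed JSON arguments
}

# Your code dispatches on it deterministically -- no regex parsing:
if tool_call["type"] == "tool_use":
    print(f"call {tool_call['name']} with {tool_call['input']}")
```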

Key Insight The quality of your agent is 80% determined by the quality of your tools and tool descriptions — not the model. A mediocre model with great tools beats a great model with bad tools every time.

The Agent Architecture: Observe-Think-Act

Every production agent — from Claude Code to Devin to custom enterprise agents — runs the same core loop. Here's the architecture:

The Core Agent Loop
  1. Send a message to the LLM with: system prompt + conversation history + available tools
  2. The LLM responds with either a text message (done) or a tool call (keep going)
  3. If tool call: execute the tool, append the result to conversation history, go to step 1
  4. If text message: return the result to the user

That's it. Every agent framework — LangChain, CrewAI, AutoGen, custom implementations — is a variation on this loop. The differences are in how they handle tool routing, memory, error recovery, and multi-agent coordination. But the loop is always the same.

Build a Minimal Agent in 40 Lines

Let's build a working agent with the Anthropic Python SDK. This agent can use tools, loop until it's done, and handle multi-step tasks. No framework needed.

First, install the SDK:

Terminal
pip install anthropic

Now the agent:

agent.py
import anthropic
import json

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

def execute_tool(name, tool_input):
    """Route tool calls to actual implementations."""
    if name == "get_weather":
        # In production, this calls a real weather API
        return f"72°F, sunny in {tool_input['city']}"
    return f"Unknown tool: {name}"

def run_agent(user_message):
    messages = [{"role": "user", "content": user_message}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        # Collect assistant response
        messages.append({"role": "assistant", "content": response.content})

        # If the model is done (no tool calls), return the text
        if response.stop_reason == "end_turn":
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute each tool call and send results back
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        messages.append({"role": "user", "content": tool_results})

# Run it
print(run_agent("What's the weather in Denver and NYC?"))

Run this and the agent will make two tool calls — one for Denver, one for NYC — then synthesize the results into a natural response. That's an agent. It decided what tools to call, called them, and used the results to answer.

Adding Tools Your Agent Can Call

The power of an agent comes from its tools. Here are the tool categories that matter most in production:

- Data Retrieval: SQL queries, API calls, file reads, web scraping. Use when the agent needs information it doesn't have.
- Data Mutation: write files, update databases, send emails, create tickets. Use when the agent needs to take action in the world.
- Computation: run Python code, math calculations, data transformations. Use when the agent needs precise calculations (LLMs are bad at math).
- Search: vector search, web search, document search. Use when the agent needs to find relevant information in large datasets.
- Verification: run tests, lint code, validate schemas, check URLs. Use when the agent needs to verify its own work.

The most important rule for tool design: write tool descriptions like you're explaining them to a new hire. The model reads these descriptions to decide when and how to call each tool. Vague descriptions produce bad tool selection. Specific descriptions with examples produce reliable agents.

Good vs. Bad Tool Descriptions
# BAD — vague, no guidance on when to use it
{
    "name": "search",
    "description": "Search for things"
}

# GOOD — specific, explains when and how
{
    "name": "search_knowledge_base",
    "description": "Search the company knowledge base for internal documentation, policies, and procedures. Use this when the user asks about company-specific information that wouldn't be in your training data. Returns the top 5 most relevant document chunks with source URLs. Input should be a natural language query, not keywords."
}

Giving Your Agent Memory

Agents without memory forget everything between runs. There are three types of memory that matter:

1. Conversation Memory (Short-Term)

This is the conversation history — the messages array in our code above. The model sees everything from the current session. This is free (it's just the context window) but limited by the model's context length.
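
Before adding anything fancier, you can keep the context window bounded with a simple sliding window over the messages list. This is a sketch of that baseline (it drops old context outright, which is exactly the limitation the next pattern fixes):

```python
def trim_history(messages, max_messages=20):
    """Keep only the most recent messages (a simple sliding window).

    Caveat: if your agent uses tools, trim on turn boundaries so a
    tool_result message is never separated from the tool_use it answers.
    """
    if len(messages) <= max_messages:
        return messages
    return messages[-max_messages:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
trimmed = trim_history(history)
print(len(trimmed))  # 20
```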

2. Summary Memory (Medium-Term)

When conversations get long, compress older messages into summaries. This preserves important context without burning your entire context window on old messages.

Summary Memory Pattern
def compress_history(messages, keep_recent=10):
    """Summarize old messages, keep recent ones verbatim."""
    if len(messages) <= keep_recent:
        return messages

    old = messages[:-keep_recent]
    recent = messages[-keep_recent:]

    summary = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheap model for summaries
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation concisely:\n{json.dumps(old)}"
        }]
    )

    return [
        {"role": "user", "content": f"[Previous context: {summary.content[0].text}]"},
        *recent
    ]

3. Persistent Memory (Long-Term)

Store facts, preferences, and learned information in a vector database or simple file store. When starting a new conversation, retrieve relevant memories and inject them into the system prompt.
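
Before reaching for a vector database, a plain JSON file is often enough. This is a minimal sketch of that idea; the file name, schema, and helper names here are illustrative, not a standard API:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # illustrative location

def save_memory(key, value):
    """Persist a fact so future sessions can use it."""
    store = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    store[key] = value
    MEMORY_FILE.write_text(json.dumps(store, indent=2))

def load_memories():
    """Load all stored facts."""
    if not MEMORY_FILE.exists():
        return {}
    return json.loads(MEMORY_FILE.read_text())

def build_system_prompt(base_prompt):
    """Inject stored facts into the system prompt at session start."""
    memories = load_memories()
    if not memories:
        return base_prompt
    facts = "\n".join(f"- {k}: {v}" for k, v in memories.items())
    return f"{base_prompt}\n\nKnown facts about this user:\n{facts}"
```

Swap the file store for a vector database only once keyword-style lookup by key stops being enough.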

Practical Tip: Start with just conversation memory. Add summary compression when your conversations regularly exceed 50 messages. Add persistent memory only when your agent needs to remember things across separate sessions. Each layer adds complexity — don't over-engineer early.

Planning and Multi-Step Reasoning

Simple agents react to each step independently. Better agents plan before acting. The difference is dramatic for complex tasks.

The simplest planning pattern is plan-then-execute: ask the model to write a step-by-step plan first, then execute each step.

Plan-Then-Execute Pattern
PLANNING_PROMPT = """Before taking any action, write a brief plan:
1. What is the user's goal?
2. What information do I need?
3. What tools will I call, and in what order?
4. How will I verify the result?

Write the plan, then execute it step by step."""

def run_planning_agent(user_message):
    messages = [
        {"role": "user", "content": f"{PLANNING_PROMPT}\n\nTask: {user_message}"}
    ]
    return agent_loop(messages)  # same loop as before

For more complex tasks, use ReAct (Reasoning + Acting) — the agent writes its reasoning before each action, creating a visible chain of thought that improves decision-making and makes debugging easier.
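
With the loop we already have, ReAct behavior can be induced entirely through the system prompt. The wording below is an illustrative sketch, not a canonical ReAct prompt, and `agent_loop` is the same assumed helper as in the plan-then-execute example:

```python
# Illustrative ReAct-style system prompt: the model writes a short
# "Thought:" line before every tool call, so each decision is visible
# in the transcript and in your logs.
REACT_PROMPT = """You are an agent that solves tasks step by step.

Before every tool call, write one line starting with "Thought:" that
explains why this tool, with these inputs, is the right next step.
After observing a tool result, write a new Thought before acting again.
When you have enough information, give the final answer directly."""

def run_react_agent(user_message):
    messages = [{"role": "user", "content": user_message}]
    # Hypothetical: reuses the same loop as before, passing REACT_PROMPT
    # as the system parameter.
    return agent_loop(messages, system=REACT_PROMPT)
```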

LangChain vs. Direct SDK — When to Use What

This is the most common question in 2026 agent development. Here's the honest answer:

- Direct SDK (Anthropic/OpenAI): best for single-purpose agents, full control, performance-critical paths, and learning how agents work. Avoid when you need 15+ integrations (vector stores, doc loaders, etc.) and don't want to build them yourself.
- LangChain / LangGraph: best for multi-agent orchestration, complex workflows with branching, and rapid prototyping with many integrations. Avoid for simple agents, performance-critical paths, or when you need to understand every line of code.
- Claude Agent SDK: best for production agents on Anthropic models, with built-in guardrails and managed tool execution. Avoid for multi-model agents or when you need framework-agnostic code.

Our recommendation: start with the direct SDK. Build the 40-line agent above. Understand every line. Then, when you hit a real complexity wall — not an imagined one — reach for a framework. Most production agents we see in the wild are 100-300 lines of direct SDK code. They don't need a framework.

Production Deployment Patterns

Moving an agent from a script to production requires handling three things that don't exist in demos:

1. Error Recovery

Tools fail. APIs time out. Models hallucinate tool names. Your agent loop needs to handle all of this gracefully.

Robust Tool Execution
import time

def execute_tool_safely(name, tool_input, max_retries=2):
    for attempt in range(max_retries + 1):
        try:
            result = execute_tool(name, tool_input)
            return {"status": "success", "result": result}
        except Exception as e:
            if attempt == max_retries:
                return {"status": "error", "error": str(e)}
            time.sleep(1)  # brief backoff before retrying

2. Timeout and Cost Guardrails

An agent can loop forever if something goes wrong. Always set maximum iterations and spending limits.

Agent Guardrails
MAX_ITERATIONS = 25
MAX_TOKENS_TOTAL = 100_000

def run_agent_safe(user_message):
    messages = [{"role": "user", "content": user_message}]
    total_tokens = 0

    for i in range(MAX_ITERATIONS):
        response = client.messages.create(...)
        total_tokens += response.usage.input_tokens + response.usage.output_tokens

        if total_tokens > MAX_TOKENS_TOTAL:
            return "Agent stopped: token budget exceeded"

        if response.stop_reason == "end_turn":
            return extract_text(response)

        # ... tool execution loop

    return "Agent stopped: max iterations reached"

3. Observability

You need to see what your agent did, why, and how long each step took. Log every tool call, every model response, and every decision point. Tools like LangSmith, Helicone, and Braintrust make this easier, but even structured logging to a file works.
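
Even without a dedicated tracing tool, a few lines of structured logging capture the essentials. This sketch uses Python's standard logging module; the stub `execute_tool` stands in for the router defined in the minimal agent:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

def execute_tool(name, tool_input):
    """Stand-in for the tool router defined earlier."""
    return f"result for {name}"

def execute_tool_logged(name, tool_input):
    """Wrap a tool call so every call emits one structured JSON log line."""
    start = time.monotonic()
    status = "success"
    try:
        return execute_tool(name, tool_input)
    except Exception as e:
        status = f"error: {e}"
        raise
    finally:
        log.info(json.dumps({
            "event": "tool_call",
            "tool": name,
            "input": tool_input,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 1),
        }))
```

One JSON object per tool call is enough to answer "what did the agent do, and how long did each step take?" after the fact.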

Cost Management and Optimization

- $0.01–0.05: simple agent run (Sonnet, 3–5 tool calls)
- $0.50–2.00: complex agent run (Opus, 20–50 tool calls)
- 80–90%: cost reduction with prompt caching
- 50%: savings with the batch API (non-urgent tasks)

The three biggest cost levers:

1. Model selection: route routine steps (summaries, classification, simple lookups) to a cheap model like Haiku and reserve Opus for steps that need deep reasoning.
2. Prompt caching: cache the static prefix (system prompt and tool definitions) so repeated loop iterations don't pay full price for them.
3. Batching: send non-urgent workloads through the batch API at the discounted rate.
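
Prompt caching is usually the biggest single win for agent loops, because the system prompt and tool definitions are resent on every iteration. A sketch using the Anthropic SDK's `cache_control` field (exact savings depend on prompt size; the prompt text here is illustrative):

```python
# Sketch: mark the static prefix (tool definitions + system prompt) as
# cacheable so loop iterations after the first read it at the cached rate.
request = dict(
    model="claude-sonnet-4-6",  # same model ID as the minimal agent
    max_tokens=4096,
    tools=[
        {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
            # Marking the last tool caches the full tool-definition block
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": "You are a helpful weather agent.",
            "cache_control": {"type": "ephemeral"},  # cache the system prompt
        }
    ],
    messages=[{"role": "user", "content": "Weather in Denver?"}],
)
# client.messages.create(**request)  # requires an API key to actually run
```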

5 Mistakes That Kill Agent Reliability

1. Too many tools — Giving an agent 50 tools is like giving a new employee 50 apps on their first day. They'll pick the wrong one. Start with 5-7 focused tools and add more only when needed.
2. Vague tool descriptions — "Search for data" tells the model nothing. "Search the PostgreSQL database for customer records by email, name, or account ID. Returns up to 10 matching rows with all columns." tells it everything.
3. No error handling in tools — When a tool throws an exception, the agent gets a cryptic Python traceback. Return structured error messages instead: {"error": "Customer not found", "suggestion": "Try searching by email instead of name"}
4. No iteration limits — An agent without guardrails will loop until it hits your API spending limit. Always cap iterations and token usage.
5. Testing with toy examples only — Your agent works on "what's the weather?" but breaks on real tasks. Test with messy, ambiguous, multi-step inputs that mirror actual usage. Build an eval suite early.

Ready to Master AI Agents?

Our hands-on bootcamp covers agent architecture, tool use, production deployment, and more — with real code, not slides. 5 cities. $1,490. 40 students max.

Reserve Your Seat