Module 1: Foundations of AI Engineering
Building with APIs

Build with APIs, Not Chat Apps

There's a real difference between using ChatGPT, Gemini, Claude, or HuggingChat interactively and building against an API. When you build against an API, you manage request/response structure, errors and retries, multi-tenant state, and how tools are called and validated. The model becomes a component in your system, not a product you interact with.

This lesson bridges that gap. You'll take the prompt contracts you wrote in the previous lesson and implement them as API-backed services. By the end, you'll have a working service that calls a model, enforces output schemas, handles failures, and separates state between users.

What you'll learn

  • Make programmatic model API calls with structured message sequences
  • Enforce output shape with structured outputs / JSON schema
  • Handle basic model-call failures and understand where rate limits and retries fit
  • Manage conversation state across requests using session or conversation IDs
  • Make a basic tool call through the API
  • Compare the same operation across the supported provider variants: OpenAI, Gemini, Anthropic, Hugging Face, Ollama (Local), and Ollama (Cloud)

Concepts

Structured outputs: a feature of model APIs that constrains the model's response to follow a predictable format. There are two levels:

  • JSON mode (response_format={"type": "json_object"}): guarantees valid JSON but does not enforce a specific schema. You still rely on the prompt to guide the structure, and must validate the shape yourself.
  • Schema-constrained mode (response_format={"type": "json_schema", ...}): guarantees the response matches an exact JSON schema you define. The API rejects responses that do not conform. This is the stronger guarantee.

In this lesson we'll start with JSON mode (simpler, works with a wider range of models) and then see how schema-constrained mode tightens the contract. Both turn your prompt contracts into machine-checkable outputs.

Tool calling / function calling: the model's ability to request execution of a specific function with structured arguments, rather than just generating text. When the model encounters a task it cannot do alone (look up a user, read a file, search a database), it can emit a tool call with the function name and arguments. Your code executes the function and returns the result. The model then continues with that result in context.

Conversation state: in an API-backed system, you manage conversation history explicitly. Each request includes the full message sequence (system, user, assistant, tool messages). The API is stateless. If you do not send the history, the model does not remember it. This means you control what the model sees, which is powerful but requires deliberate state management.

Rate limiting: API providers enforce limits on how many requests you can make per minute or per day. When you exceed the limit, the API returns an error (usually HTTP 429). Your code must handle this gracefully: back off, wait, and retry. We'll cover rate limiting in more depth as a safety and cost topic in the Security Basics lesson.
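
Here is a minimal sketch of that back-off-and-retry loop. The wait times are illustrative, and the broad except is a stand-in for your provider's rate-limit exception (for example, openai.RateLimitError on HTTP 429):

import time

def call_with_backoff(make_call, max_attempts=4):
    """Retry a model call with exponential backoff.

    make_call is a zero-argument callable that performs one API request.
    """
    for attempt in range(max_attempts):
        try:
            return make_call()
        except Exception:  # in real code, catch only the provider's rate-limit error
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying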

Walkthrough

Project setup

Use Choosing a Provider if you need help mapping this lesson to OpenAI Platform, Gemini API, Anthropic's developer platform, Hugging Face, or Ollama Cloud. The concepts in this lesson are provider-agnostic even when individual code blocks use a specific SDK surface.

You are extending the FastAPI project you built in the Python and FastAPI lesson. If you still have your ai-eng-foundations/ directory with app.py and test_app.py, continue from there. If not, go back and complete that lesson first. This one builds directly on it.

Add the model SDKs to your existing project:

cd ai-eng-foundations
source .venv/bin/activate
pip install openai google-genai anthropic ollama huggingface_hub

The huggingface_hub package is included because some Hugging Face variants later in this lesson use InferenceClient directly. For the simplest Hugging Face path, you can also reuse the openai SDK with Hugging Face's router base URL and HF_TOKEN.

Set your API keys:

export OPENAI_API_KEY="sk-..."

You do not need every provider path configured to finish this lesson. Install the SDKs for the providers you actually plan to use, and use Choosing a Provider for the exact hosted-platform distinction:

  • OpenAI Platform means platform.openai.com, not chatgpt.com
  • Gemini API means https://aistudio.google.com/ plus https://ai.google.dev/, not gemini.google.com
  • Anthropic means https://platform.claude.com/ plus https://platform.claude.com/docs/, not claude.ai
  • Hugging Face means https://huggingface.co/settings/tokens for HF_TOKEN and https://huggingface.co/docs/inference-providers/index for the API, not huggingface.co/chat
  • GitHub Models means GitHub-hosted inference with GITHUB_TOKEN and publisher/model IDs, not GitHub Copilot SDK
  • Ollama Cloud means https://ollama.com/api, not local Ollama on localhost

If you are not using OpenAI, the least-friction paths are usually:

  • Gemini: use google-genai directly with GEMINI_API_KEY
  • Hugging Face: keep the openai SDK and point it at Hugging Face's OpenAI-compatible router with HF_TOKEN
  • GitHub Models: keep the openai SDK and point it at https://models.github.ai/inference with the GitHub headers and GITHUB_TOKEN
  • Ollama: keep the lesson concepts the same and swap only the client/call layer to Ollama's API or Python client
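
As a reference point, here is a minimal sketch of the "swap only the client layer" idea for two of those paths. The Hugging Face router base URL and both model IDs are illustrative assumptions; check each provider's docs for the exact values, and note that GitHub Models may also require the GitHub-specific headers mentioned above.

import os
from openai import OpenAI

# Hugging Face via the OpenAI-compatible router (base URL is an assumption; verify in the HF docs)
hf_client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

# GitHub Models via the endpoint listed above, authenticated with GITHUB_TOKEN
gh_client = OpenAI(
    base_url="https://models.github.ai/inference",
    api_key=os.environ["GITHUB_TOKEN"],
)

# The call shape stays the same; only the client construction and model ID change.
resp = hf_client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)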

By the end of this lesson, your project will have grown to:

ai-eng-foundations/
├── app.py              # FastAPI service — extended with model-backed endpoints
├── conversations.py    # In-memory conversation store (new)
├── tools.py            # Tool definitions and implementations (new)
├── cross_provider.py   # Cross-provider comparison script (new)
├── test_app.py         # Tests — extended
└── requirements.txt

Your /health, /echo, and /summarize-request endpoints from lesson 1 are still there. You are adding model-backed endpoints alongside them.

Hosted APIs today, same patterns everywhere. In this lesson you call hosted APIs plus one explicit local Ollama variant where that is the clearer path. Each code block offers tabs for the supported provider variants. Later, if you self-host an open-weight model using a serving engine like vLLM, Ollama, or llama.cpp, you point the same style of client at your own endpoint instead of a hosted platform. The code surface changes. The concepts do not.
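
As a concrete picture of that swap, the sketch below points the same style of client at a local Ollama server. The port is Ollama's default OpenAI-compatible endpoint, and the model name is an assumption (use whatever you have pulled locally); a self-hosted vLLM server works the same way with its own base URL.

from openai import OpenAI

# Local Ollama exposes an OpenAI-compatible API at /v1; the SDK requires a key but Ollama ignores it.
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = local_client.chat.completions.create(
    model="llama3.1",  # example: any model you have pulled with `ollama pull`
    messages=[{"role": "user", "content": "Summarize what FastAPI is in one sentence."}],
)
print(resp.choices[0].message.content)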

You've already built a retry/backoff wrapper in the Python and FastAPI lesson. In this lesson, focus on model call structure, schemas, tool flow, and conversation state. Reuse the retry pattern from lesson 1 around your model calls; the dedicated rate-limit and backoff treatment comes in Security Basics.

Your first programmatic model call

Add a summarizer endpoint to your existing app.py. Your /health, /echo, and /summarize-request endpoints from lesson 1 stay; you are adding the new endpoint alongside them. The Pydantic models are the same regardless of provider:

# Add these imports to the top of app.py
import json
from fastapi import HTTPException

# --- New models (add below your existing models) ---

class SummarizeTextRequest(BaseModel):
    text: str

class SummarizeTextResponse(BaseModel):
    summary: str
    word_count: int
    key_topics: list[str]

Now add the client setup and endpoint. Pick your provider:

from openai import OpenAI

# Add this after your existing app = FastAPI() line
client = OpenAI()  # reads OPENAI_API_KEY from environment


@app.post("/summarize", response_model=SummarizeTextResponse)
def summarize(request: SummarizeTextRequest):
    """Summarize free-form text through the OpenAI-backed API endpoint.

    Args:
        request: Request model containing the source text to summarize.

    Returns:
        A validated summary payload with the summary, source word count, and key topics.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a summarizer. Given text, return a JSON object with: "
                        "summary (2-3 sentences), word_count (of the original text), "
                        "and key_topics (list of 2-4 topics). "
                        "Return ONLY valid JSON, no other text."
                    ),
                },
                {"role": "user", "content": request.text},
            ],
            temperature=0,
            response_format={"type": "json_object"},
        )
        result = json.loads(response.choices[0].message.content)
        return SummarizeTextResponse(**result)
    except Exception as e:
        raise HTTPException(status_code=502, detail=f"Model call failed: {e}")

Run and test:

uvicorn app:app --reload
curl -X POST http://localhost:8000/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "FastAPI is a modern Python web framework for building APIs. It uses type hints for validation and generates OpenAPI documentation automatically. It is built on top of Starlette and Pydantic."}'

Expected output (content will vary, structure should not):

{
  "summary": "FastAPI is a Python web framework that uses type hints for validation and auto-generates API documentation. It is built on Starlette and Pydantic.",
  "word_count": 30,
  "key_topics": ["FastAPI", "Python", "API", "Pydantic"]
}

If you get a response with all three fields, the model call is working. If you get a 502, check your API key and network connection.

Multi-turn conversations

Add conversation state management. Create conversations.py:

# conversations.py

# In-memory store — fine for learning, not for production
_conversations: dict[str, list[dict]] = {}

SYSTEM_PROMPT_TEXT = "You are a helpful assistant. Keep responses concise."


def get_history(conversation_id: str) -> list[dict]:
    """Return one conversation thread, creating it on first access.

    Args:
        conversation_id: Stable identifier for the conversation thread.

    Returns:
        The mutable list that stores prior user and assistant messages for the thread.
    """
    if conversation_id not in _conversations:
        _conversations[conversation_id] = []
    return _conversations[conversation_id]


def append_message(conversation_id: str, role: str, content: str):
    """Append one chat turn to the in-memory conversation store.

    Args:
        conversation_id: Stable identifier for the conversation thread.
        role: Message role to record, usually ``user`` or ``assistant``.
        content: Text content for the new message.

    Returns:
        None. The conversation store is updated in place.
    """
    history = get_history(conversation_id)
    history.append({"role": role, "content": content})

Add the conversation endpoint to app.py. The conversation store keeps only the user and assistant turns. Each provider example below applies the system prompt in the way its API expects.

Pick your provider:

# Add to app.py
from conversations import get_history, append_message, SYSTEM_PROMPT_TEXT


class ChatRequest(BaseModel):
    conversation_id: str
    message: str

class ChatResponse(BaseModel):
    conversation_id: str
    response: str
    message_count: int


@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    """Continue one conversation through the OpenAI-backed chat endpoint.

    Args:
        request: Chat request containing the conversation ID and latest user message.

    Returns:
        A response with the assistant reply and the updated message count for the thread.
    """
    # Get history, add user message
    history = get_history(request.conversation_id)
    append_message(request.conversation_id, "user", request.message)
    messages = [{"role": "system", "content": SYSTEM_PROMPT_TEXT}, *history]

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0,
        )
        assistant_msg = response.choices[0].message.content
        append_message(request.conversation_id, "assistant", assistant_msg)

        return ChatResponse(
            conversation_id=request.conversation_id,
            response=assistant_msg,
            message_count=len(history),
        )
    except Exception as e:
        raise HTTPException(status_code=502, detail=f"Model call failed: {e}")

Test multi-turn behavior:

# First message
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "test-1", "message": "My name is Kal-El."}'

# Second message — the model should remember the name
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "test-1", "message": "What is my name?"}'

# Different conversation — the model should NOT know the name
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"conversation_id": "test-2", "message": "What is my name?"}'

The second call should respond with "Kal-El." The third call (different conversation ID) should not know the name. This confirms your service owns the conversation state, not the model.
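
If you want a repeatable check in test_app.py, here is a minimal sketch of a state-isolation test. It mocks the OpenAI client so no real API call happens; it assumes app.py imports cleanly (the module-level OpenAI() client still needs OPENAI_API_KEY set, even to a dummy value), and the helper names are illustrative.

# test_app.py — sketch of a state-isolation test with the model call mocked out
from unittest.mock import MagicMock

from fastapi.testclient import TestClient

import app as app_module
from app import app

test_client = TestClient(app)


def fake_completion(text: str):
    """Build an object shaped like response.choices[0].message.content."""
    message = MagicMock()
    message.content = text
    choice = MagicMock()
    choice.message = message
    response = MagicMock()
    response.choices = [choice]
    return response


def test_conversations_do_not_share_state(monkeypatch):
    fake = MagicMock()
    fake.chat.completions.create.return_value = fake_completion("Hi there")
    monkeypatch.setattr(app_module, "client", fake)

    test_client.post("/chat", json={"conversation_id": "iso-a", "message": "My name is Kal-El."})
    test_client.post("/chat", json={"conversation_id": "iso-b", "message": "What is my name?"})

    # The most recent model call belongs to conversation iso-b and must not contain iso-a's history.
    sent = fake.chat.completions.create.call_args.kwargs["messages"]
    assert all("Kal-El" not in m["content"] for m in sent if m["role"] == "user")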

Schema-constrained extraction

Add a structured extraction endpoint. This uses structured output constraints to force the model to return exactly the shape you specify. The Pydantic models are the same regardless of provider:

# Add to app.py

class BugReport(BaseModel):
    title: str
    steps_to_reproduce: list[str]
    expected_behavior: str
    actual_behavior: str
    severity: str  # "low", "medium", "high", "critical"

class ExtractRequest(BaseModel):
    text: str

Notice the difference from the summarizer endpoint above: that one uses {"type": "json_object"} (JSON mode), which guarantees valid JSON but leaves the shape up to the model. This extraction endpoint uses schema-constrained mode, where the API guarantees the response matches your exact schema, including the severity enum. Some providers expose this through a different parameter name, but the concept is the same. Schema-constrained mode is the stronger contract: use it when you know the exact output shape, and use JSON mode when the shape needs to stay flexible.

Pick your provider:

OpenAI's response_format with json_schema type provides the strictest schema-constrained output.

@app.post("/extract/bug-report", response_model=BugReport)
def extract_bug_report(request: ExtractRequest):
    """Extract a structured bug report with the OpenAI-backed endpoint.

    Args:
        request: Request model containing the raw bug description text.

    Returns:
        A validated ``BugReport`` parsed from the model response.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Extract a structured bug report from the user's text. "
                        "Return JSON with: title, steps_to_reproduce (list of strings), "
                        "expected_behavior, actual_behavior, severity (low/medium/high/critical). "
                        "If information is missing, use 'Not specified'."
                    ),
                },
                {"role": "user", "content": request.text},
            ],
            temperature=0,
            response_format={
                "type": "json_schema",
                "json_schema": {
                    "name": "bug_report",
                    "strict": True,
                    "schema": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "steps_to_reproduce": {
                                "type": "array",
                                "items": {"type": "string"},
                            },
                            "expected_behavior": {"type": "string"},
                            "actual_behavior": {"type": "string"},
                            "severity": {
                                "type": "string",
                                "enum": ["low", "medium", "high", "critical"],
                            },
                        },
                        "required": [
                            "title",
                            "steps_to_reproduce",
                            "expected_behavior",
                            "actual_behavior",
                            "severity",
                        ],
                        "additionalProperties": False,
                    },
                },
            },
        )
        result = json.loads(response.choices[0].message.content)
        return BugReport(**result)
    except Exception as e:
        raise HTTPException(status_code=502, detail=f"Model call failed: {e}")

Test it:

curl -X POST http://localhost:8000/extract/bug-report \
  -H "Content-Type: application/json" \
  -d '{"text": "When I click the submit button on the login page nothing happens. I expected it to log me in or show an error. This is blocking all QA testing."}'

Expected: a JSON object with all five fields populated, with severity constrained to one of the four enum values. On OpenAI, Gemini, Anthropic, Hugging Face, and local Ollama, the structured-output setting is meant to guarantee the shape, and the Pydantic model gives you a second validation layer. On the Ollama Cloud path, which falls back to JSON mode, you rely on the prompt plus BugReport.model_validate_json(...) instead.
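
On a JSON-mode-only path, that fallback is a few lines inside the endpoint's try block. This sketch assumes raw_text holds the JSON string your provider returned:

from pydantic import ValidationError

# raw_text: the JSON string returned by a JSON-mode call (provider-specific)
try:
    return BugReport.model_validate_json(raw_text)
except ValidationError as exc:
    # A shape mismatch is treated like any other failed model call.
    raise HTTPException(status_code=502, detail=f"Model returned an invalid bug report: {exc}")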

Your first tool call

Create tools.py with a tool the model can call:

# tools.py

# Fake user database
USERS = {
    "u-101": {"name": "Kal-El Chen", "email": "kal-el@example.com", "role": "engineer"},
    "u-102": {"name": "Sam Park", "email": "sam@example.com", "role": "designer"},
}


def lookup_user(user_id: str) -> dict:
    """Return a fake user profile for the requested ID.

    Args:
        user_id: Identifier to look up in the in-memory user table.

    Returns:
        The matching user record, or an error payload when the ID is not found.
    """
    if user_id in USERS:
        return USERS[user_id]
    return {"error": f"User {user_id} not found"}


# Tool definition for the model API
TOOL_DEFINITIONS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_user",
            "description": "Look up a user by their ID and return their profile",
            "parameters": {
                "type": "object",
                "properties": {
                    "user_id": {
                        "type": "string",
                        "description": "The user ID, e.g. 'u-101'",
                    }
                },
                "required": ["user_id"],
            },
        },
    }
]

Add the tool-calling endpoint to app.py. The request/response models are the same regardless of provider:

# Add to app.py (json is already imported from the summarizer step)
from tools import lookup_user, TOOL_DEFINITIONS


class ToolChatRequest(BaseModel):
    message: str

class ToolChatResponse(BaseModel):
    response: str
    tools_called: list[str]

Now add the endpoint. The tool-call flow differs more across providers than simple completions do. Each has its own request/response shape for tool definitions, tool-call messages, and tool results. Pick your provider:

@app.post("/chat-with-tools", response_model=ToolChatResponse)
def chat_with_tools(request: ToolChatRequest):
    """Answer one message, calling ``lookup_user`` when the model asks for it.

    Args:
        request: Tool-chat request containing the latest user message.

    Returns:
        The final assistant response plus the list of tools invoked during the turn.
    """
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the lookup_user tool when the user asks about a person.",
        },
        {"role": "user", "content": request.message},
    ]

    try:
        # First call — model may request a tool
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=TOOL_DEFINITIONS,
            temperature=0,
        )

        msg = response.choices[0].message
        tools_called = []

        # If the model requested a tool call, execute it
        if msg.tool_calls:
            # Add the assistant's tool-call message to history
            messages.append(msg)

            for tool_call in msg.tool_calls:
                if tool_call.function.name == "lookup_user":
                    args = json.loads(tool_call.function.arguments)
                    result = lookup_user(args["user_id"])
                    tools_called.append("lookup_user")

                    # Add the tool result to history
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tool_call.id,
                        "content": json.dumps(result),
                    })

            # Second call — model uses the tool result to answer
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                tools=TOOL_DEFINITIONS,
                temperature=0,
            )
            msg = response.choices[0].message

        return ToolChatResponse(
            response=msg.content,
            tools_called=tools_called,
        )
    except Exception as e:
        raise HTTPException(status_code=502, detail=f"Model call failed: {e}")

Test the tool call flow:

# This should trigger a tool call
curl -X POST http://localhost:8000/chat-with-tools \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the email address for user u-101?"}'
# Expected: response mentions "kal-el@example.com", tools_called: ["lookup_user"]

# This should NOT trigger a tool call
curl -X POST http://localhost:8000/chat-with-tools \
  -H "Content-Type: application/json" \
  -d '{"message": "What is 2 + 2?"}'
# Expected: response answers "4", tools_called: []

The key observation: you sent the tool definition. The model decided to call it and provided the arguments. Your code executed the function. You sent the result back. The model used it to answer. This is the foundation for everything in Module 3.

Cross-provider exercise

Make the same bug report extraction call against Anthropic. Create a small standalone script:

# cross_provider.py
import json
from openai import OpenAI
from anthropic import Anthropic

bug_text = (
    "When I click the submit button on the login page nothing happens. "
    "I expected it to log me in or show an error. "
    "This is blocking all QA testing."
)

system_prompt = (
    "Extract a structured bug report from the user's text. "
    "Return JSON with: title, steps_to_reproduce (list of strings), "
    "expected_behavior, actual_behavior, severity (low/medium/high/critical). "
    "If information is missing, use 'Not specified'. Return ONLY valid JSON."
)

# --- OpenAI ---
openai_client = OpenAI()
openai_resp = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": bug_text},
    ],
    temperature=0,
    response_format={"type": "json_object"},
)
openai_result = json.loads(openai_resp.choices[0].message.content)
print("=== OpenAI ===")
print(json.dumps(openai_result, indent=2))

# --- Anthropic ---
anthropic_client = Anthropic()
anthropic_resp = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=system_prompt,
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "steps_to_reproduce": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "expected_behavior": {"type": "string"},
                    "actual_behavior": {"type": "string"},
                    "severity": {
                        "type": "string",
                        "enum": ["low", "medium", "high", "critical"],
                    },
                },
                "required": [
                    "title",
                    "steps_to_reproduce",
                    "expected_behavior",
                    "actual_behavior",
                    "severity",
                ],
                "additionalProperties": False,
            },
        },
    },
    messages=[
        {"role": "user", "content": bug_text},
    ],
)
anthropic_result = json.loads(anthropic_resp.content[0].text)
print("\n=== Anthropic ===")
print(json.dumps(anthropic_result, indent=2))

Run it:

python cross_provider.py

Compare the two outputs. Notice:

  • Message format: OpenAI uses messages with a system role; Anthropic uses a separate system parameter
  • Response structure: OpenAI nests content under choices[0].message.content; Anthropic uses content[0].text
  • Structured output: OpenAI uses response_format; Anthropic uses output_config
  • Output content: both should produce a valid bug report, but field values may differ

The concepts are the same; the API surfaces differ. You now know enough about both to avoid lock-in assumptions.

Optional extension:

  • Gemini: port the OpenAI extraction half to google-genai and compare response_format with Gemini's response_mime_type + response_schema (a sketch follows this list)
  • Hugging Face: rerun the OpenAI half of this script with the Hugging Face router base URL and HF_TOKEN
  • Ollama: rerun the extraction against Ollama's chat API/client using the same system prompt and compare the response shape and JSON reliability
  • If OpenAI or Anthropic is not your primary provider, invert the exercise: start from the provider you do have, then port one smaller call to any second provider you can access
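
For the Gemini option, a ported extraction might look like the sketch below. The model name is an example, and the config field names are the google-genai ones referenced above; verify both against the current Gemini API docs.

# gemini_extract.py — sketch of the extraction ported to google-genai
from google import genai
from google.genai import types
from pydantic import BaseModel


class BugReport(BaseModel):
    title: str
    steps_to_reproduce: list[str]
    expected_behavior: str
    actual_behavior: str
    severity: str


bug_text = (
    "When I click the submit button on the login page nothing happens. "
    "I expected it to log me in or show an error. This is blocking all QA testing."
)

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash",  # example model name
    contents=bug_text,
    config=types.GenerateContentConfig(
        system_instruction=(
            "Extract a structured bug report from the user's text. "
            "If information is missing, use 'Not specified'."
        ),
        response_mime_type="application/json",
        response_schema=BugReport,  # google-genai accepts a Pydantic model as the schema
    ),
)
print(response.text)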

Exercises

  1. Build the summarizer endpoint and confirm it returns structured JSON from a model call.
  2. Build the multi-turn conversation endpoint. Verify the model remembers context within a conversation and does not bleed state between conversations.
  3. Build the bug report extraction endpoint. Send it freeform text and confirm it returns all required fields.
  4. Build the tool-calling endpoint with lookup_user. Verify the model calls the tool when appropriate and skips it when not needed.
  5. Run cross_provider.py (or an equivalent script for the providers you actually have configured) and note at least three differences between two provider API surfaces.
  6. Port one smaller endpoint or extraction script to a third provider path. Note what changed in client setup, model naming, structured-output support, and response parsing.

Completion checkpoint

You can:

  • Call a model API programmatically and parse the response
  • Enforce a JSON schema on the model's output using structured outputs
  • Handle invalid model responses and basic call failures without crashing, and know where retry/backoff logic belongs
  • Maintain conversation state across multiple requests using a conversation ID
  • Execute a tool call flow: model requests tool -> your code runs it -> model continues with result
  • Show the same operation working against at least two supported providers, and explain what changes when you port it to other provider surfaces

What's next

Retrieval Basics. Model calls alone will not answer repo-specific questions, so the next lesson builds the simplest retrieval pipeline and lets you watch it fail in useful ways.

References

Start here

  • OpenAI API docs — the primary API reference for message structure, tool calling, and structured outputs
  • Gemini API quickstart — Gemini setup, auth, and first requests on the direct API


Glossary
Foundational terms

API (Application Programming Interface)
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
Chunking
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineering
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rot
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context window
The maximum number of tokens an LLM can process in a single request (input + output combined).
Embedding
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
Endpoint
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUF
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
Hallucination
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
Inference
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)
A lightweight text format for structured data. The lingua franca of API communication.
Lexical search
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
Metadata
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural network
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning model
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
Reranking
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
Schema
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System prompt
A special message that sets the model's behavior, role, and constraints for a conversation.
Temperature
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
Token
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-k
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector search
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.
Weights
The learned parameters inside a model. Changed during training, fixed during inference.
Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.
Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.
Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).
Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.
Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
Handoff
Passing control from one agent or specialist to another within an orchestrated system.
MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.
Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."
Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.
Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.
RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.
Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.
Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.
Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure, they don't change the model.
Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.
OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.
RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.
Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.
Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.
Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.
Router
A component that decides which specialist or workflow path to use for a given query.
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.
Thread memory
Conversation state that persists within a single session or thread.
Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.
Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.
DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.
Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.
Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.
Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.
Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.
Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."
QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.
Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.
RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.
SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.
TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.
Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.
Hosted API
The provider runs the model for you and you call it over HTTP.
Local inference
You run the model on your own machine.
Provider
The company or service that hosts a model API you call from code.
Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.
Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.
Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.