
Claude API 2026: Prompt Caching, Tool Use & Batches

A practical guide to the three Claude API features that separate toy prototypes from production integrations: prompt caching, tool use, and the Message Batches API.

The Claude API in 2026 is a different product from what it was a year ago. Three features now make or break a production integration: prompt caching (which cuts input costs by up to 90% on repeated context), tool use (which turns Claude into an agent that calls your code), and the Message Batches API (which processes thousands of requests at half price). If you are sending raw messages and not using any of these, you are leaving money on the table.

Who this is for

Backend developers who have already sent a request to an LLM API — OpenAI, Cohere, doesn’t matter — and are now onboarding to Anthropic’s SDK. The code examples are TypeScript-first, with Python equivalents for the patterns that differ meaningfully.

Pick your Claude API model

Three current models, one clear decision tree:

| Model | API ID | Context | Max output | Price (input / output per MTok) |
|---|---|---|---|---|
| Claude Opus 4.8 | claude-opus-4-8 | 1M tokens | 128k tokens | $5 / $25 |
| Claude Sonnet 4.6 | claude-sonnet-4-6 | 1M tokens | 64k tokens | $3 / $15 |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | 200k tokens | 64k tokens | $1 / $5 |

One note on naming: starting with the 4.6 generation, model IDs dropped the date suffix. claude-sonnet-4-6 is still a pinned snapshot — it will not silently change under you. It is not an evergreen pointer.

Two older model IDs retire on June 15, 2026: claude-sonnet-4-20250514 and claude-opus-4-20250514. If your code uses either, migrate now to claude-sonnet-4-6 or claude-opus-4-8 respectively.

Which to use:

  • Haiku 4.5 for classification, extraction, and any task where you send thousands of requests and need cost control. It is the right model for the Batch API examples below.
  • Sonnet 4.6 for the majority of real work — code generation, analysis, agentic flows. Good balance of speed and quality.
  • Opus 4.8 for complex multi-step reasoning and high-autonomy agents where quality is the constraint, not cost.
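To make the price gap between the three tiers concrete, here is a minimal sketch that turns the table above into a per-request estimate. The helper name `estimateCostUsd` and the example workload are illustrative assumptions; the prices are the ones listed in this article and cover standard (non-batch, uncached) requests only.

```typescript
// Prices in USD per million tokens, copied from the model table in this article.
const PRICES = {
  "claude-opus-4-8": { input: 5, output: 25 },
  "claude-sonnet-4-6": { input: 3, output: 15 },
  "claude-haiku-4-5-20251001": { input: 1, output: 5 },
} as const;

type ModelId = keyof typeof PRICES;

// Hypothetical helper: estimated USD cost of one uncached, non-batch request.
function estimateCostUsd(model: ModelId, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Assumed example workload: 10,000 input tokens and 1,000 output tokens per request.
for (const model of Object.keys(PRICES) as ModelId[]) {
  console.log(model, estimateCostUsd(model, 10_000, 1_000).toFixed(4));
}
```

For that workload the estimate comes to roughly $0.075 on Opus 4.8, $0.045 on Sonnet 4.6, and $0.015 on Haiku 4.5 per request, which is why high-volume jobs tend to start on Haiku.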

For benchmark data on when Opus-class models justify the cost premium, see Claude Opus 4.7 for Coding — When the Big Model Wins.

Install the SDK:

npm install @anthropic-ai/sdk
pip install anthropic

Prompt caching

The idea is simple: if you have a large block of context that does not change between requests — a system prompt, a document, a code file — you can mark it with cache_control and Anthropic stores a compressed snapshot of the processed tokens server-side. Subsequent requests that include the same prefix read from the cache at 10% of the normal input cost.

Minimum threshold: Varies by model. Opus 4.8 and Sonnet 4.6 require 1,024 tokens; Haiku 4.5 requires 4,096 tokens. Shorter than that, the request processes normally — no error, but no cache either. If you use Haiku 4.5 with sub-4,096-token prompts, you get zero cache hits regardless of the cache_control marker.
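Because a too-short prefix fails silently, a small pre-flight check can save debugging time. The sketch below encodes the thresholds stated above; `meetsCacheMinimum` is a hypothetical helper, and it assumes you already have a token count for the prefix (for example from a token-counting call or from the usage of an earlier response).

```typescript
// Minimum cacheable prefix lengths, as stated in this article.
const MIN_CACHE_TOKENS: Record<string, number> = {
  "claude-opus-4-8": 1024,
  "claude-sonnet-4-6": 1024,
  "claude-haiku-4-5-20251001": 4096,
};

// Hypothetical helper: true when a prefix of `prefixTokens` tokens is long enough
// to be cached on `model`. Unknown models fall back to the most conservative
// threshold listed above (an assumption, not documented behavior).
function meetsCacheMinimum(model: string, prefixTokens: number): boolean {
  const minimum = MIN_CACHE_TOKENS[model] ?? 4096;
  return prefixTokens >= minimum;
}

// A 2,000-token system prompt is cacheable on Sonnet 4.6 but not on Haiku 4.5.
console.log(meetsCacheMinimum("claude-sonnet-4-6", 2000));
console.log(meetsCacheMinimum("claude-haiku-4-5-20251001", 2000));
```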

Two TTL options:

| Type | cache_control | Write cost | Read cost |
|---|---|---|---|
| 5 minutes (default) | { "type": "ephemeral" } | 1.25× base input | 0.10× base input |
| 1 hour | { "type": "ephemeral", "ttl": "1h" } | 2.00× base input | 0.10× base input |

For most conversational uses, 5-minute TTL is fine. For batch workloads (which take longer to process), use 1-hour TTL.
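The write premium means caching only pays off once the prefix is reused. As a rough worked example, the sketch below compares the relative input cost of a cached prefix against sending it uncached every time. It assumes exactly one cache write followed by a hit on every later request, which is the best case; the function names are illustrative.

```typescript
// Multipliers from the TTL table above, relative to the base input price.
const WRITE_MULTIPLIER = { "5m": 1.25, "1h": 2.0 } as const;
const READ_MULTIPLIER = 0.1;

// Relative cost of sending the same cached prefix `requests` times,
// assuming one cache write and then a cache hit on every later request.
function cachedCost(requests: number, ttl: keyof typeof WRITE_MULTIPLIER): number {
  if (requests < 1) return 0;
  return WRITE_MULTIPLIER[ttl] + (requests - 1) * READ_MULTIPLIER;
}

// Without caching, every request pays 1× the base price for the prefix.
function uncachedCost(requests: number): number {
  return requests;
}

for (const n of [1, 2, 3, 10]) {
  console.log(n, uncachedCost(n), cachedCost(n, "5m").toFixed(2), cachedCost(n, "1h").toFixed(2));
}
```

Under these assumptions the 5-minute TTL is cheaper than no caching from the second request onward, and the 1-hour TTL from the third.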

Here is a multi-turn chat that caches a large knowledge base:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Must be ≥1,024 tokens for Opus 4.8/Sonnet 4.6, ≥4,096 for Haiku 4.5
const KNOWLEDGE_BASE = `
You are a senior TypeScript engineer. You have deep knowledge of:
- Node.js ecosystem and async patterns
- REST API design and OpenAPI specifications
- Database query optimization and ORM usage
... (imagine many pages of documentation, code style guides, etc.)
`.repeat(50); // artificially extended for demo

async function chat(
  userMessage: string,
  history: Anthropic.MessageParam[] = []
): Promise<{ reply: string; cacheStats: object }> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: KNOWLEDGE_BASE,
        cache_control: { type: "ephemeral" }, // cache the large knowledge base
      },
    ],
    messages: [...history, { role: "user", content: userMessage }],
  });

  const { input_tokens, cache_creation_input_tokens, cache_read_input_tokens, output_tokens } =
    response.usage;

  return {
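    // Explicit type guard so the content-block union narrows on older TypeScript versions too
    reply: response.content
      .filter((b): b is Anthropic.TextBlock => b.type === "text")
      .map((b) => b.text)
      .join(""),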
    cacheStats: {
      uncached: input_tokens,
      written: cache_creation_input_tokens, // non-zero on first call
      read: cache_read_input_tokens,         // non-zero on subsequent calls
      output: output_tokens,
    },
  };
}

// First call — writes the cache, pays 1.25× on the cached portion
const first = await chat("Explain async/await error handling patterns.");
console.log("First call stats:", first.cacheStats);

// Second call — reads from cache at 10% of base price
const second = await chat("Now show me a retry wrapper example.");
console.log("Second call stats:", second.cacheStats);

Gotcha that trips most people: input_tokens in the response usage does not mean total input once caching is active. It counts only the tokens after the last cache breakpoint. Compute real costs with:

total_input = cache_read_input_tokens + cache_creation_input_tokens + input_tokens
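The same sum, plus a cost estimate, can be written as a pair of small helpers. This is a sketch rather than SDK code: the `CacheUsage` interface mirrors only the three usage fields discussed here, and `inputCostUsd` assumes the 5-minute TTL multipliers (1.25× write, 0.10× read).

```typescript
// Subset of the response usage object that matters for input accounting.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens: number | null;
  cache_read_input_tokens: number | null;
}

// Total input tokens actually processed for the request.
function totalInputTokens(usage: CacheUsage): number {
  return (
    usage.input_tokens +
    (usage.cache_creation_input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0)
  );
}

// Input cost in USD, assuming 5-minute TTL pricing (1.25× write, 0.10× read).
function inputCostUsd(usage: CacheUsage, basePricePerMTok: number): number {
  const weightedTokens =
    usage.input_tokens +
    1.25 * (usage.cache_creation_input_tokens ?? 0) +
    0.1 * (usage.cache_read_input_tokens ?? 0);
  return (weightedTokens * basePricePerMTok) / 1_000_000;
}

// Example: a cache hit on a 10,000-token prefix plus 50 uncached tokens.
const hit: CacheUsage = { input_tokens: 50, cache_creation_input_tokens: 0, cache_read_input_tokens: 10_000 };
console.log(totalInputTokens(hit), inputCostUsd(hit, 3).toFixed(5));
```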

Gotcha two: place the cache_control marker on the last stable block. If you accidentally cache a block that includes a timestamp or a per-request user ID, the prefix hash changes every request and you get 0 cache hits. Cache the static part; keep the dynamic part outside the cached region.
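One way to keep dynamic values out of the cached region is to split the system prompt into two blocks and mark only the static one. The sketch below builds such a `system` array; `buildSystem` and its parameters are hypothetical, and the local `SystemBlock` type only approximates the shape used in the chat example above.

```typescript
// Local approximation of a system text block with an optional cache marker.
interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl?: "1h" };
}

// Hypothetical builder: static instructions are cached, per-request details are not.
function buildSystem(staticInstructions: string, userId: string, now: Date): SystemBlock[] {
  return [
    // Stable across requests, so the cached prefix stays byte-identical.
    { type: "text", text: staticInstructions, cache_control: { type: "ephemeral" } },
    // Changes every request, so it sits after the cache breakpoint.
    { type: "text", text: `Current user: ${userId}\nCurrent time: ${now.toISOString()}` },
  ];
}

const system = buildSystem("You are a senior TypeScript engineer.", "user-42", new Date());
console.log(system.map((block) => Boolean(block.cache_control)));
```

The resulting array can be passed as the `system` parameter in place of the single cached block shown earlier.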

Cache invalidation cascades downward: changing a tool definition invalidates everything below it (system, messages). Changing only a user message does not invalidate the system cache.
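The cascade can be pictured as an ordered list where a change at one level invalidates that level and everything after it. The following is a simplified model of that rule for reasoning about your own requests, not a description of how the service implements it; `invalidatedBy` is an illustrative name.

```typescript
// Order in which a request's cacheable parts are processed.
const CACHE_ORDER = ["tools", "system", "messages"] as const;
type CacheLevel = (typeof CACHE_ORDER)[number];

// Simplified model: a change at `changed` invalidates that level and all later ones.
function invalidatedBy(changed: CacheLevel): CacheLevel[] {
  return CACHE_ORDER.slice(CACHE_ORDER.indexOf(changed));
}

console.log(invalidatedBy("tools"));    // tools, system, and messages caches are lost
console.log(invalidatedBy("messages")); // only the messages cache is lost
```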


Tool use

Tool use turns Claude from a text generator into an agent. You describe functions with JSON Schema, Claude decides when to call them, your code executes them, and you return the result. The loop repeats until a response comes back with stop_reason set to end_turn.

Define a tool

const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description:
      "Fetch current weather conditions for a city. Returns temperature in Celsius and a " +
      "short description. Use when the user asks about weather or needs to know current conditions.",
    input_schema: {
      type: "object" as const,
      properties: {
        city: {
          type: "string",
          description: "The city name, e.g. 'Tokyo', 'London', or 'San Francisco, CA'",
        },
      },
      required: ["city"],
    },
  },
];

Tool names must match ^[a-zA-Z0-9_-]{1,64}$. The description is the most important field — Claude reads it to decide whether to invoke the tool, so be specific about when to use it and when not to.
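Checking names locally before sending a request is cheap. The sketch below applies the pattern quoted above to a list of candidate names; `invalidToolNames` is a hypothetical helper.

```typescript
// Pattern for tool names, as quoted in this article.
const TOOL_NAME_PATTERN = /^[a-zA-Z0-9_-]{1,64}$/;

// Returns the names that would be rejected by the pattern.
function invalidToolNames(names: string[]): string[] {
  return names.filter((name) => !TOOL_NAME_PATTERN.test(name));
}

// Spaces, dots, and names longer than 64 characters all fail the check.
const rejected = invalidToolNames(["get_weather", "get weather", "lookup.user", "a".repeat(65)]);
console.log(rejected.length); // 3
```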

The agentic loop

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function executeTool(name: string, input: Record<string, unknown>): Promise<string> {
  if (name === "get_weather") {
    const city = input.city as string;
    // call a real weather API here in production
    return JSON.stringify({ city, temp_celsius: 18, description: "Overcast with light rain" });
  }
  throw new Error(`Unknown tool: ${name}`);
}

async function runAgent(userMessage: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      tools,
      messages,
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") {
      const textBlock = response.content.find((b) => b.type === "text");
      return textBlock?.type === "text" ? textBlock.text : "";
    }

    if (response.stop_reason === "tool_use") {
      // IMPORTANT: tool_result blocks must come FIRST in the user message content array
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type !== "tool_use") continue;

        let content: string;
        try {
          content = await executeTool(block.name, block.input as Record<string, unknown>);
        } catch (err) {
          content = `Error: ${(err as Error).message}`;
          toolResults.push({ type: "tool_result", tool_use_id: block.id, content, is_error: true });
          continue;
        }

        toolResults.push({ type: "tool_result", tool_use_id: block.id, content });
      }

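      messages.push({ role: "user", content: toolResults });
      continue;
    }

    // Any other stop_reason (for example "max_tokens") would loop forever; fail loudly instead
    throw new Error(`Unexpected stop_reason: ${response.stop_reason}`);
  }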
}

const answer = await runAgent("What's the weather like in Tokyo right now?");
console.log(answer);

The ordering rule that causes 400 errors: when you push tool_result blocks back to Claude, they must come first in the user message content array, before any text blocks. Violating this returns a 400 invalid_request_error. Other LLM APIs do not have this constraint — it catches Claude newcomers every time.
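If your code ever mixes follow-up text with tool results in one user message, a small normalizer keeps the order valid. This is a sketch with locally defined block types that approximate the SDK shapes; `orderUserContent` is a hypothetical helper.

```typescript
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };
type TextBlock = { type: "text"; text: string };
type UserContentBlock = ToolResultBlock | TextBlock;

// Moves every tool_result block ahead of other blocks, preserving relative order.
function orderUserContent(blocks: UserContentBlock[]): UserContentBlock[] {
  const results = blocks.filter((b) => b.type === "tool_result");
  const rest = blocks.filter((b) => b.type !== "tool_result");
  return [...results, ...rest];
}

const content = orderUserContent([
  { type: "text", text: "Also, what should I wear?" },
  { type: "tool_result", tool_use_id: "toolu_example", content: '{"temp_celsius":18}' },
]);
console.log(content.map((b) => b.type)); // tool_result first, then text
```

The `tool_use_id` value here is a placeholder; in a real loop it comes from the `id` of the matching tool_use block.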

Error handling: set is_error: true on the tool_result and include the error message as content. Claude will read it and decide whether to retry the tool call, use a fallback approach, or tell the user the tool failed. Do not throw in your agentic loop — catching and returning errors as tool results is what makes agents resilient.

Controlling tool choice:

  • { "type": "auto" } — Claude decides (default)
  • { "type": "any" } — Claude must call one of the provided tools
  • { "type": "tool", "name": "get_weather" } — force a specific tool
  • { "type": "none" } — tools defined but not callable this turn

Note: tool_choice: any and tool_choice: tool are incompatible with extended thinking.
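The four options and the extended-thinking restriction can be captured in a small type and guard. This is an illustrative sketch: the `ToolChoice` union mirrors the list above, and `checkToolChoice` is a hypothetical client-side check, not an SDK function.

```typescript
// The four tool_choice shapes listed above.
type ToolChoice =
  | { type: "auto" }
  | { type: "any" }
  | { type: "tool"; name: string }
  | { type: "none" };

// Hypothetical guard: returns an error message, or null when the combination is allowed.
function checkToolChoice(choice: ToolChoice, thinkingEnabled: boolean): string | null {
  if (thinkingEnabled && (choice.type === "any" || choice.type === "tool")) {
    return `tool_choice "${choice.type}" cannot be combined with extended thinking`;
  }
  return null;
}

console.log(checkToolChoice({ type: "tool", name: "get_weather" }, true));
console.log(checkToolChoice({ type: "auto" }, true));
```

Running a check like this before the request turns a server-side rejection into a clear local error.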

Once your agent loop is running in production, see "The real cost of running an AI agent team in 2026", which breaks down the full TCO, including the oversight labor and retry waste that most cost estimates skip.


Message Batches API

The Batches API is for work where you do not need a live response — classification jobs, document processing pipelines, large-scale evaluation runs. You submit requests, poll for completion, and stream results. In exchange for the async constraint, you pay 50% of standard pricing.

| Model | Batch input | Batch output |
|---|---|---|
| Opus 4.8 | $2.50/MTok | $12.50/MTok |
| Sonnet 4.6 | $1.50/MTok | $7.50/MTok |
| Haiku 4.5 | $0.50/MTok | $2.50/MTok |

Limits: up to 100,000 requests per batch, 256 MB max size, 24-hour processing window, results available for 29 days.
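If a job can exceed those limits, it has to be split across several batches. The sketch below does this generically: `splitIntoBatches` is a hypothetical helper, the caller supplies a `sizeOf` function (for example the byte length of the serialized request), and the 256 MB limit is read conservatively as 256,000,000 bytes, which is an assumption.

```typescript
// Splits items into groups that respect a maximum count and a maximum total size.
// A single item larger than maxBytes still gets its own group; it is not rejected here.
function splitIntoBatches<T>(
  items: T[],
  sizeOf: (item: T) => number,
  maxCount = 100_000,
  maxBytes = 256_000_000
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;

  for (const item of items) {
    const bytes = sizeOf(item);
    const wouldOverflow = current.length >= maxCount || currentBytes + bytes > maxBytes;
    if (current.length > 0 && wouldOverflow) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(item);
    currentBytes += bytes;
  }

  if (current.length > 0) batches.push(current);
  return batches;
}

// Small limits make the behavior visible: five 10-byte items, at most 2 per batch.
console.log(splitIntoBatches([1, 2, 3, 4, 5], () => 10, 2, 1000));
```

Each resulting group can then be submitted as its own batch, with results merged by custom_id afterwards.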

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

interface ReviewItem {
  id: string;
  text: string;
}

async function classifyReviewsBatch(reviews: ReviewItem[]): Promise<Record<string, string>> {
  // 1. Create the batch
  const batch = await client.messages.batches.create({
    requests: reviews.map((review) => ({
      custom_id: review.id,
      params: {
        model: "claude-haiku-4-5",
        max_tokens: 10,
        messages: [
          {
            role: "user" as const,
            content:
              `Classify this product review as POSITIVE, NEGATIVE, or NEUTRAL.\n` +
              `Reply with only the label.\n\nReview: "${review.text}"`,
          },
        ],
      },
    })),
  });

  console.log(`Batch created: ${batch.id} (${reviews.length} requests)`);

  // 2. Poll for completion
  let current = batch;
  while (current.processing_status !== "ended") {
    await new Promise((r) => setTimeout(r, 60_000)); // 60-second poll interval
    current = await client.messages.batches.retrieve(batch.id);
    const { processing, succeeded, errored, expired } = current.request_counts;
    console.log(
      `Status: ${current.processing_status} — processing: ${processing}, ` +
      `succeeded: ${succeeded}, errored: ${errored}, expired: ${expired}`
    );
  }

  // 3. Stream results (memory-efficient; order is NOT guaranteed)
  const sentimentMap: Record<string, string> = {};

  for await (const result of await client.messages.batches.results(batch.id)) {
    switch (result.result.type) {
      case "succeeded": {
        const content = result.result.message.content[0];
        sentimentMap[result.custom_id] =
          content.type === "text" ? content.text.trim() : "UNKNOWN";
        break;
      }
      case "errored":
        console.error(`Request ${result.custom_id} failed:`, result.result.error);
        sentimentMap[result.custom_id] = "ERROR";
        break;
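      case "expired":
        console.warn(`Request ${result.custom_id} expired (24-hour window)`);
        sentimentMap[result.custom_id] = "EXPIRED";
        break;
      case "canceled":
        // Batch was canceled before this request ran; record it so no ID goes missing
        sentimentMap[result.custom_id] = "CANCELED";
        break;
    }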
  }

  return sentimentMap;
}

const reviews: ReviewItem[] = [
  { id: "review-001", text: "Amazing product, exceeded all my expectations!" },
  { id: "review-002", text: "Stopped working after two weeks. Very disappointed." },
  { id: "review-003", text: "It does what it says on the box." },
  // ... thousands more
];

const results = await classifyReviewsBatch(reviews);
console.log(results);
// { "review-001": "POSITIVE", "review-002": "NEGATIVE", "review-003": "NEUTRAL" }

The ordering rule for batches: results come back in an arbitrary order, not the order you submitted them. Always match results to requests using custom_id. custom_id must match ^[a-zA-Z0-9_-]{1,64}$ and be unique within the batch.
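Both rules are easy to enforce in code. The sketch below validates IDs before submission and puts results back into submission order afterwards; the helper names are illustrative, and the results are assumed to have been collected into a `Map` keyed by custom_id.

```typescript
// Pattern for custom_id values, as quoted above.
const CUSTOM_ID_PATTERN = /^[a-zA-Z0-9_-]{1,64}$/;

// Returns a list of problems; an empty list means the IDs are safe to submit.
function checkCustomIds(ids: string[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const id of ids) {
    if (!CUSTOM_ID_PATTERN.test(id)) problems.push(`invalid custom_id: ${id}`);
    if (seen.has(id)) problems.push(`duplicate custom_id: ${id}`);
    seen.add(id);
  }
  return problems;
}

// Re-orders results to match submission order; missing results become undefined.
function inSubmissionOrder<T>(submittedIds: string[], results: Map<string, T>): (T | undefined)[] {
  return submittedIds.map((id) => results.get(id));
}

console.log(checkCustomIds(["review-001", "review 002", "review-001"]));
console.log(inSubmissionOrder(["a", "b"], new Map([["b", "NEGATIVE"], ["a", "POSITIVE"]])));
```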

Result types you will see:

  • succeeded — full Messages API response under .message
  • errored — check .error.error.type: invalid_request_error means fix your request; server errors are worth retrying
  • canceled — user canceled; not billed
  • expired — 24-hour window elapsed; not billed
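These four outcomes map naturally onto a retry decision. The sketch below is one possible policy, not an official one: the `BatchOutcome` type approximates the result shapes described above, and treating expired requests as resubmittable and canceled ones as final are assumptions you may want to change.

```typescript
// Approximate shapes of the four batch result types.
type BatchOutcome =
  | { type: "succeeded" }
  | { type: "errored"; error: { error: { type: string } } }
  | { type: "canceled" }
  | { type: "expired" };

// Hypothetical policy: should this request go into a follow-up batch unchanged?
function shouldResubmit(outcome: BatchOutcome): boolean {
  switch (outcome.type) {
    case "succeeded":
      return false;
    case "errored":
      // invalid_request_error needs a code fix first; other errors are worth retrying.
      return outcome.error.error.type !== "invalid_request_error";
    case "canceled":
      return false; // assumption: cancellation was deliberate
    case "expired":
      return true; // assumption: the work is still wanted
  }
}

console.log(shouldResubmit({ type: "errored", error: { error: { type: "api_error" } } }));
console.log(shouldResubmit({ type: "errored", error: { error: { type: "invalid_request_error" } } }));
```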

Combining caching with batches: the two discounts stack. Add cache_control to your system prompt and you can get both 50% batch pricing and up to 90% cache savings. Cache hit rate on async batches is 30–98% — best-effort, not guaranteed, because requests process concurrently on separate workers. Use 1-hour TTL ({ "type": "ephemeral", "ttl": "1h" }) for large batches to maximize hit rates.
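Because the hit rate is not guaranteed, it helps to see how the combined discount moves with it. The worked sketch below computes an effective multiplier for the cached prefix only, relative to the standard input price. It assumes every cache miss is billed as a cache write, which is a pessimistic simplification; `batchCachedMultiplier` is an illustrative name.

```typescript
// Effective price multiplier for the cached prefix in a batch, relative to the
// standard (non-batch, uncached) input price.
// Assumption: every cache miss is billed at the cache-write multiplier.
function batchCachedMultiplier(hitRate: number, writeMultiplier: number): number {
  if (hitRate < 0 || hitRate > 1) throw new RangeError("hitRate must be between 0 and 1");
  const BATCH_DISCOUNT = 0.5;  // batch pricing is 50% of standard
  const READ_MULTIPLIER = 0.1; // cache reads cost 10% of base input
  return BATCH_DISCOUNT * (hitRate * READ_MULTIPLIER + (1 - hitRate) * writeMultiplier);
}

// 1-hour TTL (2.0× write) at the two ends of the documented hit-rate range.
console.log(batchCachedMultiplier(0.98, 2.0).toFixed(3)); // 0.069
console.log(batchCachedMultiplier(0.3, 2.0).toFixed(3));  // 0.715
```

Under this assumption a 98% hit rate brings the prefix down to about 7% of the standard price, while a 30% hit rate costs more than batch pricing alone (0.5), which is a reason to measure real hit rates before committing to a cost model.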

Extended output (300k tokens per request): add the header anthropic-beta: output-300k-2026-03-24 to your batch requests to unlock up to 300k output tokens. This is batch-only; the synchronous API cap stays at 64k–128k depending on model. Supported on Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6. Not available on Bedrock, Vertex AI, or Microsoft Foundry.


Things that have changed since Claude 3.x

If you are migrating from Claude 3.x code, these are the most common surprises:

Context windows are now enormous. Opus 4.8 and Sonnet 4.6 have a 1M-token context window. Code that chunks documents to fit a much smaller window, such as 8k or 32k tokens, often no longer needs to.

Model IDs are dateless but still pinned. claude-sonnet-4-6 is not an evergreen alias — it pins a specific snapshot. The date is absent from the string; the snapshot is not.

input_tokens is not what you think. Once you add caching, input_tokens counts only uncached tokens. Total input is the sum of all three usage fields.

1-hour cache TTL is new. Claude 3.x only had 5-minute TTL. For batch workloads that take more than 5 minutes to process, use "ttl": "1h" — otherwise your cache expires before the batch completes.

strict: true on tool definitions. New in the 4.x SDK. Guarantees Claude’s tool inputs always conform to your schema. Worth setting in production.

Caveats

Minimum token thresholds differ by model. Opus 4.8 and Sonnet 4.6 require 1,024 tokens; Haiku 4.5 requires 4,096 tokens. If you want a single safe threshold that works regardless of which Claude 4.x model you target, use 4,096 tokens.

Batch cache hit rates are unpredictable. 30–98% is the documented range. Budget for the worst case when estimating costs for batch + caching combined.

No Anthropic affiliate program. Sign up for API access at console.anthropic.com. The pricing numbers in this article are as of June 2026 — always check the current pricing page before building a cost model.
