
How to Set Up Cloudflare Workers AI: Step-by-Step Guide

Run inference at the edge with Workers AI: scaffold a Worker, bind the AI, call models, stream SSE, generate images. Includes pricing and rate limits.


Workers AI lets you call 78 open-source models from a Cloudflare Worker with a single env.AI.run() call. There are no GPUs to provision, no separate inference backend to keep warm, and no extra infrastructure bill. If your logic already lives in a Worker, adding AI inference takes about ten lines of TypeScript.

This guide takes you from zero to a working Worker that generates text, streams responses, and handles embeddings and image generation. Each step shows the failure mode so you don’t have to discover it in production.

Who this is for

TypeScript developers who want edge-co-located inference without standing up an AI backend. You need a Cloudflare account and basic CLI skills. If you need frontier reasoning quality — GPT-4o, Claude — you’ll need to call those APIs directly; Workers AI doesn’t host proprietary models. Jump to the comparison section if you’re evaluating first. If you’re also weighing serverless options, Cloudflare Workers vs AWS Lambda covers where each one wins.

What models are available

78 models across text generation, embeddings, image generation, speech, and translation. Notable text-generation options:

Model | Context | Notes
@cf/moonshot/kimi-k2.6 | 262,100 tokens | Multi-turn tool calling, vision inputs
@cf/zhipuai/glm-4.7-flash | 131,072 tokens | Multilingual, function calling
@cf/google/gemma-3-12b-it | 80,000 tokens | Multimodal, 140+ languages
@cf/mistralai/mistral-small-3.1-24b-instruct | 128,000 tokens | Vision, function calling
@cf/meta/llama-3.3-70b-instruct-fp8-fast | 24,000 tokens | FP8 quantized, speed-optimized
@cf/meta/llama-3.1-8b-instruct-fast | 128,000 tokens | Solid starter model for most tasks
@cf/qwen/qwq-32b | not listed | Reasoning specialist

Full catalog: developers.cloudflare.com/workers-ai/models/

Step 1: Check prerequisites

You need a Cloudflare account and Node.js 16.17 or later. Check your installed version:

node -v

Failure mode: Node below 16.17 causes Wrangler to fail at install with a cryptic npm error. Verify the version before you start.
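
If you would rather enforce this check from a setup script than by eye, the comparison could be sketched as follows. The function names are illustrative and not part of Wrangler; the 16.17 minimum is taken from the failure mode above.

```typescript
// Minimum Node.js version named in this guide: [major, minor].
const MIN_NODE: [number, number] = [16, 17];

// Parse "v18.19.0" or "18.19.0" into [major, minor]; null if unparseable.
function parseMajorMinor(version: string): [number, number] | null {
  const match = /^v?(\d+)\.(\d+)/.exec(version.trim());
  if (!match) return null;
  return [Number(match[1]), Number(match[2])];
}

// True when the version is at or above the minimum.
function meetsMinimumNode(
  version: string,
  min: [number, number] = MIN_NODE
): boolean {
  const parsed = parseMajorMinor(version);
  if (!parsed) return false;
  const [major, minor] = parsed;
  return major > min[0] || (major === min[0] && minor >= min[1]);
}
```

In a Node script you would pass process.version (the same string node -v prints) and exit early when the function returns false.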

Step 2: Scaffold the project

C3 (Create Cloudflare CLI) is the recommended scaffolder:

npm create cloudflare@latest -- hello-ai

At the interactive prompts, select:

  • Template: Hello World example
  • Project type: Worker only
  • Language: TypeScript
  • Git: Yes
  • Deploy now: No

Then change into the new directory:

cd hello-ai

Your project structure after scaffold:

hello-ai/
├── wrangler.toml
├── src/
│   └── index.ts
├── package.json
└── tsconfig.json

Failure mode: If npm create cloudflare hangs, your npm registry may be slow. Try npx create-cloudflare@latest -- hello-ai directly.

Step 3: Bind the AI service

Workers AI is accessed through a binding declared in your project config. Open wrangler.toml and add the [ai] block:

name = "hello-ai"
main = "src/index.ts"
compatibility_date = "2025-12-01"

[ai]
binding = "AI"

If your project uses wrangler.jsonc instead:

{
  "name": "hello-ai",
  "main": "src/index.ts",
  "compatibility_date": "2025-12-01",
  "ai": {
    "binding": "AI"
  }
}

The binding name "AI" is conventional but arbitrary: whatever you choose becomes the property name on env (env.AI here). If you rename it, use the same name in your Worker code and in the Env interface.

Failure mode: Omitting the [ai] block makes env.AI undefined at runtime. You get Cannot read properties of undefined (reading 'run'). The first thing to check when that error appears: your wrangler config.
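
One way to make that failure easier to diagnose is to check for the binding before calling it. The sketch below is an assumption about how you might structure such a guard, not a Workers AI API: requireAiBinding and RunnableBinding are hypothetical names, and the structural type stands in for the real Ai type so the snippet is self-contained.

```typescript
// Minimal structural stand-in for the binding: anything with a run() method.
interface RunnableBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

// Return the binding if it is present; otherwise throw an error that
// points at the likely cause instead of "Cannot read properties of undefined".
function requireAiBinding(
  env: Record<string, unknown>,
  name = 'AI'
): RunnableBinding {
  const binding = env[name];
  if (
    typeof binding !== 'object' ||
    binding === null ||
    typeof (binding as { run?: unknown }).run !== 'function'
  ) {
    throw new Error(
      `Binding "${name}" is missing. Check the [ai] block in your wrangler config.`
    );
  }
  return binding as RunnableBinding;
}
```

Calling the guard at the top of your fetch handler turns the opaque runtime error into a message that names the config file to inspect.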

Step 4: Make a first inference call

Replace the contents of src/index.ts:

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct-fast', {
      prompt: 'What is the origin of the phrase Hello, World?',
    });

    return Response.json(response);
  },
} satisfies ExportedHandler<Env>;

Run locally:

npx wrangler dev

Open http://localhost:8787. The response is a JSON object with a response field containing the model’s output.
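
If you want to return only the generated text rather than the whole JSON object, a small type guard can pull out the response field. This is a sketch that assumes the { response: string } shape described above; extractText is a hypothetical helper name.

```typescript
// Pull the generated text out of a text-generation result.
// Returns null when the value does not have a string `response` field.
function extractText(result: unknown): string | null {
  if (typeof result !== 'object' || result === null) return null;
  const response = (result as { response?: unknown }).response;
  return typeof response === 'string' ? response : null;
}
```

Returning null instead of throwing lets the handler decide how to respond when a model returns a different shape.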

Critical: Wrangler’s local dev mode routes AI calls to Cloudflare’s real infrastructure — not a local emulator. These requests count against your neuron quota and rate limits.

Deploy when you’re ready:

npx wrangler login   # opens a browser for OAuth
npx wrangler deploy

Your Worker is live at https://hello-ai.<YOUR_SUBDOMAIN>.workers.dev.

Failure mode: wrangler login requires a browser for the OAuth flow. In headless CI environments, set the CLOUDFLARE_API_TOKEN environment variable instead and skip the login step.

Step 5: Stream responses

Add stream: true to the options. The response becomes a Server-Sent Events stream:

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const stream = await env.AI.run(
      '@cf/meta/llama-3.1-8b-instruct-fast',
      {
        prompt: 'Explain edge computing in simple terms.',
        stream: true,
      }
    );

    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
} satisfies ExportedHandler<Env>;

Browser-side consumption:

const source = new EventSource('/');

source.onmessage = (event) => {
  if (event.data === '[DONE]') {
    source.close();
    return;
  }
  const data = JSON.parse(event.data);
  // Use textContent, not innerHTML: model output is untrusted and should not be parsed as HTML.
  document.getElementById('output').textContent += data.response;
};

Each SSE event carries a partial token in data.response. The stream ends with a [DONE] sentinel.
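
EventSource handles the framing for you, but if you read the stream with fetch instead (for example, to send a POST body), you have to split the events yourself. The sketch below assumes the format described above: data: lines carrying JSON with a response field, ending with a [DONE] sentinel. parseSseChunk and its result type are hypothetical names.

```typescript
interface SseParseResult {
  text: string;      // concatenated data.response fragments found in this chunk
  done: boolean;     // true once the [DONE] sentinel was seen
  remainder: string; // trailing partial line to prepend to the next chunk
}

function parseSseChunk(buffer: string): SseParseResult {
  const lines = buffer.split('\n');
  // The last element is '' (buffer ended on a newline) or an incomplete line.
  const remainder = lines.pop() ?? '';
  let text = '';
  let done = false;

  for (const rawLine of lines) {
    const line = rawLine.trim();
    if (!line.startsWith('data:')) continue; // skip blank separators and other fields
    const payload = line.slice('data:'.length).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === 'string') text += parsed.response;
    } catch {
      // Ignore malformed payloads rather than aborting the whole stream.
    }
  }
  return { text, done, remainder };
}
```

Feed each decoded chunk in with the previous remainder prepended, append text to your output, and stop reading once done is true.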

Failure mode: Omitting content-type: text/event-stream causes browsers to buffer the full response instead of streaming it. Separately, the Workers Paid plan defaults to a 30-second CPU time limit per request (per the Workers limits docs); for long-running generations, raise the limit in your wrangler config (up to 5 minutes).

Step 6: Add embeddings and image generation

Text embeddings

const embeddings = await env.AI.run('@cf/baai/bge-m3', {
  text: [
    'This is a story about an orange cloud',
    'This is a story about a llama',
  ],
});
return Response.json(embeddings);

@cf/baai/bge-m3 is a strong general-purpose default: it is multilingual and works at multiple granularities. For English-only workloads, BGE Large (1024-dim), Base (768-dim), and Small (384-dim) are lighter options.

Input is a string or array of strings. Output is embedding vectors in the same order.

Failure mode: Sending a single string where an array is expected returns a single vector, not an error. Check your input shape if you’re getting unexpected vector counts.
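
A simple defence is to always send an array and verify the count that comes back. The sketch below assumes the vectors are available as an array of number arrays; the exact response field that holds them is not shown in this guide, so check the model's documentation. Both helper names are hypothetical.

```typescript
// Always send an array so the number of vectors returned is predictable.
function normalizeEmbeddingInput(input: string | string[]): string[] {
  return Array.isArray(input) ? input : [input];
}

// Throw if the number of vectors does not match the number of inputs.
function assertVectorCount(inputs: string[], vectors: number[][]): void {
  if (vectors.length !== inputs.length) {
    throw new Error(
      `Expected ${inputs.length} vectors, got ${vectors.length}`
    );
  }
}
```

Normalizing at the call site means downstream code can always index vectors by input position.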

Image generation

const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'a cyberpunk lizard',
  steps: 4,  // default: 4, max: 8 — higher steps = better quality, slower
});

// Return as base64 data URI
const dataURI = `data:image/jpeg;charset=utf-8;base64,${response.image}`;
return Response.json({ dataURI });

Or return the binary image directly:

const binaryString = atob(response.image);
const img = Uint8Array.from(binaryString, (m) => m.charCodeAt(0));
return new Response(img, {
  headers: { 'Content-Type': 'image/jpeg' },
});

flux-1-schnell returns a base64-encoded JPEG in response.image. For higher photorealism, @cf/black-forest-labs/flux-2-dev is available on the same API.

Failure mode: The steps parameter caps at 8. Exceeding it returns a 400 error. Setting it below 4 produces visibly degraded images.
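
If steps comes from user input, clamping it before the call avoids both problems. This is a minimal sketch using the bounds stated above; clampSteps is a hypothetical helper, and falling back to 4 for non-numeric input is a design choice rather than documented behaviour.

```typescript
const MIN_STEPS = 4; // below this, images degrade visibly (per the note above)
const MAX_STEPS = 8; // above this, the request is rejected with a 400

// Round to an integer and force the value into the [4, 8] range.
function clampSteps(requested: number): number {
  if (!Number.isFinite(requested)) return MIN_STEPS;
  return Math.min(MAX_STEPS, Math.max(MIN_STEPS, Math.round(requested)));
}
```

You would then pass steps: clampSteps(userValue) in the options object.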

Pricing and rate limits

Pricing

Tier | Quota | Cost
Free (any plan) | 10,000 Neurons/day | $0 (resets at 00:00 UTC)
Paid overage | Unlimited | $0.011 per 1,000 Neurons

Neurons per million tokens vary significantly by model. Examples:

Model | Input neurons/M tokens | Output neurons/M tokens
Llama 3.2 1B | 2,457 | 18,252
Llama 3.1 8B | 25,608 | 75,147
DeepSeek R1 Distill 32B | 45,170 | 443,756

At 10,000 neurons/day with Llama 3.1 8B (75,147 output neurons/M tokens), the free allocation covers roughly 133,000 output tokens per day before counting input tokens. That is enough for development and light use.
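
The same arithmetic can be written as a small helper so you can plug in other models. The sketch uses only the figures from the tables above; the function and type names are illustrative.

```typescript
const FREE_NEURONS_PER_DAY = 10_000;

interface NeuronRates {
  inputPerMillion: number;  // input neurons per million tokens
  outputPerMillion: number; // output neurons per million tokens
}

// Output tokens the daily free allocation covers, ignoring input tokens.
function freeOutputTokensPerDay(outputNeuronsPerMillion: number): number {
  return Math.floor(
    (FREE_NEURONS_PER_DAY / outputNeuronsPerMillion) * 1_000_000
  );
}

// Neurons consumed by one request with the given token counts.
function neuronsPerRequest(
  inputTokens: number,
  outputTokens: number,
  rates: NeuronRates
): number {
  return (
    (inputTokens * rates.inputPerMillion +
      outputTokens * rates.outputPerMillion) /
    1_000_000
  );
}

// Rates for Llama 3.1 8B from the table above.
const llama31_8b: NeuronRates = { inputPerMillion: 25_608, outputPerMillion: 75_147 };
const dailyOutputTokens = freeOutputTokensPerDay(llama31_8b.outputPerMillion); // roughly 133,000
```

Dividing 10,000 by neuronsPerRequest for a typical prompt gives a more realistic requests-per-day estimate, because it counts input tokens as well.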

Rate limits (requests per minute)

Task type | Default RPM
Text generation | 300
Text embeddings | 3,000 (BGE-Large: 1,500)
Text-to-image | 720
Speech recognition | 720
Translation | 720
Image classification | 3,000
Summarization | 1,500

Wrangler local dev requests count against these limits.
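
For a batch script or test harness that calls the API in a loop, a client-side check can keep you under these limits. The sketch below is a plain sliding-window counter and is an assumption about how you might throttle, not something Workers AI provides. It only works inside a single long-lived process, so it does not coordinate across separate Worker invocations.

```typescript
// Sliding-window check: may another request be sent at time `nowMs`?
// `sentAtMs` holds the timestamps (in milliseconds) of earlier requests.
function canSend(
  sentAtMs: number[],
  nowMs: number,
  limitPerMinute: number
): boolean {
  const windowStart = nowMs - 60_000;
  // Count only the requests that fall inside the last 60 seconds.
  const recent = sentAtMs.filter((t) => t > windowStart).length;
  return recent < limitPerMinute;
}
```

Before each call, test canSend(history, Date.now(), 300) for text generation, push the timestamp when you send, and wait briefly when it returns false.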

Sources: pricing docs, limits docs

When to use Cloudflare Workers AI vs an external API

Workers AI makes sense when your logic already lives in Cloudflare — you want inference co-located with your KV lookups, D1 queries, or R2 uploads, and you don’t want to pay egress to a separate AI provider.

Reach for Anthropic or OpenAI directly when:

  • You need frontier reasoning quality. Workers AI runs open-weight models — no GPT-4o, no Claude.
  • Your per-output-token cost at scale matters more than latency. At high volume, $0.011 per 1,000 neurons can exceed hosted-API pricing for output-heavy workloads, depending on the model.
  • You need guaranteed model availability. Cloudflare can deprecate or swap models; the official catalog doesn’t carry SLAs on specific model versions.
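
To compare costs on equal terms, convert a model's neuron rate into dollars per million tokens. This sketch uses the $0.011 per 1,000 neurons overage price and the output rates from the pricing section; the helper name is illustrative, and the result ignores the free daily allocation.

```typescript
const USD_PER_1000_NEURONS = 0.011;

// Dollars per million tokens for a given neurons-per-million-tokens rate.
function usdPerMillionTokens(neuronsPerMillionTokens: number): number {
  return (neuronsPerMillionTokens / 1000) * USD_PER_1000_NEURONS;
}

// Output rates from the pricing table.
const llamaOutputUsd = usdPerMillionTokens(75_147);     // about $0.83
const deepseekOutputUsd = usdPerMillionTokens(443_756); // about $4.88
```

Setting these figures against a hosted API's published per-token price shows where the output-heavy crossover falls for your model.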

Workers AI wins for:

  • Privacy-sensitive workloads: data stays on Cloudflare’s network, not a third-party LLM provider.
  • Low-latency edge inference: co-located with your Worker, no round trip to a separate region.
  • Simple tasks at free-tier scale: content moderation, summarization, embeddings for RAG over a Vectorize index.

Workers AI pairs well with Cloudflare D1 when you need a relational database in the same edge region — no egress fees between services.

The practical split: use Workers AI for inference that’s part of a Cloudflare-first backend. Use Anthropic or OpenAI when the quality ceiling matters more than the architecture.

REST API alternative

Workers AI is callable without a Worker, via the Cloudflare REST API:

curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct-fast \
  -H 'Authorization: Bearer {API_TOKEN}' \
  -d '{ "prompt": "Your question here" }'

Response:

{
  "result": { "response": "Model output here" },
  "success": true,
  "errors": [],
  "messages": []
}

Useful for prototyping, CI pipelines, or non-Worker backends. Same pricing as Worker-based calls.
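
From a TypeScript backend, the same call could be assembled as shown below. This is a sketch that mirrors the curl command and response envelope above; buildRunRequest and unwrapResponse are hypothetical helpers, and the account ID and token are placeholders you supply.

```typescript
interface RunRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Build the URL and fetch options for a text-generation call.
function buildRunRequest(
  accountId: string,
  apiToken: string,
  model: string,
  prompt: string
): RunRequest {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    init: {
      method: 'POST',
      headers: { Authorization: `Bearer ${apiToken}` },
      body: JSON.stringify({ prompt }),
    },
  };
}

// Shape of the response envelope shown above.
interface RunEnvelope {
  result?: { response?: string };
  success: boolean;
  errors: unknown[];
}

// Return the model output, or throw with the reported errors.
function unwrapResponse(envelope: RunEnvelope): string {
  const text = envelope.result?.response;
  if (!envelope.success || typeof text !== 'string') {
    throw new Error(`Workers AI call failed: ${JSON.stringify(envelope.errors)}`);
  }
  return text;
}
```

Pass url and init to fetch, parse the body as JSON, and hand it to unwrapResponse to get the text or a descriptive error.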

Sources