How to Set Up Cloudflare Workers AI: Step-by-Step Guide
Run inference at the edge with Workers AI: scaffold a Worker, bind the AI, call models, stream SSE, generate images. Includes pricing and rate limits.
By Ethan
Workers AI lets you call 78+ open-source models from a Cloudflare Worker with a single env.AI.run() call. There are no GPUs to provision, no separate inference backend to stand up or keep warm, and no second infrastructure bill. If your logic already lives in a Worker, adding AI inference takes about ten lines of TypeScript.
This guide takes you from zero to a working Worker that generates text, streams responses, and handles embeddings and image generation. Each step shows the failure mode so you don’t have to discover it in production.
Who this is for
TypeScript developers who want edge-co-located inference without standing up an AI backend. You need a Cloudflare account and basic CLI skills. If you need frontier reasoning quality — GPT-4o, Claude — you’ll need to call those APIs directly; Workers AI doesn’t host proprietary models. Jump to the comparison section if you’re evaluating first. If you’re also weighing serverless options, Cloudflare Workers vs AWS Lambda covers where each one wins.
What models are available
78 models across text generation, embeddings, image generation, speech, and translation. Notable text-generation options:
| Model | Context | Notes |
|---|---|---|
| @cf/moonshot/kimi-k2.6 | 262,100 tokens | Multi-turn tool calling, vision inputs |
| @cf/zhipuai/glm-4.7-flash | 131,072 tokens | Multilingual, function calling |
| @cf/google/gemma-3-12b-it | 80,000 tokens | Multimodal, 140+ languages |
| @cf/mistralai/mistral-small-3.1-24b-instruct | 128,000 tokens | Vision, function calling |
| @cf/meta/llama-3.3-70b-instruct-fp8-fast | 24,000 tokens | FP8 quantized, speed-optimized |
| @cf/meta/llama-3.1-8b-instruct-fast | 128,000 tokens | Solid starter model for most tasks |
| @cf/qwen/qwq-32b | — | Reasoning specialist |
Full catalog: developers.cloudflare.com/workers-ai/models/
Step 1: Check prerequisites
You need:
- A Cloudflare account (free tier works): dash.cloudflare.com/sign-up/workers-and-pages
- Node.js 16.17.0 or higher
node -v
Failure mode: Node below 16.17 causes Wrangler to fail at install with a cryptic npm error. Verify the version before you start.
Step 2: Scaffold the project
C3 (Create Cloudflare CLI) is the recommended scaffolder:
npm create cloudflare@latest -- hello-ai
At the interactive prompts, select:
- Template: Hello World example
- Project type: Worker only
- Language: TypeScript
- Git: Yes
- Deploy now: No
cd hello-ai
Your project structure after scaffold:
hello-ai/
├── wrangler.toml
├── src/
│ └── index.ts
├── package.json
└── tsconfig.json
Failure mode: If npm create cloudflare hangs, your npm registry may be slow. Try npx create-cloudflare@latest -- hello-ai directly.
Step 3: Bind the AI service
Workers AI is accessed through a binding declared in your project config. Open wrangler.toml and add the [ai] block:
name = "hello-ai"
main = "src/index.ts"
compatibility_date = "2025-12-01"
[ai]
binding = "AI"
If your project uses wrangler.jsonc instead:
{
  "name": "hello-ai",
  "main": "src/index.ts",
  "compatibility_date": "2025-12-01",
  "ai": {
    "binding": "AI"
  }
}
The binding name "AI" is conventional but arbitrary — it becomes env.AI in your Worker code. You can rename it.
Failure mode: Omitting the [ai] block makes env.AI undefined at runtime. You get Cannot read properties of undefined (reading 'run'). The first thing to check when that error appears: your wrangler config.
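One way to surface a missing binding early is a small guard that runs before the first env.AI.run() call. This is a minimal sketch, not part of the Workers API: the names requireAi, AiLike, and EnvLike are hypothetical and are defined here only so the snippet is self-contained. In a real Worker, the Ai type comes from your project's Worker types.

```typescript
// Hypothetical stand-in for the Workers AI binding type, so the sketch
// compiles on its own. A real project uses the `Ai` type instead.
interface AiLike {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical Env shape where the binding may be absent at runtime.
interface EnvLike {
  AI?: AiLike;
}

// Returns the binding, or throws an error that names the likely cause
// instead of "Cannot read properties of undefined (reading 'run')".
function requireAi(env: EnvLike): AiLike {
  if (!env.AI || typeof env.AI.run !== 'function') {
    throw new Error(
      'AI binding is missing: check the [ai] block (binding = "AI") in your wrangler config.'
    );
  }
  return env.AI;
}
```

Calling requireAi(env) at the top of the fetch handler turns the undefined-property error into a message that points straight at the config.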
Step 4: Make a first inference call
Replace the contents of src/index.ts:
export interface Env {
  AI: Ai;
}
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct-fast', {
      prompt: 'What is the origin of the phrase Hello, World?',
    });
    return Response.json(response);
  },
} satisfies ExportedHandler<Env>;
Run locally:
npx wrangler dev
Open http://localhost:8787. The response is a JSON object with a response field containing the model’s output.
Critical: Wrangler’s local dev mode routes AI calls to Cloudflare’s real infrastructure — not a local emulator. These requests count against your neuron quota and rate limits.
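If you want to return only the generated text rather than the whole JSON object, a small type guard can pull out the response field safely. The sketch below assumes the result shape described above (an object with a string response field); the names TextGenerationResult, isTextResult, and extractText are hypothetical helpers, not part of the Workers AI types.

```typescript
// Assumed result shape: an object with a string `response` field.
interface TextGenerationResult {
  response: string;
}

// Type guard: true only when `value` looks like a text-generation result.
function isTextResult(value: unknown): value is TextGenerationResult {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { response?: unknown }).response === 'string'
  );
}

// Returns the trimmed model text, or a fallback when the shape is unexpected
// (for example, when the call returned a stream instead of an object).
function extractText(value: unknown, fallback = ''): string {
  return isTextResult(value) ? value.response.trim() : fallback;
}
```

In the handler, this would let you write something like new Response(extractText(response)) for a plain-text reply.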
Deploy when you’re ready:
npx wrangler login # opens a browser for OAuth
npx wrangler deploy
Your Worker is live at https://hello-ai.<YOUR_SUBDOMAIN>.workers.dev.
Failure mode: wrangler login requires a browser for the OAuth flow. In headless CI environments, set the CLOUDFLARE_API_TOKEN environment variable instead and skip the login step.
Step 5: Stream responses
Add stream: true to the options. The response becomes a Server-Sent Events stream:
export interface Env {
  AI: Ai;
}
export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const stream = await env.AI.run(
      '@cf/meta/llama-3.1-8b-instruct-fast',
      {
        prompt: 'Explain edge computing in simple terms.',
        stream: true,
      }
    );
    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
} satisfies ExportedHandler<Env>;
Browser-side consumption:
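const source = new EventSource('/');
const output = document.getElementById('output');
source.onmessage = (event) => {
  // The final event is the [DONE] sentinel, which is not JSON.
  if (event.data === '[DONE]') {
    source.close();
    return;
  }
  const data = JSON.parse(event.data);
  // textContent keeps model output from being interpreted as HTML.
  output.textContent += data.response;
};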
Each SSE event carries a partial token in data.response. The stream ends with a [DONE] sentinel.
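Failure mode: Omitting content-type: text/event-stream breaks streaming in the browser. EventSource requires that content type and fails the connection without it, and other clients may buffer the full response instead of rendering tokens as they arrive. Separately, the Workers Paid plan defaults to a 30-second CPU time limit per request (Workers limits); if a long-running generation hits it, raise the limit in your wrangler config (up to 5 minutes).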
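EventSource only issues GET requests, so if you send the prompt in a POST body you will likely read the stream with fetch and parse the events yourself. The following is a minimal sketch of that parsing step, assuming events are separated by a blank line with "\n" line endings and carry the data: {"response": "..."} payload described above. The names parseSseBuffer and SseParseResult are hypothetical.

```typescript
interface SseParseResult {
  tokens: string[]; // text fragments found in complete events
  done: boolean; // true once the [DONE] sentinel was seen
  rest: string; // trailing partial event to prepend to the next chunk
}

// Parses whatever complete events are in `buffer` and hands back the remainder.
function parseSseBuffer(buffer: string): SseParseResult {
  const tokens: string[] = [];
  let done = false;
  // Events end with a blank line; the last piece may still be incomplete.
  const events = buffer.split('\n\n');
  const rest = events.pop() ?? '';
  for (const event of events) {
    for (const line of event.split('\n')) {
      if (!line.startsWith('data:')) continue;
      const payload = line.slice('data:'.length).trim();
      if (payload === '[DONE]') {
        done = true;
        continue;
      }
      try {
        const parsed = JSON.parse(payload) as { response?: unknown };
        if (typeof parsed.response === 'string') tokens.push(parsed.response);
      } catch {
        // Skip a malformed payload rather than abort the whole stream.
      }
    }
  }
  return { tokens, done, rest };
}
```

A caller would decode each chunk from response.body with a TextDecoder, append it to the previous rest, and pass the combined string back in until done is true.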
Step 6: Add embeddings and image generation
Text embeddings
const embeddings = await env.AI.run('@cf/baai/bge-m3', {
  text: [
    'This is a story about an orange cloud',
    'This is a story about a llama',
  ],
});
return Response.json(embeddings);
@cf/baai/bge-m3 is a strong general-purpose default: it is multilingual and multi-granularity. For English-only workloads, BGE Large (1024-dim), Base (768-dim), and Small (384-dim) are lighter options.
Input is a string or array of strings. Output is embedding vectors in the same order.
Failure mode: Sending a single string where an array is expected returns a single vector, not an error. Check your input shape if you’re getting unexpected vector counts.
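Two small helpers cover the usual next steps: normalizing the input so you always send an array, and comparing two returned vectors. This is a sketch under the assumption that you have already pulled the vectors out of the response as plain number arrays (the exact response field is not shown in this guide). The names toTextArray and cosineSimilarity are hypothetical.

```typescript
// Wraps a single string in an array so the output always has one vector per input.
function toTextArray(input: string | string[]): string[] {
  return Array.isArray(input) ? input : [input];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

Passing toTextArray(userInput) as the text field keeps the vector count predictable, and the dimension check catches vectors that came from different embedding models.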
Image generation
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'a cyberpunk lizard',
  steps: 4, // default: 4, max: 8 — higher steps = better quality, slower
});
// Return as base64 data URI
const dataURI = `data:image/jpeg;charset=utf-8;base64,${response.image}`;
return Response.json({ dataURI });
Or return the binary image directly:
const binaryString = atob(response.image);
const img = Uint8Array.from(binaryString, (m) => m.charCodeAt(0));
return new Response(img, {
  headers: { 'Content-Type': 'image/jpeg' },
});
flux-1-schnell returns a base64-encoded JPEG in response.image. For higher photorealism, @cf/black-forest-labs/flux-2-dev is available on the same API.
Failure mode: The steps parameter caps at 8. Exceeding it returns a 400 error. Setting it below 4 produces visibly degraded images.
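If the step count comes from user input, clamping it before the call avoids both problems. A minimal sketch, using the bounds stated above; the helper name clampSteps and the two constants are hypothetical.

```typescript
const MIN_USEFUL_STEPS = 4; // below this, images are visibly degraded (per the note above)
const MAX_STEPS = 8; // above this, the request is rejected with a 400

// Rounds to an integer and keeps the value inside [MIN_USEFUL_STEPS, MAX_STEPS].
function clampSteps(requested: number): number {
  if (!Number.isFinite(requested)) return MIN_USEFUL_STEPS;
  return Math.min(MAX_STEPS, Math.max(MIN_USEFUL_STEPS, Math.round(requested)));
}
```

You would then pass steps: clampSteps(userValue) in the options object instead of the raw value.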
Pricing and rate limits
Pricing
| Tier | Quota | Cost |
|---|---|---|
| Free (any plan) | 10,000 Neurons/day | $0 — resets at 00:00 UTC |
| Paid overage | Unlimited | $0.011 per 1,000 Neurons |
Neurons per million tokens vary significantly by model. Examples:
| Model | Input neurons/M tokens | Output neurons/M tokens |
|---|---|---|
| Llama 3.2 1B | 2,457 | 18,252 |
| Llama 3.1 8B | 25,608 | 75,147 |
| DeepSeek R1 Distill 32B | 45,170 | 443,756 |
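At 10,000 neurons/day with Llama 3.1 8B (75,147 output neurons/M tokens), you get roughly 133,000 output tokens per day on the free plan (10,000 ÷ 75,147 × 1,000,000 ≈ 133,000), before counting the neurons your input tokens consume. That is enough for development and light use.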
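The same arithmetic can be sketched as a few helpers for estimating usage with any model in the table. The rates and prices are the ones quoted in this section; the function and constant names are hypothetical.

```typescript
const FREE_NEURONS_PER_DAY = 10000; // free daily allocation quoted above
const USD_PER_1000_NEURONS = 0.011; // paid overage rate quoted above

// Neurons consumed by a request, given per-million-token rates for the model.
function neuronsUsed(
  inputTokens: number,
  outputTokens: number,
  inputRatePerM: number,
  outputRatePerM: number
): number {
  return (inputTokens * inputRatePerM + outputTokens * outputRatePerM) / 1000000;
}

// Output tokens that fit in a neuron budget when input tokens are ignored.
function outputTokensForBudget(neurons: number, outputRatePerM: number): number {
  return Math.floor((neurons / outputRatePerM) * 1000000);
}

// Overage cost in USD for one day's neurons beyond the free allocation.
function overageUsd(totalNeuronsForDay: number): number {
  const billable = Math.max(0, totalNeuronsForDay - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}

// Llama 3.1 8B on the free plan, output only.
const freeOutputTokens = outputTokensForBudget(FREE_NEURONS_PER_DAY, 75147); // 133072
```

Because input tokens also consume neurons, neuronsUsed gives a more realistic per-request figure than the output-only estimate.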
Rate limits (requests per minute)
| Task type | Default RPM |
|---|---|
| Text generation | 300 |
| Text embeddings | 3,000 (BGE-Large: 1,500) |
| Text-to-image | 720 |
| Speech recognition | 720 |
| Translation | 720 |
| Image classification | 3,000 |
| Summarization | 1,500 |
Wrangler local dev requests count against these limits.
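When a burst of requests runs into these limits, retrying with exponential backoff is a common way to recover. The sketch below is generic and assumes the failure surfaces as a rejected promise from whatever function you wrap; how Workers AI reports a rate-limit error is not covered in this guide. The names backoffDelayMs and withRetry, and the default delays, are hypothetical.

```typescript
// Exponential backoff without jitter: baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 250, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Calls `call` up to `maxAttempts` times, sleeping between failed attempts.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      // Only wait if another attempt will follow.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

A Worker could wrap an inference call as withRetry(() => env.AI.run(model, inputs)). Keep in mind that every retry is another request against the same quota.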
Sources: pricing docs, limits docs
When to use Cloudflare Workers AI vs an external API
Workers AI makes sense when your logic already lives in Cloudflare — you want inference co-located with your KV lookups, D1 queries, or R2 uploads, and you don’t want to pay egress to a separate AI provider.
Reach for Anthropic or OpenAI directly when:
- You need frontier reasoning quality. Workers AI runs open-weight models — no GPT-4o, no Claude.
- Your per-output-token cost at scale matters more than latency. At high volume, $0.011 per 1,000 neurons can exceed hosted-API pricing for output-heavy workloads, depending on the model.
- You need guaranteed model availability. Cloudflare can deprecate or swap models; the official catalog doesn’t carry SLAs on specific model versions.
Workers AI wins for:
- Privacy-sensitive workloads: data stays on Cloudflare’s network, not a third-party LLM provider.
- Low-latency edge inference: co-located with your Worker, no round trip to a separate region.
- Simple tasks at free-tier scale: content moderation, summarization, embeddings for RAG over a Vectorize index.
Workers AI pairs well with Cloudflare D1 when you need a relational database in the same edge region — no egress fees between services.
The practical split: use Workers AI for inference that’s part of a Cloudflare-first backend. Use Anthropic or OpenAI when the quality ceiling matters more than the architecture.
REST API alternative
Workers AI is callable without a Worker, via the Cloudflare REST API:
curl https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/meta/llama-3.1-8b-instruct-fast \
  -H 'Authorization: Bearer {API_TOKEN}' \
  -d '{ "prompt": "Your question here" }'
Response:
{
  "result": { "response": "Model output here" },
  "success": true,
  "errors": [],
  "messages": []
}
Useful for prototyping, CI pipelines, or non-Worker backends. Same pricing as Worker-based calls.
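From a non-Worker TypeScript backend, the same call can be made with fetch. The sketch below only builds the request, mirroring the curl example above; the helper name buildRunRequest and the RunRequest type are hypothetical, and the Content-Type header is an addition that the curl example does not set.

```typescript
interface RunRequest {
  url: string;
  init: { method: 'POST'; headers: Record<string, string>; body: string };
}

// Builds the URL and fetch options for a REST inference call.
function buildRunRequest(
  accountId: string,
  apiToken: string,
  model: string,
  prompt: string
): RunRequest {
  return {
    // The model ID goes into the path as-is, matching the curl example.
    url: `https://api.cloudflare.com/client/v4/accounts/${encodeURIComponent(accountId)}/ai/run/${model}`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiToken}`,
        'Content-Type': 'application/json', // assumption: not in the curl example
      },
      body: JSON.stringify({ prompt }),
    },
  };
}
```

You would send it with fetch(url, init) and read result.response from the parsed JSON, after checking that success is true.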