GitHub Models 2026 — free LLM API for developers reviewed
We tested GitHub Models' free-tier LLM API: rate limits, OpenAI compatibility, and whether 150 calls a day is enough for a real side project.
By toolchew
GitHub Models gives every GitHub account free access to GPT-4o, Llama, DeepSeek, and a dozen other models through an OpenAI-compatible API. Use it if you’re prototyping a side project and want LLM access without a credit card. Don’t use it for production — the free tier’s terms of service explicitly bar it, and when you move to pay-as-you-go, you’re often better off going to the model provider directly.
Who this is for
Solo developers and small teams building a proof-of-concept who want zero-friction LLM access inside the GitHub ecosystem. If you have any real user traffic or plan to automate more than a few dozen calls per day, the rate limits will stop you before lunch.
What we tested
GitHub Models as of June 2026: the free-tier rate limits documented at docs.github.com/en/github-models, the OpenAI-compatible inference endpoint at https://models.github.ai, PAT-based authentication, and the browser playground. All rate-limit figures come from the official prototyping docs — we note where the source applies a “subject to change” caveat.
GitHub Models is still in public preview as of June 2026. No GA announcement has been made. Factor that into production planning.
Model catalog
GitHub’s catalog covers 15+ models across five providers as of June 2026:
| Provider | Models |
|---|---|
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-5 |
| Microsoft | Phi-4, Phi-4 Mini Instruct, Phi-4 Multimodal Instruct |
| Meta | Llama-4 Maverick 17B, Llama-3.3-70B |
| DeepSeek | DeepSeek-R1, DeepSeek-R1-0528, DeepSeek-V3-0324 |
| xAI | Grok-3, Grok-3 Mini |
That’s a credible lineup for prototyping. GPT-4o, Llama-3.3-70B, DeepSeek-R1, and Grok-3 are all here. You’re not missing a frontier model that matters for side-project use.
GitHub classifies models into tiers that govern free-tier rate limits: low complexity (smaller, faster models like GPT-4o Mini and Phi-4 Mini), high complexity (frontier models like GPT-4o and Llama-3.3-70B), and a specialized tier for reasoning-heavy models (DeepSeek-R1, Grok-3). The catalog API returns a rate_limit_tier field per model if you need to query it programmatically.
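To see which models share a rate-limit bucket, you can group catalog entries by that field. The sketch below is a minimal illustration that works on an already-parsed catalog response. The sample entries, the lowercase model IDs, and the tier values ("low", "high", "custom") are our assumptions for demonstration; only the rate_limit_tier field name comes from the docs described above.

```python
from collections import defaultdict

# Hypothetical sample shaped like a parsed catalog response.
# The IDs and tier values are illustrative, not copied from the live API.
SAMPLE_CATALOG = [
    {"id": "openai/gpt-4o-mini", "rate_limit_tier": "low"},
    {"id": "openai/gpt-4o", "rate_limit_tier": "high"},
    {"id": "deepseek/deepseek-r1", "rate_limit_tier": "custom"},
]


def group_by_tier(models):
    """Map each rate_limit_tier value to the sorted list of model IDs in it."""
    tiers = defaultdict(list)
    for model in models:
        # Entries without the field land in an "unknown" bucket.
        tiers[model.get("rate_limit_tier", "unknown")].append(model["id"])
    return {tier: sorted(ids) for tier, ids in tiers.items()}
```

Calling group_by_tier(SAMPLE_CATALOG) returns one list of model IDs per tier, which is enough to decide which limits row applies before you pick a model.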
One important product distinction: GitHub Models is separate from GitHub Copilot. Copilot Chat has its own licensed model routing. GitHub Models is for building your own AI applications — different infrastructure, different billing, different catalog.
Rate limits in practice
Official free-tier limits as of June 2026, from the prototyping docs:
| Tier | Requests/min | Requests/day | Input tokens/req | Output tokens/req | Concurrent |
|---|---|---|---|---|---|
| Low complexity | 15 | 150 | 8,000 | 4,000 | 5 |
| High complexity | 10 | 50 | 8,000 | 4,000 | 2 |
| Reasoning (DeepSeek-R1, Grok-3) | 1–2 | 8–15 | 4,000 | 4,000 | 1 |
150 low-complexity calls per day is enough to validate an idea when a human is on the other end of the interaction. A batch job, data pipeline, or automated agent loop running at the 15-requests-per-minute cap exhausts the daily budget in about ten minutes.
The docs carry an explicit caveat: these limits are subject to change without notice, and no SLA is published. You’re building on a public preview product with no uptime or rate-limit guarantees.
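One way to avoid burning the daily budget by accident is a client-side guard that refuses a call before it reaches the API. The sketch below is an assumption-laden illustration: RequestBudget is a hypothetical helper, the defaults are the low-complexity figures from the table, and it treats both limits as rolling windows, which may not match how GitHub actually resets them.

```python
import time
from collections import deque


class RequestBudget:
    """Client-side guard for a per-minute and per-day request allowance."""

    def __init__(self, per_minute=15, per_day=150, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock
        self.sent = deque()  # timestamps of requests sent in the last 24 hours

    def try_acquire(self):
        """Record a request and return True, or return False if a limit is reached."""
        now = self.clock()
        # Forget requests older than 24 hours (assumed rolling daily window).
        while self.sent and now - self.sent[0] >= 86_400:
            self.sent.popleft()
        in_last_minute = sum(1 for t in self.sent if now - t < 60)
        if len(self.sent) >= self.per_day or in_last_minute >= self.per_minute:
            return False
        self.sent.append(now)
        return True
```

Call try_acquire() before each API request and skip or queue the request when it returns False. The injectable clock exists only to make the class testable without waiting.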
When you hit the ceiling, three paths forward:
Upgrade your GitHub Copilot plan. GitHub Copilot Individual ($10/month) raises your GitHub Models rate limits above the free-tier defaults. It’s the fastest path to more headroom without leaving GitHub’s billing ecosystem.
Enable pay-as-you-go. GitHub bills pay-as-you-go usage in token units at $0.00001 per token unit, and each model converts actual tokens to token units through a multiplier. GPT-4o input tokens carry a 0.25× multiplier, so 1M actual tokens count as 250,000 token units, which comes to $2.50 per 1M tokens and matches OpenAI’s direct rate exactly. For other models, check the specific multipliers at docs.github.com/en/billing/reference/costs-for-github-models: cheaper models have lower multipliers, premium models higher.
Switch to a paid provider directly. If prototyping volume tells you the project has legs, evaluate the OpenAI API and Anthropic API directly. You lose the GitHub UX but gain more control over quotas, billing, and model selection without a proxied billing layer. If you want unified access to multiple providers while you evaluate options, see our OpenRouter vs. direct API comparison.
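The pay-as-you-go arithmetic above is easy to check in code. This is a minimal sketch: cost_usd is a hypothetical helper, and the only multiplier it is exercised with is the 0.25× GPT-4o input figure quoted in this article. Look up any other multiplier in GitHub's cost reference rather than guessing.

```python
from decimal import Decimal

# Price per token unit, as quoted in this article.
TOKEN_UNIT_PRICE_USD = Decimal("0.00001")


def cost_usd(tokens, multiplier):
    """Convert actual tokens to token units, then to US dollars."""
    token_units = Decimal(tokens) * Decimal(str(multiplier))
    return token_units * TOKEN_UNIT_PRICE_USD


# 1M GPT-4o input tokens at the 0.25x multiplier:
# 250,000 token units at $0.00001 each.
gpt4o_input_per_million = cost_usd(1_000_000, "0.25")
```

Decimal is used instead of float so that small per-token prices do not pick up rounding noise when multiplied across millions of tokens.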
OpenAI-compatible endpoint
The inference endpoint at https://models.github.ai implements the OpenAI API schema. Pointing any existing OpenAI SDK client at GitHub Models means changing two lines, base_url and api_key:
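```python
import os

from openai import OpenAI

# Point the standard OpenAI client at GitHub Models.
client = OpenAI(
    base_url="https://models.github.ai/inference",
    api_key=os.environ["GITHUB_TOKEN"],  # a GitHub PAT, not an OpenAI key
)

# Model IDs use the provider/model-name format.
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```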
The drop-in experience works, with two gotchas to watch:
Model ID format. GitHub Models uses provider/model-name identifiers — openai/gpt-4o, not gpt-4o. Any code that passes a bare model name to the endpoint will get a validation error. It’s a one-line fix, but it’ll surprise you if you copy from an OpenAI tutorial.
Parameter support varies. Not all OpenAI request parameters carry over across all catalog models. Multimodal models like Phi-4 Multimodal Instruct expect specific input formats for image content. Before assuming text-only behavior, check the catalog endpoint (GET https://models.github.ai/catalog/models) for supported_input_modalities and supported_output_modalities per model.
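Both gotchas can be guarded against with a couple of small helpers. The sketch below is illustrative only: qualify_model_id and accepts_images are hypothetical names, the default publisher is our choice, and the "image" modality value is an assumption about what the catalog returns. Only the provider/model-name format and the supported_input_modalities field name come from the text above.

```python
def qualify_model_id(model, default_publisher="openai"):
    """Return a provider/model-name ID, prefixing bare names with a publisher."""
    return model if "/" in model else f"{default_publisher}/{model}"


def accepts_images(catalog_entry):
    """True if a parsed catalog entry lists image input among its modalities."""
    # "image" is an assumed modality value; verify it against the live catalog.
    return "image" in catalog_entry.get("supported_input_modalities", [])
```

Running model names through qualify_model_id before each request lets code copied from an OpenAI tutorial keep working, and accepts_images gives you a cheap check before sending image content to a model that may reject it.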
Rate limit errors return HTTP 429 with response headers that identify which limit was hit — request rate vs. daily budget, minute vs. day. That granularity is useful: you can distinguish between calling too fast and exhausting your daily quota.
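For the per-minute case, retrying with exponential backoff is the usual remedy. The sketch below assumes nothing about GitHub's header names: RateLimited is a hypothetical stand-in for whatever your client raises on HTTP 429 (with the OpenAI Python SDK that would be openai.RateLimitError), and the retry counts and delays are arbitrary starting points.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def call_with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Delays grow as 1s, 2s, 4s, ... plus up to 0.5s of random jitter.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Backoff only helps when you are calling too fast. If the daily budget is exhausted, waiting a few seconds will not clear it, so a real client should stop retrying once it knows the daily limit is the one that was hit.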
One changelog note: the original Azure-hosted endpoint (models.inference.ai.azure.com) is deprecated. Older tutorials and blog posts still reference it. If your code uses that base URL, update it to models.github.ai.
Developer experience
Authentication. Create a fine-grained PAT at github.com/settings/tokens with the models:read scope. Classic PATs work too, with no additional scope needed. Set it as GITHUB_TOKEN in your environment, pass it as a Bearer token, and you’re in. Of the hosted LLM services we’ve tested, that’s the fastest path from zero to a working API call.
Playground. Zero setup — any signed-in GitHub user can open github.com/marketplace/models, pick a model, and start prompting in the browser. Since December 2024, the playground shows real-time latency, input token count, and output token count per request. That’s the fastest way to compare two models before writing any code. Use it to sanity-check model behavior and get a rough latency feel for your specific prompts.
VS Code. GitHub Models is accessible in VS Code through the AI Toolkit extension — a separate install from GitHub Copilot. AI Toolkit surfaces GitHub Models alongside Azure AI Foundry and local models in a Chat view. This is not the same as using Copilot Chat; the two products have separate infrastructure, separate billing, and separate model catalogs.
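If you would rather not pull in an SDK, the Bearer-token step described under Authentication looks like this with the standard library alone. This is a sketch under assumptions: build_chat_request is a hypothetical helper, and the /chat/completions path is the one an OpenAI-style client appends to the base URL, so confirm it against the inference API docs before relying on it.

```python
import json
import urllib.request


def build_chat_request(token, model, prompt):
    """Build (but do not send) a chat completion request with Bearer auth."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        # Assumed path: base URL plus the OpenAI-style chat completions route.
        "https://models.github.ai/inference/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen with os.environ["GITHUB_TOKEN"] as the token. Keeping construction separate from sending makes the request easy to inspect without spending a call from your daily budget.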
Latency
GitHub publishes no latency SLAs or benchmark data for the GitHub Models API endpoint. No third-party benchmarks specifically isolate the GitHub Models inference layer either — major benchmark sites test models at their native APIs.
What you can infer: free-tier inference runs on a shared Azure AI queue. Paid and enterprise tiers run on dedicated Azure AI deployments, which should reduce variance under concurrent load — but “should” is inference, not measurement. Use the playground’s per-request latency display to calibrate expectations for your actual prompts and models before committing to an architecture that depends on specific latency numbers.
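Once you move from the playground to code, a small timing harness gives you the same kind of calibration. The sketch below is a hypothetical helper, not a benchmark methodology: it times any zero-argument callable you pass in, and a handful of runs will only give a rough feel for variance.

```python
import statistics
import time


def measure_latency(fn, runs=5, timer=time.perf_counter):
    """Call fn `runs` times and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = timer()
        fn()
        samples.append(timer() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Wrap your actual API call in a lambda and pass it as fn. Every run counts against your rate limits, so keep runs small on the free tier.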
GitHub Models vs. HuggingFace Inference Providers
The most common comparison for “free LLM API” is HuggingFace. The picture shifted in 2025 when HuggingFace rebranded its Inference API as Inference Providers, routing large frontier LLMs through third-party partners (Groq, Together AI, SambaNova). The comparison now looks like this:
| Dimension | GitHub Models (free) | HuggingFace (free) |
|---|---|---|
| Rate model | Fixed RPM / RPD per tier | Credit budget (~$0.10/month) |
| Low-tier model calls/day | 150 guaranteed | Not applicable — credit-based |
| High-tier model calls/day | 50 guaranteed | ~8 frontier-model calls/month total (estimated from $0.10 credit at partner rates) |
| Frontier model access | Yes — within rate limits | Yes — credits consumed at provider rates |
| Production use | Explicitly banned on free tier | No explicit ban; $0.10/month cap makes it impractical |
| Auth | GitHub PAT (models:read) | HuggingFace user access token |
HuggingFace’s $0.10/month sounds flexible but runs out fast if you’re hitting frontier models through paid partner providers. If your use case is CPU-class models — BERT-scale classification, embeddings, smaller open models — HuggingFace’s native provider has a broader selection and may be the better fit. For LLMs specifically, GitHub Models gives a more predictable daily budget at no cost. Once you scale past free tiers and need to optimize costs across model sizes, our LLM cost routing deep-dive covers when routing cheap vs. expensive models pays off.
Verdict
Use GitHub Models if: You’re a solo developer or small team prototyping an LLM-powered feature, you want API access with zero friction and no credit card, and 150 small-model calls (or 50 frontier-model calls) per day covers your validation workload. The OpenAI-compatible endpoint means dropping it into existing code takes minutes, and moving to a direct provider later means changing only the base URL, the API key, and the model ID.
Use a paid API if: You have production traffic, need more than 50 frontier-model calls per day, care about latency SLAs, or need capabilities GitHub Models doesn’t offer — fine-tuning, embeddings (beyond what the catalog supports), or guaranteed uptime. At that point, evaluate the OpenAI API and Anthropic API head-to-head. GitHub’s pay-as-you-go rates match direct pricing for GPT-4o, but going direct removes one layer of proxy billing and gives you faster access to new model versions.
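If you expect to graduate from one column to the other, it helps to keep the provider-specific values in one place from day one. The sketch below is one possible layout, not a recommendation from GitHub or OpenAI: PROVIDERS and client_settings are hypothetical names, and the OpenAI entry assumes the standard https://api.openai.com/v1 base URL and the conventional OPENAI_API_KEY variable.

```python
import os

# Everything that differs between providers lives in this table.
PROVIDERS = {
    "github": {
        "base_url": "https://models.github.ai/inference",
        "api_key_env": "GITHUB_TOKEN",
        "model": "openai/gpt-4o-mini",  # provider/model-name format
    },
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "api_key_env": "OPENAI_API_KEY",
        "model": "gpt-4o-mini",  # bare model name
    },
}


def client_settings(provider, env=os.environ):
    """Resolve the base URL, API key, and model ID for a named provider."""
    cfg = PROVIDERS[provider]
    return {
        "base_url": cfg["base_url"],
        "api_key": env[cfg["api_key_env"]],
        "model": cfg["model"],
    }
```

Pass base_url and api_key to the OpenAI client constructor and model to each request. Switching providers then becomes a change to a single string.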
Caveats
GitHub Models is still in public preview as of June 2026. Rate limits, model availability, and pricing are all subject to change without notice; the old Azure inference endpoint, deprecated with limited warning, is one data point.
The free tier’s terms of service explicitly restrict use to prototyping and experimentation. Serving users on the free tier violates the ToS.
Model availability churns. The catalog has expanded steadily since the August 2024 launch; models that exist today may be removed or replaced.
Also see our Best AI Coding CLI in 2026 if you’re evaluating the broader landscape of AI developer tools beyond API access.
References
- GitHub Models prototyping docs (rate limits)
- Playground real-time token and latency metrics added (December 2024)
- GitHub Models billing overview
- GitHub Models cost reference (per-model multipliers)
- Inference REST API
- GitHub Models quickstart
- Responsible use of GitHub Models
- models:read scope now required (May 2025)
- Pay-as-you-go and BYOK launched (June 2025)
- GPT-5 added to GitHub Models (August 2025)
- GitHub Models built into repositories — public preview (May 2025)
- HuggingFace Inference Providers pricing