
GitHub Models 2026 — free LLM API for developers reviewed

We tested GitHub Models' free-tier LLM API: rate limits, OpenAI compatibility, and whether 150 calls a day is enough for a real side project.


GitHub Models gives every GitHub account free access to GPT-4o, Llama, DeepSeek, and a dozen other models through an OpenAI-compatible API. Use it if you’re prototyping a side project and want LLM access without a credit card. Don’t use it for production — the free tier’s terms of service explicitly bar it, and when you move to pay-as-you-go, you’re often better off going to the model provider directly.

Who this is for

Solo developers and small teams building a proof-of-concept who want zero-friction LLM access inside the GitHub ecosystem. If you have any real user traffic or plan to automate more than a few dozen calls per day, the rate limits will stop you before lunch.

What we tested

GitHub Models as of June 2026: the free-tier rate limits documented at docs.github.com/en/github-models, the OpenAI-compatible inference endpoint at https://models.github.ai, PAT-based authentication, and the browser playground. All rate-limit figures come from the official prototyping docs — we note where the source applies a “subject to change” caveat.

GitHub Models is still in public preview as of June 2026. No GA announcement has been made. Factor that into production planning.

Model catalog

GitHub’s catalog covers 15+ models across five providers as of June 2026:

| Provider | Models |
| --- | --- |
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini, GPT-5 |
| Microsoft | Phi-4, Phi-4 Mini Instruct, Phi-4 Multimodal Instruct |
| Meta | Llama-4 Maverick 17B, Llama-3.3-70B |
| DeepSeek | DeepSeek-R1, DeepSeek-R1-0528, DeepSeek-V3-0324 |
| xAI | Grok-3, Grok-3 Mini |

That’s a credible lineup for prototyping. GPT-4o, Llama-3.3-70B, DeepSeek-R1, and Grok-3 are all here. You’re not missing a frontier model that matters for side-project use.

GitHub classifies models into tiers that govern free-tier rate limits: low complexity (smaller, faster models like GPT-4o Mini and Phi-4 Mini), high complexity (frontier models like GPT-4o and Llama-3.3-70B), and a specialized tier for reasoning-heavy models (DeepSeek-R1, Grok-3). The catalog API returns a rate_limit_tier field per model if you need to query it programmatically.

One important product distinction: GitHub Models is separate from GitHub Copilot. Copilot Chat has its own licensed model routing. GitHub Models is for building your own AI applications — different infrastructure, different billing, different catalog.

Rate limits in practice

Official free-tier limits as of June 2026, from the prototyping docs:

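| Tier | Requests/min | Requests/day | Input tokens/req | Output tokens/req | Concurrent |
| --- | --- | --- | --- | --- | --- |
| Low complexity | 15 | 150 | 8,000 | 4,000 | 5 |
| High complexity | 10 | 50 | 8,000 | 4,000 | 2 |
| Reasoning (DeepSeek-R1, Grok-3) | 1–2 | 8–15 | 4,000 | 4,000 | 1 |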

150 low-complexity calls per day is enough to validate an idea when a human is on the other end of the interaction. A batch job, data pipeline, or automated agent loop running at the 15 requests/min cap exhausts that daily budget in about ten minutes.
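To avoid discovering the ceiling through a string of 429s, you can track the budget on the client. The sketch below is a minimal guard for the low-complexity numbers in the table above. `RequestBudget` is a hypothetical helper, not part of any SDK, and it assumes a rolling 24-hour window; the docs we read don't say when the real daily counter resets, so treat the window as a conservative guess.

```python
import time
from collections import deque


class RequestBudget:
    """Client-side guard for a per-minute and a per-day request cap."""

    def __init__(self, per_minute, per_day, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_day = per_day
        self.clock = clock      # injectable so the guard can be tested
        self.stamps = deque()   # timestamps of requests in the last 24 hours

    def try_acquire(self):
        """Record a request and return True only if both caps allow it."""
        now = self.clock()
        # Timestamps older than a day no longer count against the budget.
        while self.stamps and now - self.stamps[0] >= 86_400:
            self.stamps.popleft()
        in_last_minute = sum(1 for t in self.stamps if now - t < 60)
        if len(self.stamps) >= self.per_day or in_last_minute >= self.per_minute:
            return False
        self.stamps.append(now)
        return True


# Low-complexity tier: 15 requests/min, 150 requests/day.
budget = RequestBudget(per_minute=15, per_day=150)
```

Call `budget.try_acquire()` before each request and skip or queue the call when it returns `False`. The server remains the source of truth; this only keeps a well-behaved client from burning requests it already knows will fail.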

The docs carry an explicit caveat: these limits are subject to change without notice, and no SLA is published. You’re building on a public preview product with no uptime or rate-limit guarantees.

When you hit the ceiling, three paths forward:

Upgrade your GitHub Copilot plan. GitHub Copilot Pro (formerly Copilot Individual, $10/month) raises your GitHub Models rate limits above the free-tier defaults. It’s the fastest path to more headroom without leaving GitHub’s billing ecosystem.

Enable pay-as-you-go. GitHub bills pay-as-you-go usage in token units at $0.00001 per token unit, and each model applies a multiplier that converts actual tokens into token units. GPT-4o input tokens carry a 0.25× multiplier, so 1M actual tokens become 250,000 token units, which costs $2.50 and matches OpenAI’s direct rate exactly. For other models, check the specific multipliers at docs.github.com/en/billing/reference/costs-for-github-models; cheaper models have lower multipliers and premium models higher ones.

Switch to a paid provider directly. If prototyping volume tells you the project has legs, evaluate the OpenAI API and Anthropic API directly. You lose the GitHub UX but gain more control over quotas, billing, and model selection without a proxied billing layer. If you want unified access to multiple providers while you evaluate options, see our OpenRouter vs. direct API comparison.
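The token-unit arithmetic behind the pay-as-you-go option is easier to check in code. This sketch uses only the two numbers quoted above, the $0.00001 token-unit price and GPT-4o’s 0.25× input multiplier. `cost_usd` is a hypothetical helper; multipliers for any other model have to come from the billing reference, not from this example.

```python
from decimal import Decimal

# Price per token unit, as quoted in the pay-as-you-go paragraph.
TOKEN_UNIT_PRICE_USD = Decimal("0.00001")


def cost_usd(tokens, multiplier):
    """Dollar cost of `tokens` actual tokens at a given token-unit multiplier."""
    token_units = Decimal(tokens) * Decimal(str(multiplier))
    return token_units * TOKEN_UNIT_PRICE_USD


# GPT-4o input: 1M tokens at 0.25x is 250,000 token units, i.e. $2.50.
print(cost_usd(1_000_000, 0.25))
```

`Decimal` keeps the result exact; with floats, `0.00001` is not representable and small rounding errors creep into billing estimates.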

OpenAI-compatible endpoint

The inference endpoint at https://models.github.ai implements the OpenAI API schema. Pointing an existing OpenAI SDK client at GitHub Models means changing two lines, the base URL and the API key:

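```python
import os

from openai import OpenAI

# The two changed lines: base_url points at GitHub Models instead of
# api.openai.com, and api_key is a GitHub PAT instead of an OpenAI key.
client = OpenAI(
    base_url="https://models.github.ai/inference",
    api_key=os.environ["GITHUB_TOKEN"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # provider/model-name format
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```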

The drop-in experience works, with two gotchas to watch:

Model ID format. GitHub Models uses provider/model-name identifiers — openai/gpt-4o, not gpt-4o. Any code that passes a bare model name to the endpoint will get a validation error. It’s a one-line fix, but it’ll surprise you if you copy from an OpenAI tutorial.

Parameter support varies. Not all OpenAI request parameters carry over across all catalog models. Multimodal models like Phi-4 Multimodal Instruct expect specific input formats for image content. Before assuming text-only behavior, check the catalog endpoint (GET https://models.github.ai/catalog/models) for supported_input_modalities and supported_output_modalities per model.
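The modality check can be scripted against the catalog endpoint. The sketch below assumes the endpoint returns a JSON list of model objects, that each object carries an `id` field, that `"image"` is one of the modality strings, and that the endpoint accepts the same Bearer token as inference. Only the URL and the `supported_input_modalities` field name come from the paragraph above, so print one raw entry before relying on the rest.

```python
import json
import urllib.request

CATALOG_URL = "https://models.github.ai/catalog/models"


def fetch_catalog(token):
    """Fetch the catalog; assumes the body is a JSON list of model objects."""
    req = urllib.request.Request(
        CATALOG_URL,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def models_accepting(catalog, modality):
    """Return IDs of models whose supported_input_modalities include `modality`."""
    return [
        model["id"]  # "id" is an assumed field name
        for model in catalog
        if modality in model.get("supported_input_modalities", [])
    ]


# Usage (makes one network call):
#   catalog = fetch_catalog(os.environ["GITHUB_TOKEN"])
#   print(models_accepting(catalog, "image"))
```

Keeping the fetch and the filter separate means the filter can be exercised on a saved catalog snapshot without spending a request.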

Rate limit errors return HTTP 429 with response headers that identify which limit was hit: the per-minute request rate or the daily budget. That granularity is useful because it lets you distinguish between calling too fast and exhausting your daily quota.
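One way to act on that distinction is to decide, per 429, whether waiting is worthwhile. We didn’t record the exact GitHub-specific header names, so this sketch leans only on the standard HTTP `Retry-After` header and assumes the service sends it as a number of seconds. `plan_retry` and its 90-second cutoff are hypothetical: a short wait is treated as a per-minute limit, a long or missing one as a reason to stop.

```python
def plan_retry(status, headers, max_wait_s=90.0):
    """Return seconds to sleep before retrying, or None to give up."""
    if status != 429:
        return None
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        wait = float(lowered.get("retry-after"))
    except (TypeError, ValueError):
        return None  # header missing or not in seconds; inspect it by hand
    # A long wait suggests the daily budget is gone, so retrying is pointless.
    return wait if 0 <= wait <= max_wait_s else None
```

With the OpenAI Python SDK, a 429 surfaces as `openai.RateLimitError`, and `err.status_code` and `err.response.headers` can be passed straight into this function.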

One changelog note: the original Azure-hosted endpoint (models.inference.ai.azure.com) is deprecated. Older tutorials and blog posts still reference it. If your code uses that base URL, update it to models.github.ai.
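Both migration traps, the deprecated host and bare model names, can be caught before the first request. The sketch below is a hypothetical pre-flight helper. The `default_provider` fallback is our assumption and only makes sense if the bare names in your code are OpenAI models.

```python
DEPRECATED_HOST = "models.inference.ai.azure.com"
CURRENT_BASE_URL = "https://models.github.ai/inference"


def preflight(base_url, model, default_provider="openai"):
    """Return a corrected (base_url, model) pair plus a list of warnings."""
    warnings = []
    if DEPRECATED_HOST in base_url:
        warnings.append(f"deprecated host {DEPRECATED_HOST}; using {CURRENT_BASE_URL}")
        base_url = CURRENT_BASE_URL
    if "/" not in model:
        # GitHub Models expects provider/model-name, e.g. openai/gpt-4o.
        warnings.append(f"bare model name {model!r}; prefixing {default_provider!r}")
        model = f"{default_provider}/{model}"
    return base_url, model, warnings
```

Logging the warnings instead of silently rewriting makes it obvious which call sites still carry settings copied from an older tutorial.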

Developer experience

Authentication. Create a fine-grained PAT at github.com/settings/tokens with the models:read scope. Classic PATs work too — no additional scope needed. Set it as GITHUB_TOKEN in your environment, pass it as a Bearer token, and you’re in. That’s the fastest path from zero to a working API call of any hosted LLM service we’ve tested.
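If you’d rather not pull in an SDK, the Bearer-token flow works with the standard library alone. This sketch builds the request without sending it. The `/chat/completions` path is an assumption, inferred from how the OpenAI SDK appends that path to the `https://models.github.ai/inference` base URL, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed path: the inference base URL plus the OpenAI-style chat route.
CHAT_URL = "https://models.github.ai/inference/chat/completions"


def build_chat_request(token, model, prompt):
    """Build (but do not send) a chat completion request with Bearer auth."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# To send it:
#   req = build_chat_request(os.environ["GITHUB_TOKEN"], "openai/gpt-4o-mini", "Hello")
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(json.load(resp))
```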

Playground. Zero setup — any signed-in GitHub user can open github.com/marketplace/models, pick a model, and start prompting in the browser. Since December 2024, the playground shows real-time latency, input token count, and output token count per request. That’s the fastest way to compare two models before writing any code. Use it to sanity-check model behavior and get a rough latency feel for your specific prompts.

VS Code. GitHub Models is accessible in VS Code through the AI Toolkit extension — a separate install from GitHub Copilot. AI Toolkit surfaces GitHub Models alongside Azure AI Foundry and local models in a Chat view. This is not the same as using Copilot Chat; the two products have separate infrastructure, separate billing, and separate model catalogs.

Latency

GitHub publishes no latency SLAs or benchmark data for the GitHub Models API endpoint. No third-party benchmarks specifically isolate the GitHub Models inference layer either — major benchmark sites test models at their native APIs.

What you can infer: free-tier inference runs on a shared Azure AI queue. Paid and enterprise tiers run on dedicated Azure AI deployments, which should reduce variance under concurrent load — but “should” is inference, not measurement. Use the playground’s per-request latency display to calibrate expectations for your actual prompts and models before committing to an architecture that depends on specific latency numbers.
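Once you move from the playground to code, the same calibration can be done with a small timing harness. The sketch below is a hypothetical helper that times any zero-argument callable; it measures wall-clock time on your machine, network included, and says nothing about server-side latency.

```python
import statistics
import time


def measure_latency(call, runs=10):
    """Time `call()` `runs` times; return median and max wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "max_s": max(samples),
        "runs": runs,
    }
```

Pass it a lambda that wraps one chat completion call with your real prompt. Every run spends a request, so ten runs against a high-complexity model uses a fifth of the 50-per-day budget.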

GitHub Models vs. HuggingFace Inference Providers

The most common comparison for “free LLM API” is HuggingFace. The picture shifted in 2025 when HuggingFace rebranded its Inference API as Inference Providers, routing large frontier LLMs through third-party partners (Groq, Together AI, SambaNova). The comparison now looks like this:

| Dimension | GitHub Models (free) | HuggingFace (free) |
| --- | --- | --- |
| Rate model | Fixed RPM / RPD per tier | Credit budget (~$0.10/month) |
| Low-tier model calls/day | 150 (documented limit, no SLA) | Not applicable (credit-based) |
| High-tier model calls/day | 50 (documented limit, no SLA) | ~8 frontier-model calls/month total (estimated from $0.10 credit at partner rates) |
| Frontier model access | Yes, within rate limits | Yes, with credits consumed at provider rates |
| Production use | Explicitly banned on free tier | No explicit ban; $0.10/month cap makes it impractical |
| Auth | GitHub PAT (models:read) | HuggingFace user access token |

HuggingFace’s $0.10/month sounds flexible but runs out fast if you’re hitting frontier models through paid partner providers. If your use case is CPU-class models — BERT-scale classification, embeddings, smaller open models — HuggingFace’s native provider has a broader selection and may be the better fit. For LLMs specifically, GitHub Models gives a more predictable daily budget at no cost. Once you scale past free tiers and need to optimize costs across model sizes, our LLM cost routing deep-dive covers when routing cheap vs. expensive models pays off.

Verdict

Use GitHub Models if: You’re a solo developer or small team prototyping an LLM-powered feature, you want API access with zero friction and no credit card, and 150 small-model calls (or 50 frontier-model calls) per day covers your validation workload. The OpenAI-compatible endpoint means dropping it into existing code takes minutes, and swapping back to direct providers later is a two-line change.

Use a paid API if: You have production traffic, need more than 50 frontier-model calls per day, care about latency SLAs, or need capabilities GitHub Models doesn’t offer — fine-tuning, embeddings (beyond what the catalog supports), or guaranteed uptime. At that point, evaluate the OpenAI API and Anthropic API head-to-head. GitHub’s pay-as-you-go rates match direct pricing for GPT-4o, but going direct removes one layer of proxy billing and gives you faster access to new model versions.

Caveats

GitHub Models is still in public preview as of June 2026. Rate limits, model availability, and pricing are all subject to change without notice. The old Azure inference endpoint, deprecated with limited warning, is one data point for how quickly things can shift.

The free tier’s terms of service explicitly restrict use to prototyping and experimentation. Serving users on the free tier violates the ToS.

Model availability churns. The catalog has expanded steadily since the August 2024 launch; models that exist today may be removed or replaced.

Also see our Best AI Coding CLI in 2026 if you’re evaluating the broader landscape of AI developer tools beyond API access.

References