
How to fine-tune a small LLM in 2026 (LoRA on a laptop)

Fine-tune Llama 3.1 8B with QLoRA on a consumer GPU — pinned Unsloth install, exact training config, GGUF export to Ollama, and eight failure modes.

Fine-tuning Llama 3.1 8B on consumer hardware is realistic in 2026. With Unsloth v0.1.39-beta and QLoRA 4-bit quantization, an RTX 3080 (10 GB VRAM) is enough — the same operation OOMs on vanilla HuggingFace PEFT. Trainable parameters are 42M out of 8B total (0.52%), and for a 500–1,000 example dataset you’re looking at 20–60 minutes on a free Colab T4.

Who this is for

Developers who need a model that reliably produces a specific output format, follows a house style, or handles a narrow domain better than the base model. You need a GPU with at least 6 GB VRAM, Python 3.10+, and a small dataset. If you’re on macOS Apple Silicon without a discrete GPU, training isn’t available via Unsloth as of May 2026 — inference works, training doesn’t.

Prerequisites

Hardware minimums (QLoRA 4-bit, Unsloth)

| Model | VRAM required |
| --- | --- |
| Llama 3.2 3B | 3.5 GB |
| Mistral 7B | 5 GB |
| Llama 3.1 8B | 6 GB |
| Llama 3.3 70B | 41 GB |

The tutorial targets Llama 3.1 8B. If you’re below 6 GB, drop to 3B (unsloth/Llama-3.2-3B-bnb-4bit). If you have no GPU at all, Lambda Labs offers RTX 3090 instances at around $0.50/hr for short training runs.
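As a quick way to apply the table, here is a minimal sketch that returns the largest listed model whose minimum fits a given amount of VRAM. The function name `largest_model_that_fits` is hypothetical, and the thresholds are copied from the table above rather than measured.

```python
# VRAM minimums copied from the table above (QLoRA 4-bit, Unsloth).
# Ordered from largest to smallest model.
VRAM_MINIMUMS_GB = [
    ("Llama 3.3 70B", 41.0),
    ("Llama 3.1 8B", 6.0),
    ("Mistral 7B", 5.0),
    ("Llama 3.2 3B", 3.5),
]


def largest_model_that_fits(vram_gb):
    """Return the largest table entry whose minimum fits, or None."""
    for name, minimum_gb in VRAM_MINIMUMS_GB:
        if vram_gb >= minimum_gb:
            return name
    return None


print(largest_model_that_fits(10.0))  # RTX 3080 -> Llama 3.1 8B
print(largest_model_that_fits(4.0))   # -> Llama 3.2 3B
print(largest_model_that_fits(2.0))   # -> None
```

If PyTorch is installed, `torch.cuda.get_device_properties(0).total_memory / 1024**3` gives the figure to pass in. Expect it to read slightly below the marketing number for the card.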

If you’re still deciding whether to fine-tune versus calling a hosted API, "The real cost of running an AI agent team in 2026" breaks down the TCO math, including GPU rental.

Install the toolchain

The Unsloth installer automatically pins compatible versions of transformers, trl, peft, and datasets. Several 4.x point releases (of datasets in particular) cause recursion errors or training bugs, and the installer excludes them.

curl -fsSL https://unsloth.ai/install.sh | sh

Verify the versions it resolved:

pip show unsloth peft trl transformers
# unsloth           0.1.39b0
# peft              0.19.1
# trl               0.18.2   (or 0.23+ — see §6 Troubleshooting)
# transformers      4.51.3   (or a later compatible point release)

Failure mode: if you installed trl separately and it resolved to v0.23 or later, the training snippet in §3 still runs, because transformers.TrainingArguments is accepted, but it misses the SFT-specific settings that SFTConfig (from trl) provides. Either pin trl<0.23 or switch the import (see §6).
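To see which side of the 0.23 cut-off an environment is on, a small standard-library check is enough. This is a sketch: the helper names are hypothetical, and the cut-off is the one this article uses, not something read from TRL itself.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Turn '0.18.2' or '0.23.0.dev0' into (0, 18) or (0, 23)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def recommended_config_class(trl_version):
    """Apply the article's 0.23 cut-off to a trl version string."""
    if parse_major_minor(trl_version) >= (0, 23):
        return "trl.SFTConfig"
    return "transformers.TrainingArguments"


try:
    installed = version("trl")
    print(installed, "->", recommended_config_class(installed))
except PackageNotFoundError:
    print("trl is not installed in this environment")
```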

Download the base model

The 4-bit pre-quantized checkpoint is ~5.4 GB vs ~16 GB for the 16-bit version. Always use the bnb-4bit variant:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,  # auto-detects BF16 on Ampere and later
)

On an 8 GB GPU, Unsloth fits 2,972 tokens of context for Llama 3.1 8B. Vanilla HuggingFace with Flash Attention 2 OOMs on the same hardware.

Data prep

Format

Use ShareGPT JSONL with a conversations list. This is what the canonical mlabonne notebook uses and what Unsloth’s train_on_responses_only helper expects:

{"conversations": [{"from": "human", "value": "Summarize this contract clause in plain English: ..."}, {"from": "gpt", "value": "This clause means the vendor can ..."}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Each line is one training example. The from field must be either "human" or "gpt"; other values break the ChatML template mapping.
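Because a single malformed line can derail a run, it is worth checking the file before loading it. The validator below is a minimal sketch using only the standard library. The function names are hypothetical, and it enforces only the rules stated above: a `conversations` list whose turns carry `from` set to "human" or "gpt" and a non-empty `value`.

```python
import json

ALLOWED_ROLES = {"human", "gpt"}  # the two values the format above requires


def validate_sharegpt_line(line):
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index}: not an object")
            continue
        if turn.get("from") not in ALLOWED_ROLES:
            problems.append(f"turn {index}: unexpected 'from' value {turn.get('from')!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"turn {index}: missing or empty 'value'")
    return problems


def validate_file(path):
    """Yield (line_number, problems) for every bad line in a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                problems = validate_sharegpt_line(line)
                if problems:
                    yield number, problems
```

Running `for number, problems in validate_file("my_data.jsonl"): print(number, problems)` lists every offending line; no output means the file passed these checks.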

How many examples

| Task | Minimum examples |
| --- | --- |
| Classification (per class) | 100–300 |
| Structured extraction | 200–500 |
| General instruction-following | 500–1,000 |
| Content generation | 500–2,000 |
| Complex domain (legal, medical) | 1,000–5,000 |

Quality beats quantity. 200 carefully curated examples consistently outperform 2,000 hastily collected ones. One malformed example can teach a pattern that persists through the full run.

Load with datasets

from datasets import load_dataset

dataset = load_dataset("json", data_files="my_data.jsonl", split="train")

Failure mode: Some datasets 4.x point releases cause recursion errors — Unsloth’s installer excludes them automatically. If you’re using a pre-existing environment, check pip show datasets and reinstall via the Unsloth script to get a pinned-compatible version.

Fine-tuning run

Add the LoRA adapter, then run the trainer. This config is the confirmed working setup from mlabonne’s canonical notebook, adapted for a small local dataset.

from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset

# (assuming model, tokenizer, dataset from previous steps)

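# Apply the ChatML template (it reads "role"/"content" keys)
tokenizer = get_chat_template(tokenizer, chat_template="chatml")

# ShareGPT stores turns as "from"/"value"; convert to the keys the template reads
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def format_conversations(examples):
    texts = []
    for conv in examples["conversations"]:
        messages = [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conv]
        texts.append(
            tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        )
    return {"text": texts}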

dataset = dataset.map(format_conversations, batched=True)

# Attach LoRA adapter — trains 42M of 8B params (0.52%)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,                        # Rank-Stabilized LoRA — more stable at r=16
    use_gradient_checkpointing="unsloth",   # Unsloth's custom gradient checkpointing — extra VRAM savings
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,      # effective batch = 8
        num_train_epochs=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        output_dir="outputs",
        logging_steps=10,
    ),
)

trainer.train()
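The 42M trainable-parameter figure in the adapter comment can be checked by hand. The sketch below assumes the Llama 3.1 8B layer shapes from the published model config (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers). These numbers are not stated in this article. For a linear layer with `in` inputs and `out` outputs, LoRA adds two matrices with `r * in` and `out * r` entries.

```python
# Llama 3.1 8B shapes, taken from the published model config
# (an assumption of this sketch; the article does not list them).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024      # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32
RANK = 16          # r=16, as in get_peft_model above

# (input features, output features) for each targeted linear layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) to each adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# 1,310,720 per layer, 41,943,040 in total
```

The total rounds to the 42M quoted above. Halving the rank to r=8 halves it, since the count is linear in `RANK`.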

What to expect

| Hardware | Dataset size | Approx. time |
| --- | --- | --- |
| Colab T4 (16 GB) | 500–2,000 examples | 20–60 min |
| Colab T4 (16 GB) | 100k examples | ~47 hr |
| A100 40 GB | 100k examples | ~4 hr 45 min |
| RTX 3080 (10 GB) | 500–2,000 examples | 30–90 min |

First run is slow — torch.compile warmup takes up to 5 minutes. Measure throughput only after the first few steps stabilize.

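Failure mode (OOM): reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to keep the effective batch constant. If the OOM happens during evaluation rather than training, set fp16_full_eval=True and eval_accumulation_steps=4 instead; maximum_memory_usage applies to saving, not evaluation.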
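The "keep the effective batch constant" advice is simple arithmetic, sketched below with hypothetical helper names. The step count is an upper-bound estimate: with packing=True several short examples share one sequence, so real runs take fewer steps.

```python
import math


def effective_batch(per_device_batch, grad_accum_steps, num_gpus=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_sequences, per_device_batch, grad_accum_steps, epochs=1):
    """Rough number of optimizer steps; an estimate, not the Trainer's exact count."""
    batch = effective_batch(per_device_batch, grad_accum_steps)
    return math.ceil(num_sequences / batch) * epochs


print(effective_batch(2, 4))          # config above -> 8
print(effective_batch(1, 8))          # OOM fallback, same effective batch -> 8
print(optimizer_steps(1000, 2, 4))    # -> 125
print(optimizer_steps(500, 1, 8))     # -> 63
```

With logging_steps=10, a 63-step run produces only six log lines, so lower logging_steps on small datasets if you want a readable loss curve.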

Evaluating the result

Watch the training loss in the logs. Healthy ranges:

| Loss value | What it means |
| --- | --- |
| 1.5–2.5 | Too high: model isn’t learning; check data formatting |
| 0.5–1.0 | Healthy range |
| 0.1–0.4 | Getting low: one epoch is likely enough |
| Near 0.0 | Overfitting: reduce epochs or increase dataset size |
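To apply the table without eyeballing the logs, the sketch below classifies the most recent logged loss. It assumes Trainer-style log records (a list of dicts, some with a "loss" key, as found in `trainer.state.log_history`). The `NEAR_ZERO` cut-off is a hypothetical choice, since the table gives no number for "near 0.0".

```python
# Bands copied from the table above; values between bands are reported as such.
LOSS_BANDS = [
    (1.5, 2.5, "too high: check data formatting"),
    (0.5, 1.0, "healthy"),
    (0.1, 0.4, "getting low: one epoch is likely enough"),
]
NEAR_ZERO = 0.05  # hypothetical threshold for "near 0.0"


def describe_loss(loss):
    """Map a loss value onto the bands from the table."""
    if loss <= NEAR_ZERO:
        return "near zero: possible overfitting"
    for low, high, label in LOSS_BANDS:
        if low <= loss <= high:
            return label
    return "outside the listed bands"


def last_logged_loss(log_history):
    """Return the most recent 'loss' value from Trainer-style log records."""
    for record in reversed(log_history):
        if "loss" in record:
            return record["loss"]
    return None


example_logs = [{"loss": 1.9, "step": 10}, {"loss": 0.8, "step": 20}, {"train_loss": 1.1}]
print(describe_loss(last_logged_loss(example_logs)))  # healthy
```

After a real run, pass `trainer.state.log_history` in place of `example_logs`.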

After training, do a quick vibe-check before exporting:

FastLanguageModel.for_inference(model)

messages = [{"role": "user", "content": "Your test prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

If the output looks coherent and on-task, the model is ready to export. If it repeats itself or generates gibberish, check the chat template and loss curve before assuming the training failed.

Quantizing to GGUF for local inference

Unsloth exports directly to GGUF and writes the Ollama Modelfile automatically, handling chat template mapping for supported families including Llama-3, Mistral, Phi-3, and Gemma. Don’t write the Modelfile yourself — template mismatch is the most common post-export complaint.

# Q4_K_M: good quality/size balance, ~4.5 GB output
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")

Quantization options:

| Method | Output size | Quality |
| --- | --- | --- |
| q8_0 | ~8 GB | Near-lossless; use for quality checks |
| q5_k_m | ~5.5 GB | Good midpoint |
| q4_k_m | ~4.5 GB | Recommended for production |

Failure mode — OOM during save: Pass maximum_memory_usage=0.5 to save_pretrained_gguf. The default is 0.75, which can push an 8 GB card over the edge.

Once the GGUF files are written:

# Unsloth wrote model_gguf/Modelfile — use it directly
ollama create my-fine-tuned-model -f model_gguf/Modelfile
ollama run my-fine-tuned-model

Test with the same prompt you used during the vibe-check. If the output matches what you saw in Python, the export is clean.
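To run that comparison from a script instead of the interactive prompt, you can call Ollama's local HTTP API. The sketch below uses only the standard library and assumes Ollama is serving on its default address, `http://localhost:11434`. The helper names are hypothetical, and the model name is the one created above.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(model_name, prompt):
    """Build a non-streaming chat request for a local Ollama server."""
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(model_name, prompt, timeout=120):
    """Send the prompt and return the reply text (requires a running Ollama server)."""
    request = build_chat_request(model_name, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `ask_ollama("my-fine-tuned-model", "Your test prompt here")` returns the reply as a string. Sampling is random in both places, so compare format and style rather than expecting identical text.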

For a deeper look at running GGUF models day-to-day (Ollama vs LM Studio throughput, memory efficiency, and API compatibility), see "Ollama vs LM Studio on Mac: which survives daily use?"

What to do if it doesn’t work

OOM during training

  1. Set per_device_train_batch_size=1, raise gradient_accumulation_steps to compensate.
  2. Reduce max_seq_length to 1024.
  3. If the OOM occurs during evaluation, set fp16_full_eval=True and add eval_accumulation_steps=4.

Loss stuck at 0 (not NaN — just flat at 0.0)

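Cause: train_on_responses_only is masking everything because the delimiters you passed don’t appear in the formatted training text, so no response tokens are left to train on.

Fix: pass delimiters that match the chat template you formatted the data with. The snippet below uses the exact Llama 3.1 delimiters; if you formatted with the ChatML template from §3, use "<|im_start|>user\n" and "<|im_start|>assistant\n" instead: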

from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)

If you’re not using train_on_responses_only, check that your JSONL has "from": "gpt" (not "from": "assistant") — the ShareGPT template maps gpt to the assistant role, not the literal string assistant.
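A quick way to confirm a delimiter mismatch is to count how often each delimiter occurs in one formatted example. The sketch below is plain string matching, whereas the real helper matches on tokens, so treat it as a first check only. The function name is hypothetical, and the sample string is a hand-written ChatML conversation rather than output from a real tokenizer.

```python
def count_delimiters(formatted_text, instruction_part, response_part):
    """Count occurrences of each delimiter in one formatted training example."""
    return {
        "instruction": formatted_text.count(instruction_part),
        "response": formatted_text.count(response_part),
    }


# Hand-written ChatML sample, for illustration only.
chatml_sample = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello<|im_end|>\n"
)

llama31_parts = (
    "<|start_header_id|>user<|end_header_id|>\n\n",
    "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
chatml_parts = ("<|im_start|>user\n", "<|im_start|>assistant\n")

print(count_delimiters(chatml_sample, *llama31_parts))  # {'instruction': 0, 'response': 0}
print(count_delimiters(chatml_sample, *chatml_parts))   # {'instruction': 1, 'response': 1}
```

On your own data, pass `dataset[0]["text"]` as the first argument. A zero count for the response delimiter means every token would be masked.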

Loss NaN

  1. Learning rate too high — reduce learning_rate from 3e-4 to 1e-4.
  2. Malformed JSONL — a single line with the wrong structure (missing conversations key, mixed templates) can NaN the loss. Validate the dataset: assert all("conversations" in ex for ex in dataset).
  3. Gradient explosion — add max_grad_norm=0.3 to TrainingArguments (default is 1.0).

Catastrophic forgetting

Symptom: the model only outputs text that looks like your training data, losing general capability.

Mitigations:

  • Lower the rank: r=8 instead of r=16.
  • Fewer epochs: 1 is usually enough.
  • Mix in 10–20% general-domain examples from a dataset like mlabonne/FineTome-100k.
  • Add a small dropout: lora_dropout=0.05.

LoRA is inherently more resistant to catastrophic forgetting than full fine-tuning, but it’s not immune with small, homogeneous datasets.
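The 10–20% mixing suggestion above can be sketched as follows. The helper names and the default seed are hypothetical, the fraction is measured against the final mixed dataset, and the general-domain pool is assumed to be a list of examples already in the same format as your task data.

```python
import random


def general_examples_needed(num_task_examples, general_fraction):
    """How many general examples make up `general_fraction` of the final mix."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    return round(num_task_examples * general_fraction / (1 - general_fraction))


def mix_examples(task_examples, general_pool, general_fraction, seed=3407):
    """Return task examples plus a random sample of general ones, shuffled."""
    rng = random.Random(seed)
    wanted = general_examples_needed(len(task_examples), general_fraction)
    wanted = min(wanted, len(general_pool))  # never ask for more than the pool holds
    mixed = list(task_examples) + rng.sample(list(general_pool), wanted)
    rng.shuffle(mixed)
    return mixed


print(general_examples_needed(500, 0.2))  # 125 general + 500 task = 20% general
print(general_examples_needed(500, 0.1))  # 56
```

For 500 task examples, a 20% share means adding 125 general examples, for 625 in total.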

TRL 0.23+: TrainingArguments misses SFT-specific settings

SFTConfig (from trl) has been the recommended config class for SFTTrainer since TRL 0.9. It holds the SFT-specific arguments that were previously passed directly to SFTTrainer.__init__. transformers.TrainingArguments still works, so the import itself does not fail, but it lacks those SFT-specific settings. If you’re on TRL 0.23 or later, switch to:

# trl >= 0.23
from trl import SFTConfig

args = SFTConfig(
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    # ... rest of args unchanged
    output_dir="outputs",
)

The rest of the trainer setup stays the same. If you want to pin the older API: pip install "trl<0.23".

Chat template mismatch after GGUF export

Symptom: model generates infinite repetition or gibberish after ollama run.

Fix: always use the Modelfile Unsloth writes to model_gguf/Modelfile. Don’t create a custom one. If you already created one manually, delete it and re-run save_pretrained_gguf.

CUDA compilation errors

Set UNSLOTH_COMPILE_DISABLE=1 before starting training. A separate issue, slow downloads stalling at 90–95%, is addressed by UNSLOTH_STABLE_DOWNLOADS=1. Both are environment variables: set them in your shell, or at the top of your script before importing unsloth, with os.environ["UNSLOTH_COMPILE_DISABLE"] = "1".

macOS Apple Silicon

Unsloth’s Python training API doesn’t support MLX yet (as of May 2026). Inference on GGUF models works fine via Ollama. For training on an M-series Mac, use Axolotl with MPS backend directly — it’s slower and uses more RAM than CUDA, but it works.
