How to fine-tune a small LLM in 2026 (LoRA on a laptop)
Fine-tune Llama 3.1 8B with QLoRA on a consumer GPU — pinned Unsloth install, exact training config, GGUF export to Ollama, and eight failure modes.
By Ethan
Fine-tuning Llama 3.1 8B on consumer hardware is realistic in 2026. With Unsloth v0.1.39-beta and 4-bit QLoRA, an RTX 3080 (10 GB VRAM) is enough, whereas the same training run OOMs with vanilla HuggingFace PEFT. Only about 42M of the 8B parameters are trainable (0.52%), and a 500–1,000 example dataset trains in roughly 20–60 minutes on a free Colab T4.
Who this is for
Developers who need a model that reliably produces a specific output format, follows a house style, or handles a narrow domain better than the base model. You need a GPU with at least 6 GB VRAM, Python 3.10+, and a small dataset. If you’re on macOS Apple Silicon without a discrete GPU, training isn’t available via Unsloth as of May 2026 — inference works, training doesn’t.
Prerequisites
Hardware minimums (QLoRA 4-bit, Unsloth)
| Model | VRAM required |
|---|---|
| Llama 3.2 3B | 3.5 GB |
| Mistral 7B | 5 GB |
| Llama 3.1 8B | 6 GB |
| Llama 3.3 70B | 41 GB |
The tutorial targets Llama 3.1 8B. If you’re below 6 GB, drop to 3B (unsloth/Llama-3.2-3B-bnb-4bit). If you have no GPU at all, Lambda Labs offers RTX 3090 instances at around $0.50/hr for short training runs.
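One way to read the table above is as a lookup: given the VRAM you have, pick the most demanding model whose minimum still fits. The sketch below hard-codes the table's figures. The function name `largest_model_that_fits` is made up for illustration, and because the numbers are minimums, a card sitting right on the threshold leaves little room for longer contexts.

```python
# VRAM minimums copied from the table above (QLoRA 4-bit with Unsloth).
VRAM_MINIMUM_GB = {
    "Llama 3.2 3B": 3.5,
    "Mistral 7B": 5.0,
    "Llama 3.1 8B": 6.0,
    "Llama 3.3 70B": 41.0,
}


def largest_model_that_fits(vram_gb):
    """Return the most demanding model whose minimum VRAM is <= vram_gb, or None."""
    candidates = [(need, name) for name, need in VRAM_MINIMUM_GB.items() if need <= vram_gb]
    if not candidates:
        return None
    return max(candidates)[1]


print(largest_model_that_fits(10))  # Llama 3.1 8B (e.g. an RTX 3080)
print(largest_model_that_fits(4))   # Llama 3.2 3B
print(largest_model_that_fits(2))   # None: rent a GPU instead
```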
If you’re still deciding whether to fine-tune versus calling a hosted API, the real cost of running an AI agent team in 2026 breaks down the TCO math including GPU rental.
Install the toolchain
The Unsloth installer pins compatible versions of transformers, trl, peft, and datasets automatically. Several 4.x point releases (datasets 4.x in particular, covered under Data prep) cause recursion errors or training bugs, and the installer excludes them.
curl -fsSL https://unsloth.ai/install.sh | sh
Verify the versions it resolved:
pip show unsloth peft trl transformers
# unsloth 0.1.39b0
# peft 0.19.1
# trl 0.18.2 (or 0.23+ — see §6 Troubleshooting)
# transformers 4.51.3 (or a later compatible point release)
Failure mode: If you installed trl separately and it resolved to v0.23 or later, SFTConfig (from trl) is the recommended config class for SFTTrainer. The training snippet in §3 uses transformers.TrainingArguments, which still works but misses SFT-specific settings — pin trl<0.23 or update the import (see §6).
Download the base model
The 4-bit pre-quantized checkpoint is ~5.4 GB vs ~16 GB for the 16-bit version. Always use the bnb-4bit variant:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=None,  # auto-detects BF16 on Ampere and later
)
On an 8 GB GPU, Unsloth fits 2,972 tokens of context for Llama 3.1 8B. Vanilla HuggingFace with Flash Attention 2 OOMs on the same hardware.
Data prep
Format
Use ShareGPT JSONL with a conversations list. This is what the canonical mlabonne notebook uses and what Unsloth’s train_on_responses_only helper expects:
{"conversations": [{"from": "human", "value": "Summarize this contract clause in plain English: ..."}, {"from": "gpt", "value": "This clause means the vendor can ..."}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
Each line is one training example. The from field must be "human" and "gpt" — other values break the ChatML template mapping.
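Because one malformed line is enough to derail a run, it can help to check the file before loading it. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the rules it enforces (a non-empty `conversations` list, `from` restricted to `human`/`gpt`, a non-empty string `value`) are this article's format requirements, not a check that Unsloth ships.

```python
import json

ALLOWED_SPEAKERS = {"human", "gpt"}


def validate_sharegpt_line(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("from") not in ALLOWED_SPEAKERS:
            problems.append(f"turn {i}: 'from' is {turn.get('from')!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"turn {i}: 'value' is missing or empty")
    return problems


def validate_sharegpt_file(path):
    """Map 1-based line numbers to their problems for every bad line in the file."""
    bad = {}
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            problems = validate_sharegpt_line(line)
            if problems:
                bad[number] = problems
    return bad
```

Calling `validate_sharegpt_file("my_data.jsonl")` returns an empty dict for a clean file. Anything else is a line number paired with the reasons it was rejected.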
How many examples
| Task | Minimum |
|---|---|
| Classification (per class) | 100–300 |
| Structured extraction | 200–500 |
| General instruction-following | 500–1,000 |
| Content generation | 500–2,000 |
| Complex domain (legal, medical) | 1,000–5,000 |
Quality beats quantity. 200 carefully curated examples consistently outperform 2,000 hastily collected ones. One malformed example can teach a pattern that persists through the full run.
Load with datasets
from datasets import load_dataset
dataset = load_dataset("json", data_files="my_data.jsonl", split="train")
Failure mode: Some datasets 4.x point releases cause recursion errors — Unsloth’s installer excludes them automatically. If you’re using a pre-existing environment, check pip show datasets and reinstall via the Unsloth script to get a pinned-compatible version.
Fine-tuning run
Add the LoRA adapter, then run the trainer. This config is the confirmed working setup from mlabonne’s canonical notebook, adapted for a small local dataset.
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments
# (assuming model, tokenizer, dataset from previous steps)
# Apply ChatML template
tokenizer = get_chat_template(tokenizer, chat_template="chatml")
# ShareGPT speaker names -> the role names the chat template expects
ROLE_MAP = {"human": "user", "gpt": "assistant"}
def format_conversations(examples):
    texts = []
    for conv in examples["conversations"]:
        # Convert {"from", "value"} turns into {"role", "content"} messages
        messages = [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conv]
        texts.append(
            tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        )
    return {"text": texts}
dataset = dataset.map(format_conversations, batched=True)
# Attach LoRA adapter — trains 42M of 8B params (0.52%)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,  # Rank-Stabilized LoRA, more stable at r=16
    use_gradient_checkpointing="unsloth",  # Unsloth's custom gradient checkpointing, extra VRAM savings
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,
    args=TrainingArguments(
        learning_rate=3e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch = 8
        num_train_epochs=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        output_dir="outputs",
        logging_steps=10,
    ),
)
trainer.train()
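The 42M figure quoted for this adapter can be reproduced by hand, which is a useful sanity check when you change `r` or `target_modules`. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP width 14336, 32 layers, 8 key/value heads of dimension 128) and uses roughly 8.03B as the total weight count. It is plain arithmetic, not a call into PEFT or Unsloth.

```python
import math

# Llama 3.1 8B dimensions (from the published model config).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # 8 key/value heads x 128 head_dim (grouped-query attention)
LAYERS = 32

# (in_features, out_features) of each projection targeted by the adapter.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(r):
    """LoRA adds A (r x in) and B (out x r) per target, i.e. r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGETS.values())
    return per_layer * LAYERS


r, lora_alpha = 16, 16
trainable = lora_trainable_params(r)
print(f"{trainable:,} trainable parameters")          # 41,943,040 trainable parameters
print(f"{trainable / 8.03e9:.2%} of ~8.03B weights")  # 0.52% of ~8.03B weights

# Adapter scaling factor: classic LoRA vs rank-stabilized LoRA.
print(lora_alpha / r)             # 1.0
print(lora_alpha / math.sqrt(r))  # 4.0
```

The last two lines show what `use_rslora=True` changes: the adapter output is scaled by `lora_alpha / sqrt(r)` instead of `lora_alpha / r`. At `r=16` and `lora_alpha=16` that is a factor of 4 rather than 1.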
What to expect
| Hardware | Dataset size | Approx. time |
|---|---|---|
| Colab T4 (16 GB) | 500–2,000 examples | 20–60 min |
| Colab T4 (16 GB) | 100k examples | ~47 hr |
| A100 40 GB | 100k examples | ~4 hr 45 min |
| RTX 3080 (10 GB) | 500–2,000 examples | 30–90 min |
First run is slow — torch.compile warmup takes up to 5 minutes. Measure throughput only after the first few steps stabilize.
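Failure mode (OOM): Reduce per_device_train_batch_size to 1 and increase gradient_accumulation_steps to keep the effective batch constant. If the OOM happens during evaluation rather than training, set fp16_full_eval=True and eval_accumulation_steps=4 instead; maximum_memory_usage is a save-time option (see the GGUF export section) and has no effect on evaluation.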
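Keeping the effective batch constant is simple arithmetic: effective batch = per-device batch × accumulation steps × number of GPUs. A small helper makes the trade explicit. The function name is hypothetical and the single-GPU default is an assumption that matches this tutorial.

```python
def accumulation_steps_for(effective_batch, per_device_batch, num_gpus=1):
    """Gradient accumulation steps needed to reach effective_batch exactly."""
    per_step = per_device_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective batch must be a multiple of per-device batch x GPUs")
    return effective_batch // per_step


print(accumulation_steps_for(8, 2))  # 4, the config used in the training run
print(accumulation_steps_for(8, 1))  # 8, the OOM fallback
```

Halving the per-device batch roughly halves activation memory per step but doubles the number of forward/backward passes per optimizer update, so expect each optimizer step to take longer.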
Evaluating the result
Watch the training loss in the logs. Healthy ranges:
| Loss value | What it means |
|---|---|
| 1.5–2.5 | Too high — model isn’t learning; check data formatting |
| 0.5–1.0 | Healthy range |
| 0.1–0.4 | Getting low — one epoch is likely enough |
| Near 0.0 | Overfitting — reduce epochs or increase dataset size |
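If you would rather not eyeball the console, the logged losses can be read back after training. The Hugging Face `Trainer` keeps them in `trainer.state.log_history`, a list of dicts in which training entries carry a `loss` key. The sketch below averages the last few and labels the result using the bands from the table. The helper names are made up, the 0.1 cutoff for "near 0.0" is an assumption, and values that fall between the table's bands are reported as such, not forced into one.

```python
def latest_training_loss(log_history, window=3):
    """Average the last `window` logged training losses, or None if there are none."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    if not losses:
        return None
    recent = losses[-window:]
    return sum(recent) / len(recent)


def describe_loss(loss):
    """Label a loss value using the bands from the table above."""
    if loss < 0.1:  # assumed cutoff for "near 0.0"
        return "near zero: likely overfitting"
    if 0.1 <= loss <= 0.4:
        return "getting low: one epoch is likely enough"
    if 0.5 <= loss <= 1.0:
        return "healthy"
    if 1.5 <= loss <= 2.5:
        return "too high: check data formatting"
    return "outside the table's bands"


# Stand-in for trainer.state.log_history; the last entry mimics the end-of-run summary.
example_history = [
    {"loss": 1.8, "step": 10},
    {"loss": 1.0, "step": 20},
    {"loss": 0.8, "step": 30},
    {"loss": 0.6, "step": 40},
    {"train_runtime": 12.0},
]
print(describe_loss(latest_training_loss(example_history)))  # healthy
```

After a real run, `describe_loss(latest_training_loss(trainer.state.log_history))` gives the same one-line summary.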
After training, do a quick vibe-check before exporting:
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "Your test prompt here"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
If the output looks coherent and on-task, the model is ready to export. If it repeats itself or generates gibberish, check the chat template and loss curve before assuming the training failed.
Quantizing to GGUF for local inference
Unsloth exports directly to GGUF and writes the Ollama Modelfile automatically, handling chat template mapping for supported families including Llama-3, Mistral, Phi-3, and Gemma. Don’t write the Modelfile yourself — template mismatch is the most common post-export complaint.
# Q4_K_M: good quality/size balance, ~4.5 GB output
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")
Quantization options:
| Method | Output size | Quality |
|---|---|---|
| q8_0 | ~8 GB | Near-lossless; use for quality checks |
| q5_k_m | ~5.5 GB | Good midpoint |
| q4_k_m | ~4.5 GB | Recommended for production |
Failure mode — OOM during save: Pass maximum_memory_usage=0.5 to save_pretrained_gguf. The default is 0.75, which can push an 8 GB card over the edge.
Once the GGUF files are written:
# Unsloth wrote model_gguf/Modelfile — use it directly
ollama create my-fine-tuned-model -f model_gguf/Modelfile
ollama run my-fine-tuned-model
Test with the same prompt you used during the vibe-check. If the output matches what you saw in Python, the export is clean.
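To script that comparison without the interactive prompt, you can call Ollama's local REST API. This is a minimal sketch, assuming Ollama is serving on its default port 11434 and that the model was created under the name used above. `build_chat_request` and `ask` are illustrative helpers, not part of any library.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(model, prompt):
    """Build a non-streaming chat request for a locally running Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=120):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `print(ask("my-fine-tuned-model", "Your test prompt here"))` prints the reply. Sampling is on in both places, so compare format and tone, not exact wording.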
For a deeper look at running GGUF models day-to-day (Ollama vs LM Studio throughput, memory efficiency, and API compatibility), see “Ollama vs LM Studio on Mac: which survives daily use?”
What to do if it doesn’t work
OOM during training
- Set per_device_train_batch_size=1 and raise gradient_accumulation_steps to compensate.
- Reduce max_seq_length to 1024.
- Set fp16_full_eval=True and add eval_accumulation_steps=4.
Loss stuck at 0 (not NaN — just flat at 0.0)
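Cause: train_on_responses_only is masking everything because the delimiters you passed don’t match the chat template the data was formatted with.
Fix: pass the delimiters of the template you actually applied. The training run in this tutorial formats data with the ChatML template, so the markers are the ChatML ones:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
If you format with the native Llama 3.1 template instead, use instruction_part="<|start_header_id|>user<|end_header_id|>\n\n" and response_part="<|start_header_id|>assistant<|end_header_id|>\n\n".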
If you’re not using train_on_responses_only, check that your JSONL has "from": "gpt" (not "from": "assistant") — the ShareGPT template maps gpt to the assistant role, not the literal string assistant.
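To see why a delimiter mismatch silences training, it helps to look at what response-only masking does to the labels. The toy sketch below works on made-up integer token ids and is not Unsloth's implementation. It only illustrates the idea that every position outside an assistant span gets the label -100, the value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_all(tokens, pattern):
    """Start indices of every occurrence of `pattern` (a token list) in `tokens`."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == pattern]


def mask_non_responses(tokens, instruction_part, response_part):
    """Copy tokens to labels, keeping only the spans that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    instruction_starts = find_all(tokens, instruction_part)
    for start in find_all(tokens, response_part):
        begin = start + len(response_part)
        # The response runs until the next instruction marker, or the end of the sequence.
        end = min((i for i in instruction_starts if i >= begin), default=len(tokens))
        labels[begin:end] = tokens[begin:end]
    return labels


USER, ASSISTANT = 1, 2  # hypothetical marker ids
tokens = [USER, 10, 11, ASSISTANT, 20, 21, USER, 12, ASSISTANT, 22]

print(mask_non_responses(tokens, [USER], [ASSISTANT]))
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]

print(mask_non_responses(tokens, [USER], [99]))  # wrong response marker
# [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
```

In the second call no response marker is found, so every label is ignored and nothing is left to train on, which is the situation this section describes.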
Loss NaN
- Learning rate too high: reduce learning_rate from 3e-4 to 1e-4.
- Malformed JSONL: a single line with the wrong structure (missing conversations key, mixed templates) can NaN the loss. Validate the dataset. Rows that lack the key load as None, so test the value, not the key: assert all(ex.get("conversations") for ex in dataset).
- Gradient explosion: add max_grad_norm=0.3 to TrainingArguments (default is 1.0).
Catastrophic forgetting
Symptom: the model only outputs text that looks like your training data, losing general capability.
Mitigations:
- Lower the rank: r=8 instead of r=16.
- Fewer epochs: 1 is usually enough.
- Mix in 10–20% general-domain examples from a dataset like mlabonne/FineTome-100k.
- Add a small dropout: lora_dropout=0.05.
LoRA is inherently more resistant to catastrophic forgetting than full fine-tuning, but it’s not immune with small, homogeneous datasets.
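The mixing mitigation takes a little arithmetic to get right. The sketch below reads "10–20%" as the share of the final mixed dataset, which is an interpretation and not something the guidance above spells out. It works on plain Python lists, the helper names are hypothetical, and the general-domain examples are assumed to be already converted to the same ShareGPT format.

```python
import random


def general_examples_needed(n_task, general_share):
    """How many general examples make up `general_share` of the final mixed dataset."""
    if not 0 <= general_share < 1:
        raise ValueError("general_share must be in [0, 1)")
    return round(n_task * general_share / (1 - general_share))


def mix_datasets(task_examples, general_examples, general_share=0.15, seed=0):
    """Return a shuffled list in which roughly `general_share` of items are general-domain."""
    rng = random.Random(seed)
    wanted = min(general_examples_needed(len(task_examples), general_share), len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), wanted)
    rng.shuffle(mixed)
    return mixed


print(general_examples_needed(900, 0.10))  # 100 -> 1,000 examples total
print(general_examples_needed(800, 0.20))  # 200 -> 1,000 examples total
```

Note that adding 10% of the task count is not the same thing: 900 task examples plus 90 general ones gives a 9.1% share, which is why the helper divides by `1 - general_share`.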
TRL 0.23+ and TrainingArguments
SFTConfig (from trl) has been the recommended config class for SFTTrainer since TRL 0.9. It handles SFT-specific arguments that were previously passed directly to SFTTrainer.__init__. transformers.TrainingArguments still imports and runs, but it misses those SFT-specific settings. If you’re on TRL 0.23 or later, switch to:
# trl >= 0.23
from trl import SFTConfig
args = SFTConfig(
    learning_rate=3e-4,
    per_device_train_batch_size=2,
    # ... rest of args unchanged
    output_dir="outputs",
)
The rest of the trainer setup stays the same. If you want to pin the older API: pip install "trl<0.23".
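If the same script has to run in environments with different TRL versions, the choice can be made at runtime. This is a sketch that uses only the standard library. The 0.23 cutoff is the one this article uses, and the helper names are made up. Versions are compared as integer tuples because a plain string comparison would rank "0.9" above "0.23".

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string):
    """Parse '0.23.1' or '0.24.0.dev0' into an integer tuple such as (0, 23)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def should_use_sftconfig(trl_version):
    """True when TRL is 0.23 or later, the cutoff this article uses."""
    return major_minor(trl_version) >= (0, 23)


def installed_trl_version():
    """Installed TRL version string, or None if TRL is not installed."""
    try:
        return version("trl")
    except PackageNotFoundError:
        return None


print(should_use_sftconfig("0.18.2"))  # False
print(should_use_sftconfig("0.23.0"))  # True
```

In a training script you would branch on `should_use_sftconfig(installed_trl_version())` and import either `SFTConfig` or `TrainingArguments` accordingly.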
Chat template mismatch after GGUF export
Symptom: model generates infinite repetition or gibberish after ollama run.
Fix: always use the Modelfile Unsloth writes to model_gguf/Modelfile. Don’t create a custom one. If you already created one manually, delete it and re-run save_pretrained_gguf.
CUDA compilation errors
Set UNSLOTH_COMPILE_DISABLE=1 before starting training. A separate issue, slow downloads stalling at 90–95%, has its own switch: UNSLOTH_STABLE_DOWNLOADS=1. Both are environment variables. Set them in your shell, or at the top of your script before importing unsloth, with os.environ["UNSLOTH_COMPILE_DISABLE"] = "1".
macOS Apple Silicon
Unsloth’s Python training API doesn’t support MLX yet (as of May 2026). Inference on GGUF models works fine via Ollama. For training on an M-series Mac, use Axolotl with MPS backend directly — it’s slower and uses more RAM than CUDA, but it works.
References
- Unsloth v0.1.39-beta — source, install script, compatibility matrix
- PEFT v0.19.1 — underlying LoRA engine
- VRAM requirements — per-model minimums and context length benchmarks
- Unsloth benchmarks — 2× speed, >70% less VRAM methodology
- mlabonne canonical notebook — Llama 3.1 8B / Unsloth / T4
- mlabonne HuggingFace post — walkthrough for the canonical notebook
- LoRA hyperparameters guide — rank, alpha, dropout tradeoffs
- Unsloth → Ollama tutorial — GGUF export and Modelfile
- Ollama import guide — custom GGUF import workflow
- Unsloth troubleshooting — OOM, NaN loss, CUDA errors
- Dataset size empirical guide — minimum examples per task type