PEFT in a nutshell
Parameter-Efficient Fine-Tuning (PEFT) is an umbrella term for techniques that let you adapt a large pretrained model to a new task without updating (or even storing) all of its weights.
Instead of fine-tuning billions of parameters, you train a tiny subset (often < 1 %) or add small, trainable “delta” modules while freezing the original backbone.
Why it matters
| Challenge with full fine-tuning | How PEFT helps |
| --- | --- |
| GPU memory & time — all weights need gradients. | Train only a sliver of weights → 10-100× lower memory & FLOPs. |
| Storage — a 7 B-parameter model saved in FP16 ≈ 14 GB per task. | Store just the deltas (tens of MB) → economical to ship many task-specific versions. |
| Catastrophic forgetting when you re-train the whole model for every new task. | Backbone is frozen → each task’s adapters are independent. |
Core PEFT families
| Family | Key idea | Typical trainable % | Notes |
| --- | --- | --- | --- |
| Adapters (Houlsby ’19, Pfeiffer ’20) | Insert small bottleneck MLPs between transformer sub-layers. | 1-3 % | Easy to stack/compose; slight latency hit (sketched below). |
| Prefix / Prompt / P-Tuning | Learn virtual tokens (prompts) or key–value prefixes that steer attention. | < 0.1 % | No extra latency; works well for seq-to-seq & GPT-style models. |
| LoRA (Hu et al. ’22) | Decompose the weight update ΔW into two low-rank matrices A · Bᵀ (rank r ≪ d). | 0.02-0.4 % | Adds two small linear layers in parallel; mergeable after training. |
| IA³ (Liu et al. ’22) | Scale existing query/key/value & MLP activations with learned vectors. | 0.01-0.05 % | Almost zero latency overhead. |
| BitFit / Bias-Tune | Update only bias terms (or layer norms). | 0.01 % | Surprisingly strong baseline for classification. |
| QLoRA (Dettmers ’23) | Quantize the backbone to 4-bit and apply LoRA on top. | 0.02-0.2 % | Runs 65 B models on a single 48 GB GPU. |
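To make the first row concrete, here is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter; the class name, dimensions, and zero-initialised up-projection are illustrative assumptions rather than any library's exact implementation:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project → nonlinearity → up-project, added residually after a frozen sub-layer."""
    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # d_model → r
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)     # r → d_model
        nn.init.zeros_(self.up.weight)               # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

Only these adapter modules are trained; the surrounding transformer layers stay frozen.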
Rule of thumb: The larger the base model, the smaller the % of parameters you need to reach full-fine-tune quality.
How LoRA works (quick intuition)
1. Freeze the original weight matrix W₀ (e.g., 4096 × 4096).
2. Add ΔW = A · Bᵀ, where A and B are both 4096 × r and r is tiny (e.g., 8); see the sketch below.
3. Only A and B receive gradients. At inference you can either
   - keep the extra matrices (no merge, slightly slower), or
   - add ΔW back into W₀ and discard them (no overhead).
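A minimal PyTorch sketch of that parallel low-rank path (the class name, initialization, and alpha/r scaling are illustrative choices, not the peft library's internal implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank update ΔW = A · Bᵀ, scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # W₀ stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)   # d_out × r
        self.B = nn.Parameter(torch.zeros(d_in, r))           # d_in × r, zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = (self.A @ self.B.T) * self.scale             # ΔW has rank ≤ r
        return self.base(x) + x @ delta_w.T

    def merge(self) -> nn.Linear:
        """Fold ΔW into W₀ so inference carries no extra matrices."""
        with torch.no_grad():
            self.base.weight += (self.A @ self.B.T) * self.scale
        return self.base

# Wrapping one 4096 × 4096 projection with r = 8 adds 2 × 4096 × 8 ≈ 65 k trainable
# parameters next to the ~16.8 M frozen ones in the base matrix.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
```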
Typical numbers
For a 7 B-parameter Llama-2 model (≈ 6.7 B weights):
| Method | Trainable params | Disk after training |
| --- | --- | --- |
| Full fine-tune | 6.7 B | 13.4 GB (FP16) |
| LoRA r = 16 | 9 M (counted below) | 36 MB |
| Prefix-tuning (128 tokens) | 3 M | 12 MB |
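As a back-of-the-envelope check on the LoRA row (assuming LoRA is applied to the q_proj and v_proj matrices of all 32 layers, each 4096 × 4096, and that the adapter file is stored in FP32; both are assumptions about the setup behind the table):

```python
hidden, layers, r = 4096, 32, 16
modules_per_layer = 2                      # q_proj and v_proj
params_per_module = 2 * hidden * r         # A (4096 × r) plus B (4096 × r)
lora_params = layers * modules_per_layer * params_per_module
print(f"{lora_params:,}")                  # 8,388,608 ≈ the 9 M in the table
print(f"{lora_params * 4 / 1e6:.1f} MB")   # ≈ 33.6 MB if saved in FP32, close to the 36 MB above
```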
Where to start in practice
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_cfg = LoraConfig(
    r=16,                                  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # typical for attention
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # sanity-check: well under 1 % trainable
```
Train normally with your preferred trainer; when you save the model, only the small PEFT adapter weights are written to disk. At inference:
```python
merged = model.merge_and_unload()   # optional: collapses LoRA into the base weights
```
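If you skip the merge, the saved adapter can instead be re-attached to a fresh copy of the base model at load time (the adapter directory below is a placeholder):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "path/to/my-lora-adapter")  # hypothetical adapter directory
```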
Caveats & tips
- Still a full forward pass: activation memory scales with the frozen backbone, so batch size is limited by the base model's size.
- Task mismatch: for highly structured tasks (e.g., code generation or translation) you may need a larger rank or multiple adapters.
- Interference: to run many tasks simultaneously, use “adapter fusion” or multi-LoRA mixing (see the sketch after this list).
- Evaluation: always compare to a full fine-tune and to zero-shot prompting to ensure the PEFT gain is worth the extra training.
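A common way to serve several tasks from one frozen backbone is to keep one named adapter per task and switch between them per request; a sketch using peft's named-adapter API, where the adapter directories and names are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")  # placeholder path
model.load_adapter("adapters/sql", adapter_name="sql")    # second task shares the same frozen backbone
model.set_adapter("sql")                                  # pick the adapter for the current request
```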
Take-away
PEFT gives you 80-100 % of full fine-tuning performance for pennies in compute and storage. In 2025 it has become the default approach for customizing large language and vision transformers on laptops, consumer GPUs, and edge devices.