PEFT in a nutshell
Parameter-Efficient Fine-Tuning (PEFT) is an umbrella term for techniques that let you adapt a large pretrained model to a new task without updating (or even storing) all of its weights.
Instead of fine-tuning billions of parameters, you train a tiny subset (often < 1 %) or add small, trainable “delta” modules while freezing the original backbone.
Why it matters
| Challenge with full fine-tuning | How PEFT helps |
| --- | --- |
| GPU memory & time — all weights need gradients. | Train only a sliver of weights → 10-100× lower memory & FLOPs. |
| Storage — a 7 B-parameter model saved in FP16 ≈ 14 GB per task. | Store just the deltas (tens of MB) → economical to ship many task-specific versions. |
| Catastrophic forgetting when you re-train the whole model for every new task. | Backbone is frozen → each task’s adapters are independent. |
Core PEFT families
| Family | Key idea | Typical trainable % | Notes |
| --- | --- | --- | --- |
| Adapters (Houlsby ’19, Pfeiffer ’20) | Insert small bottleneck MLPs between transformer sub-layers. | 1-3 % | Easy to stack/compose; slight latency hit (sketched below). |
| Prefix / Prompt / P-Tuning | Learn virtual tokens (prompts) or key–value prefixes that steer attention. | < 0.1 % | No extra latency; works well for seq-to-seq & GPT-style models. |
| LoRA (Hu et al. ’22) | Decompose the weight update ΔW into two low-rank matrices A · Bᵀ (rank r ≪ d). | 0.02-0.4 % | Adds two small linear layers in parallel; mergeable after training. |
| IA³ (Liu et al. ’22) | Scale existing query/key/value & MLP activations with learned vectors. | 0.01-0.05 % | Almost zero latency overhead. |
| BitFit / Bias-Tune | Update only bias terms (or layer norms). | 0.01 % | Surprisingly strong baseline for classification. |
| QLoRA (Dettmers ’23) | Quantize the backbone to 4-bit and apply LoRA on top. | 0.02-0.2 % | Runs 65 B models on a single 48 GB GPU. |
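To make the first row concrete, here is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter; the class name, dimensions, and zero-initialised up-projection are illustrative assumptions rather than any library's exact implementation:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project → nonlinearity → up-project, added residually after a frozen sub-layer."""
    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # d_model → r
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)     # r → d_model
        nn.init.zeros_(self.up.weight)               # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

Only these adapter modules are trained; the surrounding transformer layers stay frozen.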
Rule of thumb: The larger the base model, the smaller the % of parameters you need to reach full-fine-tune quality.
How LoRA works (quick intuition)
1. Freeze the original weight matrix W₀ (e.g., 4096 × 4096).
2. Add ΔW = A · Bᵀ, where A and B are both 4096 × r and r is tiny (e.g., 8); see the sketch below.
3. Only A and B receive gradients. At inference you can either
   - keep the extra matrices (no merge, slightly slower), or
   - add ΔW back into W₀ and discard them (no overhead).
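A minimal PyTorch sketch of that parallel low-rank path (the class name, initialization, and alpha/r scaling are illustrative choices, not the peft library's internal implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base projection plus a trainable low-rank update ΔW = A · Bᵀ, scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # W₀ stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)   # d_out × r
        self.B = nn.Parameter(torch.zeros(d_in, r))           # d_in × r, zero-init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = (self.A @ self.B.T) * self.scale             # ΔW has rank ≤ r
        return self.base(x) + x @ delta_w.T

    def merge(self) -> nn.Linear:
        """Fold ΔW into W₀ so inference carries no extra matrices."""
        with torch.no_grad():
            self.base.weight += (self.A @ self.B.T) * self.scale
        return self.base

# Wrapping one 4096 × 4096 projection with r = 8 adds 2 × 4096 × 8 ≈ 65 k trainable
# parameters next to the ~16.8 M frozen ones in the base matrix.
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
```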
Typical numbers
For a 7 B-parameter Llama-2 model (≈ 6.7 B weights):
| Method | Trainable params | Disk after training |
| --- | --- | --- |
| Full fine-tune | 6.7 B | 13.4 GB (FP16) |
| LoRA r = 16 | 9 M (counted below) | 36 MB |
| Prefix-tuning (128 tokens) | 3 M | 12 MB |
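As a back-of-the-envelope check on the LoRA row (assuming LoRA is applied to the q_proj and v_proj matrices of all 32 layers, each 4096 × 4096, and that the adapter file is stored in FP32; both are assumptions about the setup behind the table):

```python
hidden, layers, r = 4096, 32, 16
modules_per_layer = 2                      # q_proj and v_proj
params_per_module = 2 * hidden * r         # A (4096 × r) plus B (4096 × r)
lora_params = layers * modules_per_layer * params_per_module
print(f"{lora_params:,}")                  # 8,388,608 ≈ the 9 M in the table
print(f"{lora_params * 4 / 1e6:.1f} MB")   # ≈ 33.6 MB if saved in FP32, close to the 36 MB above
```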
Where to start in practice
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_cfg = LoraConfig(
    r=16,                                  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # typical for attention
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # sanity-check: well under 1 % trainable
```
Train normally with your preferred trainer; when you save the model, only the small PEFT adapter weights are written to disk. At inference:
```python
merged = model.merge_and_unload()   # optional: collapses LoRA into the base weights
```
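If you skip the merge, the saved adapter can instead be re-attached to a fresh copy of the base model at load time (the adapter directory below is a placeholder):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "path/to/my-lora-adapter")  # hypothetical adapter directory
```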
Caveats & tips
- Still a full forward pass: activation memory scales with the frozen backbone, so batch size is limited by the base model's size.
- Task mismatch: for highly structured tasks (e.g., code generation or translation) you may need a larger rank or multiple adapters.
- Interference: to run many tasks simultaneously, use “adapter fusion” or multi-LoRA mixing (see the sketch after this list).
- Evaluation: always compare to a full fine-tune and to zero-shot prompting to ensure the PEFT gain is worth the extra training.
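A common way to serve several tasks from one frozen backbone is to keep one named adapter per task and switch between them per request; a sketch using peft's named-adapter API, where the adapter directories and names are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "adapters/summarize", adapter_name="summarize")  # placeholder path
model.load_adapter("adapters/sql", adapter_name="sql")    # second task shares the same frozen backbone
model.set_adapter("sql")                                  # pick the adapter for the current request
```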
Take-away
PEFT gives you 80-100 % of full fine-tuning performance for pennies in compute and storage. In 2025 it has become the default approach for customizing large language and vision transformers on laptops, consumer GPUs, and edge devices.