PEFT in a nutshell

Parameter-Efficient Fine-Tuning (PEFT) is an umbrella term for techniques that let you adapt a large pretrained model to a new task without updating (or even storing) all of its weights.

Instead of fine-tuning billions of parameters, you train a tiny subset (often < 1 %) or add small, trainable “delta” modules while freezing the original backbone.
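
In code, the recipe amounts to freezing the backbone and leaving only a small added module trainable; a minimal PyTorch sketch (the helper names here are illustrative):

import torch.nn as nn

def freeze_backbone(backbone: nn.Module) -> None:
    # the pretrained weights receive no gradients; only added modules stay trainable
    for p in backbone.parameters():
        p.requires_grad = False

def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)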

Why it matters

Challenge with full fine-tuning | How PEFT helps
GPU memory & time — all weights need gradients. | Train only a sliver of weights → 10-100 × lower RAM & FLOPs.
Storage — a 7 B-parameter model saved in FP16 ≈ 14 GB per task. | Store just the deltas (tens of MB) → economical to ship many task-specific versions.
Catastrophic forgetting when you re-train the whole model for every new task. | Backbone is frozen → each task’s adapters are independent.
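
The storage row is simple arithmetic (the ~9 M adapter size is an assumed LoRA-style delta, matching the numbers later in this section):

full_fp16_gb = 7e9 * 2 / 1e9   # 7 B params × 2 bytes (FP16) ≈ 14 GB per task
adapter_mb = 9e6 * 4 / 1e6     # ~9 M delta params × 4 bytes (FP32) ≈ 36 MB per task
print(f"{full_fp16_gb:.0f} GB vs {adapter_mb:.0f} MB")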

Core PEFT families

Family | Key idea | Typical trainable % | Notes
Adapters (Houlsby ’19, Pfeiffer ’20) | Insert small bottleneck MLPs between transformer sub-layers. | 1-3 % | Easy to stack/compose; slight latency hit.
Prefix / Prompt / P-Tuning | Learn virtual tokens (prompts) or key–value prefixes that steer attention. | < 0.1 % | No extra latency; works well for seq-to-seq & GPT-style models.
LoRA (Hu et al. ’22) | Decompose the weight update ΔW into a product of two low-rank matrices, A · Bᵀ (rank r ≪ d). | 0.02-0.4 % | Adds two small linear layers in parallel; mergeable after training.
IA³ (Liu et al. ’22) | Scale existing query/key/value & MLP activations with learned vectors. | 0.01-0.05 % | Almost zero latency overhead.
BitFit / Bias-Tune | Update only bias terms (or layer norms). | 0.01 % | Surprisingly strong baseline for classification.
QLoRA (Dettmers ’23) | Quantize the backbone to 4-bit and apply LoRA on top. | 0.02-0.2 % | Runs 65 B models on a single 48 GB GPU.
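
To make the Adapters row concrete, here is a minimal bottleneck-adapter sketch in PyTorch; the dimensions and placement are illustrative rather than the exact Houlsby or Pfeiffer configuration:

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project → nonlinearity → up-project, added residually after a sub-layer."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping for stable training
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))   # residual connection around the bottleneck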

Rule of thumb: The larger the base model, the smaller the % of parameters you need to reach full-fine-tune quality.

How LoRA works (quick intuition)

  1. Freeze the original weight matrix W₀ (e.g., 4096 × 4096).

  2. Add ΔW = A · Bᵀ, where A is 4096 × r, B is 4096 × r, and r is tiny (e.g., 8).

  3. Only A and B receive gradients (see the sketch after this list). At inference you can either

    • keep the extra matrices (no merge, slightly slower) or

    • add ΔW back into W₀ and discard them (no overhead).
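
A minimal LoRA linear layer that mirrors these three steps (a sketch for intuition, not the peft library's internal implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # step 1: freeze W₀
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, r))  # step 2: ΔW = A · Bᵀ, rank r
        self.B = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.scale = alpha / r                        # zero-init A keeps ΔW = 0 at the start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # step 3: only A and B get gradients; the low-rank path runs in parallel with W₀
        return self.base(x) + (x @ self.B) @ self.A.T * self.scale

    @torch.no_grad()
    def merge(self) -> None:
        # optional: fold ΔW back into W₀ so inference has no extra layers
        self.base.weight += self.scale * (self.A @ self.B.T)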

Typical numbers

For a 7 B Llama-2 model (≈ 6.7 B trainable weights):

Method | Trainable params | Disk after training
Full fine-tune | 6.7 B | 13.4 GB (FP16)
LoRA r = 16 | 9 M | 36 MB
Prefix-tuning (128 tokens) | 3 M | 12 MB
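
Quick arithmetic behind the LoRA row, assuming LoRA on the q and v projections of a 32-layer, 4096-dim model (the targeted modules are an assumption; the exact count depends on which layers you adapt):

layers, d, r = 32, 4096, 16
projections = 2                                  # q_proj and v_proj
lora_params = layers * projections * 2 * d * r   # two factors (A and B) per projection ≈ 8.4 M
disk_mb = lora_params * 4 / 1e6                  # FP32 adapter file ≈ 34 MB, in line with the table
print(lora_params, round(disk_mb), "MB")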

Where to start in practice

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_cfg = LoraConfig(
    r=16,             # rank
    lora_alpha=32,
    target_modules=["q_proj","v_proj"],  # typical for attention
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",   # standard setting for decoder-only LMs
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # sanity-check: only a tiny fraction (well under 1 %) is trainable

Train normally with your preferred trainer; saving the PEFT model writes only the small adapter weights, not the full backbone. At inference:

merged = model.merge_and_unload()  # optional: collapses LoRA into base
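
If you keep the adapter separate instead of merging, saving and re-loading follow the standard peft pattern (the adapter path below is a placeholder):

from transformers import AutoModelForCausalLM
from peft import PeftModel

model.save_pretrained("./my-task-adapter")   # writes only the small adapter files
# later, or on another machine:
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = PeftModel.from_pretrained(base, "./my-task-adapter")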

Caveats & tips

  • You still run the full forward pass through the frozen backbone, so activation memory (and hence batch size) remains limited by the base model's size.

  • Task mismatch: for tasks far from the pretraining distribution (e.g., code generation or translation) you may need a larger rank or multiple adapters.

  • Interference: to run many tasks simultaneously, use “adapter fusion” or multi-LoRA mixing (see the sketch after this list).

  • Evaluation: Always compare to full fine-tune and to zero-shot prompting to ensure the PEFT gain is worth the extra training.
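
For the interference point, peft lets one frozen backbone hold several adapters and switch between them; the adapter names and paths here are hypothetical:

model.load_adapter("./summarization-adapter", adapter_name="summarize")
model.load_adapter("./sql-adapter", adapter_name="sql")
model.set_adapter("sql")   # route the forward pass through this task's LoRA weights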


Take-away

PEFT gives you 80-100 % of full fine-tuning performance for pennies in compute and storage. In 2025 it has become the default approach for customizing large language and vision transformers on laptops, consumer GPUs, and edge devices.