Contents
  1. What Fine-Tuning Is
  2. The Two-Phase Architecture
  3. Three Strategies
  4. LoRA: Low-Rank Adaptation
  5. The Loss Function and Gradient Flow
  6. Data Format
  7. Key Hyperparameters
  8. Hardware Requirements

Fine-Tuning: Adapting Pretrained Models to New Tasks

Fine-tuning takes a pretrained model and continues training on a smaller, task-specific dataset. This post covers the architecture of adaptation: what changes, what stays frozen, and how parameter-efficient methods like LoRA reduce the cost.

Figure: Fine-tuning architecture. Phase 1 pretrains on a large corpus and produces frozen weights $W$. Phase 2 inserts trainable adapter layers $\Delta W$ and trains only those. Gradients flow only through $\Delta W$.

What Fine-Tuning Is

Pretraining is the process of training a model on a very large, general-purpose corpus. The model learns syntax, semantics, reasoning patterns, and broad world knowledge. This is expensive: it requires large hardware clusters and many weeks of compute.

Fine-tuning takes that pretrained model and continues training for a shorter period on a smaller, curated dataset that is specific to a target task. The result is a model that retains the general capabilities of the pretrained backbone while being adapted to the domain or task at hand.

The key ratio: pretraining datasets are measured in trillions of tokens. Fine-tuning datasets are measured in thousands to millions of examples. The difference in scale is what makes fine-tuning practical.

The Two-Phase Architecture

Phase 1: Pretraining. A transformer backbone is trained on a large corpus using a self-supervised objective (next-token prediction, masked language modelling, etc.). The output is a set of pretrained weights $W$ that encode general representations.

Phase 2: Fine-tuning. The pretrained weights $W$ are loaded. A task-specific dataset of input-output pairs is prepared. Training resumes, but now only certain parts of the model are updated.

The critical structural decision in Phase 2 is: which weights get updated?

Three Strategies

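Full fine-tuning. Every parameter in the model is updated during training. This gives the model maximum flexibility to adapt but requires storing gradients and optimiser states for all parameters. For large models this is memory-intensive: with an Adam-style optimiser, every parameter carries a gradient and two optimiser-state values on top of the weight itself.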

Parameter-efficient fine-tuning (PEFT). The backbone weights $W$ are frozen. A small set of additional parameters $\Delta W$ (adapter layers) is inserted and trained. Only $\Delta W$ receives gradient updates. The backbone $W$ is never modified. This dramatically reduces the number of trainable parameters and the memory required during training.

Prompt tuning. No model weights are updated at all. Instead, a small set of soft prompt vectors is prepended to the input and trained while the entire model remains frozen. This trains the fewest parameters of the three approaches but tends to underperform on tasks that require deeper adaptation.
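To make the difference between the three strategies concrete, here is a minimal sketch that tracks which parameter groups exist and which are trainable under each one. The group names and sizes are hypothetical, chosen only for illustration, and do not describe any particular model.

```python
# Hypothetical parameter groups and sizes, chosen only for illustration.
PARAM_GROUPS = {
    "backbone": 7_000_000_000,  # pretrained weights W
    "adapter": 20_000_000,      # inserted adapter layers (delta W)
    "soft_prompt": 80_000,      # trainable vectors prepended to the input
}

# Which groups are present, and which receive gradient updates, per strategy.
STRATEGIES = {
    "full": {"present": ["backbone"], "trainable": ["backbone"]},
    "peft": {"present": ["backbone", "adapter"], "trainable": ["adapter"]},
    "prompt_tuning": {
        "present": ["backbone", "soft_prompt"],
        "trainable": ["soft_prompt"],
    },
}


def trainable_fraction(strategy: str) -> float:
    """Return the share of all parameters that is updated under a strategy."""
    spec = STRATEGIES[strategy]
    total = sum(PARAM_GROUPS[g] for g in spec["present"])
    trainable = sum(PARAM_GROUPS[g] for g in spec["trainable"])
    return trainable / total


for name in STRATEGIES:
    print(f"{name:>14}: {trainable_fraction(name):.6%} of parameters trainable")
```

With these assumed sizes, the PEFT and prompt-tuning rows come out at a fraction of a percent, which is the whole point of freezing the backbone.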

LoRA: Low-Rank Adaptation

LoRA is one of the most widely used PEFT methods. The core idea: weight updates during fine-tuning tend to have low intrinsic rank. Instead of learning a full $d \times k$ update matrix, LoRA approximates it as the product of two smaller matrices:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

Only $A$ and $B$ are trained. $W$ is frozen. The total number of trainable parameters is $r(d + k)$ instead of $dk$. For a typical attention weight matrix with $d = k = 4096$ and $r = 16$, this reduces trainable parameters from $16{,}777{,}216$ to $131{,}072$, a reduction of over 99%.

LoRA is typically applied to the query, key, value, and output projection matrices of the attention layers. During inference, $\Delta W = BA$ can be merged back into $W$, so inference latency is unchanged.
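The merge claim and the parameter count can both be checked with a small dependency-free sketch. The matrix sizes are toy values and the helper names (`matmul`, `add`, `rand_matrix`, `lora_trainable`) are made up for this example. $B$ is filled with random values here only so the check is non-trivial; in practice LoRA usually starts $B$ at zero.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def lora_trainable(d, k, r):
    """Trainable parameters for one adapted d x k matrix: B is d x r, A is r x k."""
    return r * (d + k)


rng = random.Random(0)
d, k, r = 6, 5, 2  # toy sizes; the example in the text uses d = k = 4096

W = rand_matrix(d, k, rng)  # frozen pretrained weight, d x k
B = rand_matrix(d, r, rng)  # trainable, d x r
A = rand_matrix(r, k, rng)  # trainable, r x k
x = rand_matrix(k, 1, rng)  # one input column vector

# Adapter path used in training: W x + B (A x), never forming the d x k update.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path for inference: fold BA into W once, then do a single matmul.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_adapter, y_merged))
print(f"max |adapter - merged| = {max_diff:.2e}")
print(f"full: {4096 * 4096:,}  LoRA: {lora_trainable(4096, 4096, 16):,}")
```

The two paths agree up to floating-point rounding, which is why a merged model needs no extra work at inference time.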

The Loss Function and Gradient Flow

Fine-tuning uses the same loss as pretraining for generative tasks: cross-entropy over next-token predictions.

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

For classification tasks, cross-entropy over class labels is used instead.
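As a rough illustration of the next-token loss, the sketch below takes the probability the model assigned to each correct token and averages the negative logs. The function name and the probability values are invented for the example; a real training loop would get these probabilities from the model's softmax output.

```python
import math


def next_token_loss(token_probs):
    """Mean negative log-likelihood over a sequence.

    token_probs[t] is the probability the model assigned to the correct
    token at position t, i.e. p_theta(x_t | x_<t).
    """
    T = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / T


# Hypothetical per-token probabilities for a 4-token sequence.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.3, 0.2, 0.25, 0.1]

print(f"confident model: {next_token_loss(confident):.4f}")
print(f"uncertain model: {next_token_loss(uncertain):.4f}")
```

A model that puts high probability on the correct tokens gets a loss near zero; one that spreads probability elsewhere is penalised more heavily.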

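Gradients flow backward from the loss through the task head and the network's activations, but parameter gradients are computed only for $\Delta W$. The backward pass still has to pass through the frozen layers to reach adapters earlier in the network; because $W$ is frozen, though, no weight gradient is computed or stored for it. This is what makes PEFT memory-efficient: the optimiser only needs to maintain states for $\Delta W$.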
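The sketch below hand-derives the gradients for a single LoRA-adapted linear layer to show the same thing in miniature. It assumes a squared-error loss rather than cross-entropy to keep the algebra short, and all values and helper names are made up for illustration.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def forward(W, A, B, x):
    """Return y = W x + B (A x) and the intermediate h = A x."""
    h = matvec(A, x)
    y = [wx + bh for wx, bh in zip(matvec(W, x), matvec(B, h))]
    return y, h


def loss(y, target):
    # Squared error, used here only to keep the gradient algebra short.
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y, target))


# Toy sizes: d = 3 outputs, k = 3 inputs, rank r = 1.
W = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.2, 0.1, 0.6]]  # frozen
A = [[0.1, 0.2, -0.1]]     # r x k, trainable
B = [[0.0], [0.0], [0.0]]  # d x r, trainable, zero-initialised
x = [1.0, 2.0, -1.0]
target = [1.0, 0.0, -1.0]
lr = 0.1

W_before = [row[:] for row in W]
losses = []
for step in range(20):
    y, h = forward(W, A, B, x)
    losses.append(loss(y, target))
    dy = [a - b for a, b in zip(y, target)]      # dL/dy
    grad_B = outer(dy, h)                        # dL/dB = dy (A x)^T
    grad_A = outer(matvec(transpose(B), dy), x)  # dL/dA = (B^T dy) x^T
    # No gradient is formed for W; only A and B are updated.
    B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("W unchanged:", W == W_before)
```

The loss falls while `W` stays bit-for-bit identical, and the only arrays that would need optimiser state are `A` and `B`.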

Data Format

Fine-tuning requires structured prompt-response pairs. The format depends on whether the task is generative (produce a completion given a prompt) or discriminative (assign a label to an input). Both require explicit formatting so the model learns the expected input-output structure.

For instruction-following tasks, a common format is:

### Instruction
<task description>

### Input
<example input>

### Response
<expected output>

Consistency in formatting matters. The model will learn the format as part of the task.
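One way to keep the formatting consistent is to render every example through a single template function. The following is a minimal sketch; the function name, field names, and sample record are hypothetical.

```python
TEMPLATE = (
    "### Instruction\n{instruction}\n\n"
    "### Input\n{input}\n\n"
    "### Response\n{response}"
)


def format_example(instruction: str, input_text: str, response: str = "") -> str:
    """Render one example; leave response empty to build an inference prompt."""
    return TEMPLATE.format(
        instruction=instruction.strip(),
        input=input_text.strip(),
        response=response.strip(),
    )


# Hypothetical training record.
record = {
    "instruction": "Classify the sentiment of the review as positive or negative.",
    "input": "The battery died after two days.",
    "response": "negative",
}

text = format_example(record["instruction"], record["input"], record["response"])
print(text)
```

Using the same function for training data and for inference prompts (with an empty response) avoids a mismatch between what the model was trained on and what it sees later.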

Key Hyperparameters

| Hyperparameter | Typical starting value | Effect |
|---|---|---|
| Learning rate | $1 \times 10^{-4}$ | Most impactful; too high causes catastrophic forgetting |
| Epochs | 1-5 | More epochs risk overfitting on small datasets |
| Batch size | 4-32 | Affects gradient noise and memory |
| LoRA rank $r$ | 8-64 | Higher rank = more expressiveness, more parameters |
| LoRA alpha | $2r$ | Scales the adapter contribution |

The learning rate is the single most important hyperparameter. Fine-tuning typically uses a smaller learning rate than pretraining because the model already has useful representations, and a large learning rate can overwrite them.
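These starting values can be collected in a small config object, sketched below. The class and field names are illustrative, not taken from any library, and the defaults are picked from within the typical ranges. The `lora_scaling` property assumes the adapter output is multiplied by alpha divided by rank, as in the original LoRA formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Starting values within the typical ranges; all names are illustrative."""

    learning_rate: float = 1e-4
    epochs: int = 3
    batch_size: int = 8
    lora_rank: int = 16
    lora_alpha: int = 32  # 2 * lora_rank

    def __post_init__(self):
        if not 0 < self.learning_rate < 1:
            raise ValueError("learning_rate should be a small positive number")
        if self.lora_rank < 1:
            raise ValueError("lora_rank must be at least 1")

    @property
    def lora_scaling(self) -> float:
        # Assumption: the adapter output is scaled by alpha / r.
        return self.lora_alpha / self.lora_rank


cfg = FineTuneConfig()
print(cfg, "scaling =", cfg.lora_scaling)
```

Under that assumption, setting alpha to $2r$ keeps the scaling factor at 2 regardless of the rank chosen, so the rank can be changed without retuning the adapter's overall contribution.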

Hardware Requirements

Training and fine-tuning require parallel floating-point computation at scale. CPUs are not practical for this. GPUs and TPUs are standard. The key resource constraints are:

  • Memory: model weights, activations, gradients, and optimiser states must all fit in device memory simultaneously during training.
  • Compute: each forward and backward pass involves billions of multiply-accumulate operations.

PEFT methods like LoRA exist precisely to reduce the memory footprint, often to the point where fine-tuning is feasible on a single GPU.
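A back-of-the-envelope estimate shows how large the gap can be. The sketch below rests on assumptions that are not from this post: 16-bit weights, and for every trainable parameter a 16-bit gradient, a 32-bit master copy, and two 32-bit Adam moment estimates. Activations are left out, and the model and adapter sizes are hypothetical.

```python
GIB = 1024 ** 3


def training_state_bytes(total_params: int, trainable_params: int) -> int:
    """Rough memory for weights, gradients and Adam state; activations excluded.

    Assumptions: 2-byte weights for every parameter, and for each trainable
    parameter a 2-byte gradient, a 4-byte master copy and two 4-byte Adam
    moment estimates.
    """
    weights = 2 * total_params
    per_trainable = 2 + 4 + 4 + 4
    return weights + per_trainable * trainable_params


# Hypothetical 7-billion-parameter model with 20 million adapter parameters.
n_total = 7_000_000_000
n_adapter = 20_000_000

full = training_state_bytes(n_total, n_total)
lora = training_state_bytes(n_total + n_adapter, n_adapter)

print(f"full fine-tuning: {full / GIB:6.1f} GiB")
print(f"LoRA:             {lora / GIB:6.1f} GiB")
```

Under these assumptions the frozen weights dominate the LoRA budget, while full fine-tuning is dominated by gradients and optimiser state. Real numbers depend heavily on precision, optimiser, and activation memory.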
