Fine-Tuning: Adapting Pretrained Models to New Tasks

Fine-tuning architecture: Phase 1 pretrains on a large corpus and produces frozen weights W. Phase 2 inserts trainable adapter layers ΔW and trains only those. Gradients flow only through ΔW.

What Fine-Tuning Is

Pretraining is the process of training a model on a very large, general-purpose corpus. The model learns syntax, semantics, reasoning patterns, and broad world knowledge. This is expensive: it requires large hardware clusters and many weeks of compute.

Fine-tuning takes that pretrained model and continues training for a shorter period on a smaller, curated dataset that is specific to a target task. The result is a model that retains the general capabilities of the pretrained backbone while being adapted to the domain or task at hand.

The key contrast is scale: pretraining corpora are measured in trillions of tokens, whereas fine-tuning datasets typically contain thousands to millions of examples. That gap in scale is what makes fine-tuning practical.

The Two-Phase Architecture

Phase 1: Pretraining. A transformer backbone is trained on a large corpus using a self-supervised objective (next-token prediction, masked language modelling, etc.). The output is a set of pretrained weights $W$ that encode general representations.

Phase 2: Fine-tuning. The pretrained weights $W$ are loaded. A task-specific dataset of input-output pairs is prepared. Training resumes, and depending on the strategy chosen, either all of the model's parameters or only a selected subset are updated.

The critical structural decision in Phase 2 is: which weights get updated?

Three Strategies

Full fine-tuning. Every parameter in the model is updated during training. This gives the model maximum flexibility to adapt but requires storing gradients and optimiser states for all parameters. For large models, this is memory-intensive.

Parameter-efficient fine-tuning (PEFT). The backbone weights $W$ are frozen. A small set of additional parameters $\Delta W$ (adapter layers) is inserted and trained. Only $\Delta W$ receives gradient updates. The backbone $W$ is never modified. This dramatically reduces the number of trainable parameters and the memory required during training.

Prompt tuning. No model weights are updated at all. Instead, a small set of trainable soft prompt vectors is prepended to the input embeddings, and only those vectors are trained while the entire model remains frozen. This is the most parameter-efficient approach but tends to underperform on tasks that require deeper adaptation.
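The difference between the three strategies comes down to which parameters are marked trainable. The sketch below is a toy illustration with scalar parameters, not a real training loop; the names `Param`, `build_params`, and `sgd_step` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    trainable: bool = True


def build_params(strategy: str) -> dict:
    """Return a toy parameter set with trainable flags set per strategy."""
    params = {"backbone.w1": Param(0.5), "backbone.w2": Param(-0.3)}
    if strategy == "full":
        return params  # every backbone parameter stays trainable
    # PEFT and prompt tuning both freeze the backbone.
    for p in params.values():
        p.trainable = False
    if strategy == "peft":
        params["adapter.delta"] = Param(0.0)   # extra adapter parameter
    elif strategy == "prompt_tuning":
        params["soft_prompt.v"] = Param(0.1)   # trainable input vector
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return params


def sgd_step(params: dict, grads: dict, lr: float = 0.1) -> None:
    """Apply one SGD update, skipping every frozen parameter."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads.get(name, 0.0)


def trainable_names(params: dict) -> list:
    return sorted(name for name, p in params.items() if p.trainable)


peft_params = build_params("peft")
sgd_step(peft_params, {"backbone.w1": 1.0, "adapter.delta": 1.0})
print(trainable_names(peft_params), peft_params["backbone.w1"].value)
```

In real frameworks the same decision is usually expressed by disabling gradient tracking on the frozen tensors, so the update step never sees them.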

LoRA: Low-Rank Adaptation

LoRA is one of the most widely used PEFT methods. The core idea: weight updates during fine-tuning tend to have low intrinsic rank. Instead of learning a full $d \times k$ update matrix, LoRA approximates it as the product of two smaller matrices:

$$W' = W + \Delta W = W + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

Only $A$ and $B$ are trained. $W$ is frozen. The total number of trainable parameters is $r(d + k)$ instead of $dk$. For a typical attention weight matrix with $d = k = 4096$ and $r = 16$, this reduces trainable parameters from $16{,}777{,}216$ to $131{,}072$, a reduction of over 99%.
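The arithmetic above can be checked in a few lines. This is a minimal sketch; the function names are made up for the example.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense d x k update matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


d = k = 4096
r = 16
full = full_update_params(d, k)
lora = lora_params(d, k, r)
reduction = 1 - lora / full

print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters")
print(f"reduction:   {reduction:.2%}")
```

Because the count grows linearly in $r$, doubling the rank doubles the adapter size while the dense count stays fixed.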

LoRA is typically applied to the query, key, value, and output projection matrices of the attention layers. After training, $\Delta W = BA$ can be merged back into $W$, so the adapted model adds no inference latency compared with the original.
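Merging works because $Wx + B(Ax) = (W + BA)x$. The sketch below checks this on tiny hand-picked matrices using plain Python lists; the values and helper names are illustrative only, and a real implementation would use a tensor library.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]


# Toy shapes: d = 3, k = 2, r = 1.
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # frozen weight, d x k
B = [[0.5], [1.0], [-2.0]]                  # trained factor, d x r
A = [[2.0, 4.0]]                            # trained factor, r x k
x = [[1.0], [2.0]]                          # input column vector, k x 1

# Unmerged: keep the adapter as a separate branch at run time.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold BA into W once, then run a single matrix multiply.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

print(unmerged == merged)
```

Keeping the adapter unmerged is still useful when several adapters share one backbone, since each can be swapped without touching $W$.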

The Loss Function and Gradient Flow

For generative tasks, fine-tuning uses the same loss as pretraining: cross-entropy over next-token prediction.

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Here $T$ is the number of tokens in the sequence and $\theta$ denotes the model parameters.

For classification tasks, cross-entropy over class labels is used instead.
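Both losses reduce to averaging negative log-probabilities. As a minimal sketch, assuming the model's probabilities for the correct tokens (or classes) are already available as plain floats:

```python
import math


def next_token_loss(token_probs):
    """Mean negative log-likelihood.

    token_probs[t] is the probability the model assigned to the
    correct token x_t given the preceding tokens x_<t.
    """
    if not token_probs:
        raise ValueError("need at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def classification_loss(class_probs, label):
    """Cross-entropy for one example: -log p(correct class)."""
    return -math.log(class_probs[label])


# Three tokens predicted with probabilities 1/2, 1/4, 1/8.
gen_loss = next_token_loss([0.5, 0.25, 0.125])

# Three-class example where class 1 is correct.
cls_loss = classification_loss([0.1, 0.7, 0.2], label=1)

print(gen_loss, cls_loss)
```

In practice the loss is computed from logits with a numerically stable log-softmax rather than from probabilities directly.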

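Gradients flow backward from the loss through the task head and the network, but parameter gradients are computed and stored only for the trainable parameters $\Delta W$. Because $W$ is frozen, no weight gradient is computed for it, although activation gradients still pass through frozen layers to reach adapters earlier in the network. This is what makes PEFT memory-efficient: the optimiser only needs to maintain states for $\Delta W$.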
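A scalar toy version makes the gradient flow concrete. The sketch below assumes a one-weight "layer" $y = (w + ba)x$ with a squared-error loss instead of cross-entropy, purely to keep the derivatives short; all names and values are hypothetical.

```python
def train_adapter(w, a, b, x, target, lr=0.05, steps=200):
    """Gradient descent on the adapter scalars a and b; w stays frozen."""
    losses = []
    for _ in range(steps):
        y = (w + b * a) * x        # forward pass with W' = W + BA
        err = y - target           # dL/dy for L = 0.5 * (y - target)^2
        losses.append(0.5 * err ** 2)
        grad_b = err * a * x       # dL/db
        grad_a = err * b * x       # dL/da
        # dL/dw = err * x is deliberately never computed or stored.
        a -= lr * grad_a
        b -= lr * grad_b
    return w, a, b, losses


w, a, b, losses = train_adapter(w=1.0, a=0.5, b=0.0, x=2.0, target=3.0)
print(f"w = {w}, effective weight = {w + b * a:.4f}")
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.2e}")
```

Starting with `b = 0.0` means the adapter contributes nothing before training, so the first forward pass matches the frozen model exactly.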

Data Format

Fine-tuning requires structured prompt-response pairs. The format depends on whether the task is generative (produce a completion given a prompt) or discriminative (assign a label to an input). Both require explicit formatting so the model learns the expected input-output structure.

For instruction-following tasks, a common format is:

### Instruction
<task description>

### Input
<example input>

### Response
<expected output>

Consistency in formatting matters. The model will learn the format as part of the task.
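One way to keep the format consistent is to build every example through a single function. This is a minimal sketch of the template above; `format_example` is a hypothetical helper, not part of any library.

```python
def format_example(instruction: str, input_text: str, response: str = "") -> str:
    """Render one example in the Instruction / Input / Response layout.

    Leave `response` empty at inference time so the model completes it.
    """
    return (
        "### Instruction\n"
        f"{instruction}\n\n"
        "### Input\n"
        f"{input_text}\n\n"
        "### Response\n"
        f"{response}"
    )


train_text = format_example(
    instruction="Classify the sentiment as positive or negative.",
    input_text="The battery died after two days.",
    response="negative",
)
prompt_text = format_example(
    instruction="Classify the sentiment as positive or negative.",
    input_text="Setup took less than a minute.",
)
print(train_text)
```

Using the same function for training data and inference prompts avoids small mismatches, such as a missing blank line, that the model would otherwise have to absorb.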

Key Hyperparameters

Hyperparameter	Typical starting value	Effect
Learning rate	$1 \times 10^{-4}$	Most impactful; too high causes catastrophic forgetting
Epochs	1-5	More epochs risk overfitting on small datasets
Batch size	4-32	Affects gradient noise and memory
LoRA rank $r$	8-64	Higher rank = more expressiveness, more parameters
LoRA alpha	$2r$	Scales the adapter contribution (the update is commonly multiplied by $\alpha / r$)

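The learning rate is usually the single most important hyperparameter. Full fine-tuning typically uses a much smaller learning rate than pretraining because the model already has useful representations, and large learning rates will overwrite them. Adapter methods such as LoRA usually tolerate somewhat higher rates, such as the $1 \times 10^{-4}$ starting value in the table, because the backbone stays frozen.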
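The starting values from the table can be collected into one configuration object. This is a sketch only: the class and field names are hypothetical, the defaults are picked from inside the table's ranges, and the `alpha / r` scaling factor follows the common LoRA convention, which the text above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FineTuneConfig:
    learning_rate: float = 1e-4
    epochs: int = 3            # table range: 1-5
    batch_size: int = 8        # table range: 4-32
    lora_rank: int = 16        # table range: 8-64
    lora_alpha: Optional[float] = None

    def __post_init__(self):
        # Default alpha to 2r, the starting value given in the table.
        if self.lora_alpha is None:
            self.lora_alpha = 2 * self.lora_rank

    @property
    def lora_scaling(self) -> float:
        # Assumption: the adapter update BA is multiplied by alpha / r.
        return self.lora_alpha / self.lora_rank


cfg = FineTuneConfig()
print(cfg, cfg.lora_scaling)
```

Tying alpha to the rank by default keeps the effective scaling constant when the rank is changed, so rank sweeps do not silently change the update magnitude.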

Hardware Requirements

Training and fine-tuning require parallel floating-point computation at scale. CPUs are rarely practical for models of this size, so GPUs and TPUs are standard. The key resource constraints are:

Memory: activations, gradients, and optimiser states must all fit in device memory simultaneously during training.
Compute: each forward and backward pass involves billions of multiply-accumulate operations.

PEFT methods like LoRA exist precisely to reduce the memory footprint to the point where fine-tuning is feasible on a single GPU.
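A rough estimate shows where the saving comes from. The sketch below rests on stated assumptions rather than measurements: weights stored in 2 bytes each, a 4-byte gradient and 8 bytes of Adam-style optimiser state per trainable parameter, and activations ignored entirely. The 7-billion-parameter backbone and the adapter size (32 layers with four $4096 \times 4096$ projections at $r = 16$) are hypothetical.

```python
GIB = 1024 ** 3


def training_memory_bytes(total_params, trainable_params,
                          weight_bytes=2, grad_bytes=4, optimizer_bytes=8):
    """Weights for every parameter, plus gradient and optimiser state
    for the trainable ones only. Activations are not counted."""
    return (total_params * weight_bytes
            + trainable_params * (grad_bytes + optimizer_bytes))


backbone = 7_000_000_000                  # hypothetical model size
adapter = 32 * 4 * 16 * (4096 + 4096)     # layers * matrices * r(d + k)

full_ft = training_memory_bytes(backbone, backbone)
lora_ft = training_memory_bytes(backbone + adapter, adapter)

print(f"full fine-tuning: {full_ft / GIB:.1f} GiB")
print(f"LoRA:             {lora_ft / GIB:.1f} GiB")
```

Under these assumptions almost all of the LoRA footprint is the frozen weights themselves, which is why the gradient and optimiser terms are the part PEFT removes.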