Fine-Tuning: Adapting Pretrained Models to New Tasks
Fine-tuning takes a pretrained model and continues training on a smaller, task-specific dataset. This post covers the architecture of adaptation: what changes, what stays frozen, and how parameter-efficient methods like LoRA reduce the cost.
What Fine-Tuning Is
Pretraining is the process of training a model on a very large, general-purpose corpus. The model learns syntax, semantics, reasoning patterns, and broad world knowledge. This is expensive: it requires large hardware clusters and many weeks of compute.
Fine-tuning takes that pretrained model and continues training for a shorter period on a smaller, curated dataset that is specific to a target task. The result is a model that retains the general capabilities of the pretrained backbone while being adapted to the domain or task at hand.
The key ratio: pretraining datasets are measured in trillions of tokens. Fine-tuning datasets are measured in thousands to millions of examples. The difference in scale is what makes fine-tuning practical.
The Two-Phase Architecture
Phase 1: Pretraining. A transformer backbone is trained on a large corpus using a self-supervised objective (next-token prediction, masked language modelling, etc.). The output is a set of pretrained weights that encode general representations.
Phase 2: Fine-tuning. The pretrained weights are loaded. A task-specific dataset of input-output pairs is prepared. Training resumes, and depending on the strategy, either all of the model's parameters or only a chosen subset are updated.
The critical structural decision in Phase 2 is: which weights get updated?
Three Strategies
Full fine-tuning. Every parameter in the model is updated during training. This gives the model maximum flexibility to adapt but requires storing gradients and optimiser states for all parameters. For large models, this is memory-intensive.
Parameter-efficient fine-tuning (PEFT). The backbone weights are frozen. A small set of additional parameters (for example, adapter layers) is inserted and trained. Only these added parameters receive gradient updates; the backbone is never modified. This dramatically reduces the number of trainable parameters and the memory required during training.
Prompt tuning. No model weights are updated at all. Instead, a small set of soft prompt vectors is prepended to the input and trained while the entire model remains frozen. This is the most parameter-efficient approach but tends to underperform on tasks that require deeper adaptation.
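As a rough illustration of the three strategies, the sketch below labels which parameter groups receive updates under each one. The group names (`backbone`, `adapters`, `soft_prompt`) and the `trainable_groups` helper are hypothetical, chosen for this example rather than taken from any library.

```python
# Which parameter groups are updated under each strategy.
# Group names are illustrative and not tied to a specific library.
STRATEGIES = {
    "full": {"backbone": True},
    "peft": {"backbone": False, "adapters": True},
    "prompt_tuning": {"backbone": False, "soft_prompt": True},
}


def trainable_groups(strategy: str) -> list[str]:
    """Return the names of the parameter groups that receive gradient updates."""
    groups = STRATEGIES[strategy]
    return sorted(name for name, trainable in groups.items() if trainable)


for name in STRATEGIES:
    print(f"{name}: {trainable_groups(name)}")
```

In a real framework the same idea is usually expressed by toggling a per-parameter "requires gradient" flag, but the exact mechanism depends on the library in use.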
LoRA: Low-Rank Adaptation
LoRA is the most widely used PEFT method. The core idea: weight updates during fine-tuning tend to have low intrinsic rank. Instead of learning a full update matrix $\Delta W$ for a frozen pretrained weight $W_0$, LoRA approximates it as the product of two smaller matrices:
$$W = W_0 + \Delta W = W_0 + BA$$
where $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
Only $A$ and $B$ are trained. $W_0$ is frozen. The total number of trainable parameters is $r(d + k)$ instead of $dk$. For a typical attention weight matrix, where $r$ is far smaller than $d$ and $k$, this cuts the trainable parameters by over 99%.
LoRA is typically applied to the query, key, value, and output projection matrices of the attention layers. During inference, $BA$ can be merged back into $W_0$, so inference latency is unchanged.
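The parameter count and the merge step can be checked with a small pure-Python sketch. Here `W0` is the frozen weight (d x k), and `B` (d x r) and `A` (r x k) are the low-rank factors. All sizes are assumptions picked for illustration: the toy matrices use d = 6, k = 5, r = 2, and the counting example assumes d = k = 4096 with r = 8, which gives 65,536 trainable parameters against 16,777,216 for the full matrix. Standard LoRA initialises `B` to zero; random values are used here only so the merge check is non-trivial.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


# Toy sizes, chosen only for illustration.
d, k, r = 6, 5, 2
rng = random.Random(0)
W0 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]  # frozen
B = [[rng.uniform(-1, 1) for _ in range(r)] for _ in range(d)]   # trainable
A = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(r)]   # trainable
x = [rng.uniform(-1, 1) for _ in range(k)]

# Adapter form used during training: W0 x + B (A x).
y_adapter = [w + u for w, u in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Merged form used at inference: (W0 + BA) x, a single matrix again.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
y_merged = matvec(W_merged, x)

# The two outputs agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(y_adapter, y_merged))
print("max difference:", max_diff)

# Counting example with assumed sizes d = k = 4096, r = 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
print(full, lora, f"reduction: {1 - lora / full:.2%}")
```

Because the merged matrix has the same shape as `W0`, serving the merged model costs the same as serving the original one.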
The Loss Function and Gradient Flow
Fine-tuning uses the same loss as pretraining for generative tasks: cross-entropy over next-token predictions.
For classification tasks, cross-entropy over class labels is used instead.
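In its standard form, the next-token loss is $\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$. A minimal sketch of that computation follows, assuming the model has already produced one vector of logits per position. The function names and the toy logits are made up for the example. The same function covers classification if each logit vector holds class scores and each target is a class index.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy: -log p(target) averaged over positions."""
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, target_ids)
    ]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 4 tokens, 2 positions.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
loss = next_token_loss(logits, targets)
print(loss)
```

A useful sanity check is that uniform logits over a vocabulary of size V give a loss of log V.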
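The backward pass still propagates through the frozen layers to reach the trainable ones, but parameter gradients are computed only for the task head and the LoRA matrices $A$ and $B$. Because the backbone weight $W_0$ is frozen, no gradient is computed or stored for it. This is what makes PEFT memory-efficient: the optimiser only needs to maintain states for $A$ and $B$.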
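The sketch below walks through one hand-written gradient step for a single LoRA layer, to show that only the adapter matrices change. It is a toy under stated assumptions: squared error stands in for cross-entropy to keep the derivation short, the sizes (2 outputs, 3 inputs, rank 1) and all values are invented, and `B` starts at zero as in the original LoRA formulation.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(W0, A, B, x):
    """Return y = W0 x + B (A x) and the intermediate h = A x."""
    h = matvec(A, x)
    y = [w + u for w, u in zip(matvec(W0, x), matvec(B, h))]
    return y, h


def loss_fn(y, target):
    """Squared error, used here instead of cross-entropy for brevity."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


W0 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # frozen, d x k
A = [[0.5, -0.5, 1.0]]                     # trainable, r x k
B = [[0.0], [0.0]]                         # trainable, d x r, zero init
x = [1.0, 2.0, -1.0]
target = [1.0, -1.0]
W0_before = [row[:] for row in W0]

y, h = forward(W0, A, B, x)
loss_before = loss_fn(y, target)
g = [yi - ti for yi, ti in zip(y, target)]  # dL/dy

# Parameter gradients are formed for B and A only; none is formed for W0.
grad_B = [[gi * hj for hj in h] for gi in g]                                # g h^T
Bt_g = [sum(B[i][j] * g[i] for i in range(len(g))) for j in range(len(h))]  # B^T g
grad_A = [[bj * xi for xi in x] for bj in Bt_g]                             # (B^T g) x^T

# One plain SGD step on the adapter matrices.
lr = 0.1
B = [[b - lr * gb for b, gb in zip(b_row, g_row)] for b_row, g_row in zip(B, grad_B)]
A = [[a - lr * ga for a, ga in zip(a_row, g_row)] for a_row, g_row in zip(A, grad_A)]

y_after, _ = forward(W0, A, B, x)
loss_after = loss_fn(y_after, target)
print(loss_before, loss_after)
```

With `B` initialised to zero, the gradient for `A` is zero on the first step, so only `B` moves at first; `A` starts to receive non-zero gradients once `B` is non-zero.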
Data Format
Fine-tuning requires structured prompt-response pairs. The format depends on whether the task is generative (produce a completion given a prompt) or discriminative (assign a label to an input). Both require explicit formatting so the model learns the expected input-output structure.
For instruction-following tasks, a common format is:
### Instruction
<task description>
### Input
<example input>
### Response
<expected output>
Consistency in formatting matters. The model will learn the format as part of the task.
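One way to keep the format consistent is to render every example through a single function. The sketch below does this for the layout shown above; the function name `format_example` and the sample review text are invented for illustration.

```python
def format_example(instruction: str, input_text: str, response: str = "") -> str:
    """Render one example in the Instruction / Input / Response layout.

    Leave `response` empty at inference time so the model completes it.
    """
    return (
        "### Instruction\n"
        f"{instruction}\n"
        "### Input\n"
        f"{input_text}\n"
        "### Response\n"
        f"{response}"
    )


example = format_example(
    "Classify the sentiment of the review as positive or negative.",
    "The battery lasts all day and the screen is sharp.",
    "positive",
)
print(example)
```

Using the same function to build training examples and inference prompts avoids small mismatches (extra blank lines, different header text) between the two.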
Key Hyperparameters
| Hyperparameter | Typical starting value | Effect |
|---|---|---|
| Learning rate | | Most impactful; too high causes catastrophic forgetting |
| Epochs | 1-5 | More epochs risk overfitting on small datasets |
| Batch size | 4-32 | Affects gradient noise and memory |
| LoRA rank | 8-64 | Higher rank = more expressiveness, more parameters |
| LoRA alpha | | Scales the adapter contribution |
The learning rate is the single most important hyperparameter. Fine-tuning uses a much smaller learning rate than pretraining because the model already has useful representations. Large learning rates will overwrite them.
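The hyperparameters can be collected in a small configuration object, sketched below. The class `FineTuneConfig` and its method names are hypothetical. Defaults are set only where the table gives a range; the learning rate and LoRA alpha are required arguments because no starting value is assumed for them here. The `alpha / rank` scaling follows the convention of the original LoRA formulation, which is an assumption about how alpha is applied rather than something stated in this post.

```python
from dataclasses import dataclass

# Typical starting ranges quoted in the table (inclusive bounds).
TYPICAL_RANGES = {"epochs": (1, 5), "batch_size": (4, 32), "lora_rank": (8, 64)}


@dataclass
class FineTuneConfig:
    learning_rate: float  # required: no starting value is assumed here
    lora_alpha: float     # required: no starting value is assumed here
    epochs: int = 3
    batch_size: int = 8
    lora_rank: int = 8

    def __post_init__(self) -> None:
        for name in ("learning_rate", "lora_alpha", "epochs", "batch_size", "lora_rank"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")

    @property
    def lora_scaling(self) -> float:
        """Factor applied to the adapter output, assuming the alpha / rank convention."""
        return self.lora_alpha / self.lora_rank

    def outside_typical_ranges(self) -> list[str]:
        """Names of settings that fall outside the typical starting ranges."""
        return [
            name
            for name, (low, high) in TYPICAL_RANGES.items()
            if not low <= getattr(self, name) <= high
        ]


# Placeholder values for illustration only, not recommendations.
cfg = FineTuneConfig(learning_rate=1e-4, lora_alpha=16.0)
print(cfg.lora_scaling, cfg.outside_typical_ranges())
```

The ranges are treated as guidance rather than hard limits, so values outside them are reported instead of rejected.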
Hardware Requirements
Training and fine-tuning require parallel floating-point computation at scale. CPUs are not practical for this. GPUs and TPUs are standard. The key resource constraints are:
- Memory: activations, gradients, and optimiser states must all fit in device memory simultaneously during training.
- Compute: each forward and backward pass involves billions of multiply-accumulate operations.
PEFT methods like LoRA exist precisely to reduce the memory footprint to the point where fine-tuning is feasible on a single GPU.
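A back-of-the-envelope estimate makes the difference concrete. The sketch below counts bytes for weights, gradients, and optimiser states under loudly stated assumptions: every value is stored in 4 bytes, the optimiser keeps two extra buffers per trainable parameter (as Adam does), and activation memory is ignored entirely. The model size of one billion parameters and the adapter size of five million are hypothetical.

```python
def training_state_bytes(
    frozen_params: int,
    trainable_params: int,
    bytes_per_value: int = 4,    # assumption: 32-bit floats throughout
    optimizer_buffers: int = 2,  # assumption: Adam-style first and second moments
) -> dict[str, int]:
    """Estimate memory for weights, gradients, and optimiser states (no activations)."""
    weights = (frozen_params + trainable_params) * bytes_per_value
    gradients = trainable_params * bytes_per_value
    optimizer = trainable_params * optimizer_buffers * bytes_per_value
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


# Hypothetical sizes: a 1B-parameter backbone and 5M adapter parameters.
full = training_state_bytes(frozen_params=0, trainable_params=1_000_000_000)
peft = training_state_bytes(frozen_params=1_000_000_000, trainable_params=5_000_000)

for label, estimate in (("full fine-tuning", full), ("PEFT", peft)):
    print(f"{label}: {estimate['total'] / 1e9:.2f} GB")
```

Under these assumptions, nearly all of the saving comes from the gradient and optimiser terms, since the frozen weights still have to be held in memory for the forward pass.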