Neural Network Efficiency: FLOPs, Memory, and Quantisation
Efficient neural networks are smaller, faster, and consume less energy. This post covers the key metrics: FLOPs, memory footprint, latency, and how quantisation reduces all three.
Three Dimensions of Efficiency
An efficient neural network is:
- Smallest (storage): minimal memory footprint.
- Fastest (computation): minimal latency.
- Most frugal (energy): minimal power consumption.
These are related but not identical. Reducing parameter count reduces memory, but it does not always reduce latency, for example when the remaining operations are memory-bandwidth bound.
Key Metrics
Memory is determined by the model size: number of parameters multiplied by bits per parameter.
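Latency depends on two components: the time spent on computation and the time spent moving data to and from memory. Whichever is larger tends to dominate.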
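Throughput measures how much data the system processes per unit of time, for example inputs per second. It is limited by how quickly activations and weights can be moved as well as by raw compute.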
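A common first-order model treats latency as the larger of compute time and data-movement time. The sketch below assumes that model; the function name, the device peaks, and the layer figures are all hypothetical numbers chosen for illustration, not measurements.

```python
def estimate_latency_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """First-order latency estimate: the slower of compute and data movement."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical device: 1 TFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK_FLOPS = 1e12
PEAK_BW = 100e9

# Hypothetical layer: 2 GFLOPs of work, 500 MB of weights and activations moved.
compute_only = 2e9 / PEAK_FLOPS          # 2 ms if compute were the only cost
latency = estimate_latency_s(2e9, 500e6, PEAK_FLOPS, PEAK_BW)

print(f"compute time: {compute_only * 1e3:.1f} ms")
print(f"estimated latency: {latency * 1e3:.1f} ms")  # memory-bound in this case
```

In this example the layer is memory-bound: halving its FLOPs would not change the estimate, while halving the bytes moved would.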
FLOPs
A FLOP is a single floating-point operation. A MAC (multiply-accumulate) counts as 2 FLOPs: one multiply and one add.
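As a rough illustration of the MAC-to-FLOP convention, the sketch below counts MACs for a fully connected layer and a 2D convolution and doubles them. The helper names and layer shapes are hypothetical, and biases and activation functions are ignored.

```python
def linear_macs(in_features, out_features, batch=1):
    """One MAC per (input, output) pair, per sample."""
    return batch * in_features * out_features

def conv2d_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w, batch=1):
    """Each output element needs c_in * kernel_h * kernel_w MACs."""
    return batch * c_out * out_h * out_w * c_in * kernel_h * kernel_w

def macs_to_flops(macs):
    """A MAC is one multiply plus one add, so 2 FLOPs."""
    return 2 * macs

# Hypothetical layers for illustration.
fc_macs = linear_macs(4096, 1000)
conv_macs = conv2d_macs(c_in=3, c_out=64, kernel_h=3, kernel_w=3, out_h=224, out_w=224)

print(f"linear: {fc_macs:,} MACs = {macs_to_flops(fc_macs):,} FLOPs")
print(f"conv2d: {conv_macs:,} MACs = {macs_to_flops(conv_macs):,} FLOPs")
```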
Approximate relative costs by operation type, normalised so that a 32-bit float ADD costs 1:
| Operation | Relative cost |
|---|---|
| 32-bit int ADD | 0.1 |
| 32-bit float ADD | 1 |
| 32-bit float MULT | 5 |
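AlexNet has 61M parameters. At 32 bits (4 bytes) per parameter: 61M × 4 bytes = 244 MB.
FLOP/s (FLOPs per second) is a hardware metric: how many operations a device can execute each second. Model FLOPs are a property of the network: how many operations one forward pass requires.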
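The size calculation is parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch, using the 61M parameter count quoted above and taking 1 MB as 10^6 bytes (the function name is illustrative):

```python
def model_size_bytes(num_params, bits_per_param):
    """Storage for the weights alone: parameters x bits, converted to bytes."""
    return num_params * bits_per_param / 8

ALEXNET_PARAMS = 61_000_000  # parameter count quoted in the text

for bits in (32, 16, 8, 4):
    size_mb = model_size_bytes(ALEXNET_PARAMS, bits) / 1e6  # 1 MB = 10**6 bytes
    print(f"{bits:>2}-bit: {size_mb:.1f} MB")
```

This counts weights only; any metadata a real file format adds (scales, zero points, headers) is not included.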
Memory Bottleneck in Training
During training, the memory bottleneck is usually activations, not weights. Activations must be stored for backpropagation, and their total size is proportional to the batch size times the size of each layer’s output. In convolutional networks, the early, high-resolution layers produce especially large activation tensors. Weights and their gradients are comparatively small.
This is why gradient checkpointing (trading computation for memory by recomputing activations during the backward pass) is a useful technique for large models.
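To get a feel for the numbers, the sketch below compares weight memory with stored-activation memory for a hypothetical stack of identical linear layers, then gives a rough estimate of the saving from checkpointing every fourth layer. The layer sizes, batch size, and the simple peak-memory model (kept checkpoints plus the intermediates of one recomputed segment) are assumptions for illustration, not a description of any particular framework.

```python
BYTES_FP32 = 4  # bytes per 32-bit float

# Hypothetical stack: 24 identical 1024 -> 1024 linear layers (biases ignored),
# applied to 512 tokens per sample.
NUM_LAYERS, WIDTH, TOKENS = 24, 1024, 512
params_per_layer = [WIDTH * WIDTH] * NUM_LAYERS
act_elems_per_layer = [TOKENS * WIDTH] * NUM_LAYERS  # output elements per sample

def weight_bytes(params):
    """Memory for the weights (weight gradients take the same amount again)."""
    return sum(params) * BYTES_FP32

def activation_bytes(act_elems, batch_size):
    """Memory if every layer output is kept for the backward pass."""
    return batch_size * sum(act_elems) * BYTES_FP32

def checkpointed_activation_bytes(act_elems, batch_size, every):
    """Rough peak: kept checkpoints plus one recomputed segment's intermediates."""
    is_ckpt = [(i + 1) % every == 0 for i in range(len(act_elems))]
    kept = sum(a for a, c in zip(act_elems, is_ckpt) if c)
    largest_segment = max(
        sum(a for a, c in zip(act_elems[s:s + every], is_ckpt[s:s + every]) if not c)
        for s in range(0, len(act_elems), every)
    )
    return batch_size * (kept + largest_segment) * BYTES_FP32

batch = 32
w = weight_bytes(params_per_layer)
full = activation_bytes(act_elems_per_layer, batch)
ckpt = checkpointed_activation_bytes(act_elems_per_layer, batch, every=4)

print(f"weights:                    {w / 1e6:8.1f} MB")
print(f"activations (all stored):   {full / 1e6:8.1f} MB")
print(f"activations (checkpointed): {ckpt / 1e6:8.1f} MB")
```

The saving comes at the cost of roughly one extra forward pass through each segment during the backward pass.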
Quantisation
Quantisation reduces the number of bits used to represent weights and activations.
- 32-bit float: 4 bytes per value.
- 8-bit integer: 1 byte per value, 4x reduction in memory.
- 4-bit integer: 0.5 bytes per value, a further 2x reduction (8x relative to 32-bit float).
Benefits:
- Smaller model size.
- Faster inference on hardware that supports integer arithmetic.
- Lower memory bandwidth requirement during inference.
The trade-off is a potential drop in accuracy, particularly at very low bit widths. Post-training quantisation (PTQ) quantises an already trained model without further training. Quantisation-aware training (QAT) incorporates the quantisation error into the training loop, so the model can learn to compensate for it.
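As a minimal sketch of the post-training case, the code below applies symmetric per-tensor quantisation to a handful of made-up weights and measures the round-trip error at 8 and 4 bits. Symmetric per-tensor scaling is only one of several possible schemes and is an assumption here; the function names and weight values are illustrative.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Made-up weights for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -0.98]

q8, s8 = quantize_symmetric(weights, num_bits=8)
q4, s4 = quantize_symmetric(weights, num_bits=4)
err8 = max_abs_error(weights, dequantize(q8, s8))
err4 = max_abs_error(weights, dequantize(q4, s4))

print("int8 codes:", q8, f"max error: {err8:.4f}")
print("int4 codes:", q4, f"max error: {err4:.4f}")
```

The rounding error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error much larger for the same range of values.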
Efficiency Methods Summary
| Category | Techniques |
|---|---|
| Memory | Reduce parameters, weight sharing, quantisation |
| Computation | Pruning, knowledge distillation, efficient architectures |
| Data movement | Fused operators, tiling, reduced activation storage |
For deployment on edge devices, memory and data movement are often the binding constraints, not raw FLOP count. A model with fewer FLOPs but poor memory access patterns can be slower than a model with more FLOPs and good cache locality.