Neural Network Efficiency: FLOPs, Memory, and Quantisation

Three Dimensions of Efficiency

An efficient neural network is:

Smallest (storage): minimal memory footprint.
Fastest (computation): minimal latency.
Leanest (energy): minimal energy consumption per inference.

These are related but not identical. Reducing parameter count reduces memory but does not always reduce latency if the remaining operations are memory-bandwidth bound.

Key Metrics

Memory is determined by the model size: number of parameters multiplied by bits per parameter.

Latency depends on two components:

$$T_\text{latency} = T_\text{comp} + T_\text{data}$$

$$T_\text{comp} = \frac{\text{Number of operations in NN}}{\text{Processor peak throughput}}$$

$$T_\text{data} = \frac{\text{Input activation size} + \text{Output activation size}}{\text{Memory bandwidth}}$$
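The additive latency model above can be turned into a small calculator. This is a minimal sketch: the function name `estimate_latency` and all hardware and workload numbers are hypothetical and chosen only for illustration, not measured on any real device.

```python
def estimate_latency(num_ops, peak_ops_per_s, activation_bytes, bandwidth_bytes_per_s):
    """Return (t_comp, t_data, t_latency) in seconds under the additive model."""
    t_comp = num_ops / peak_ops_per_s                 # time spent computing
    t_data = activation_bytes / bandwidth_bytes_per_s  # time spent moving activations
    return t_comp, t_data, t_comp + t_data


# Hypothetical workload and hardware (assumptions, not measurements).
t_comp, t_data, t_total = estimate_latency(
    num_ops=2e9,                 # operations in one forward pass
    peak_ops_per_s=1e12,         # processor peak throughput
    activation_bytes=50e6,       # input + output activation size
    bandwidth_bytes_per_s=25e9,  # memory bandwidth
)
print(f"compute {t_comp * 1e3:.2f} ms, data {t_data * 1e3:.2f} ms, "
      f"total {t_total * 1e3:.2f} ms")
```

With these made-up numbers the two terms are equal, so halving the operation count alone would remove only a quarter of the total latency.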

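Memory bandwidth measures data movement: bytes of activations and weights moved per second. Throughput, by contrast, is the rate at which work is completed, for example inferences per second. When weights must also be fetched from memory, their size adds to the numerator of T_data.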

FLOPs

A FLOP is a single floating-point operation. A MAC (multiply-accumulate) counts as 2 FLOPs: one multiply and one add.
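The MAC-to-FLOP convention can be applied to common layers as follows. This is a sketch under simplifying assumptions: biases and activation functions are ignored, the convolution is dense (no groups), and the helper names `linear_macs`, `conv2d_macs`, and `macs_to_flops` are illustrative rather than taken from any library.

```python
def linear_macs(in_features, out_features, batch=1):
    """MACs for a fully connected layer: one MAC per weight per sample."""
    return batch * in_features * out_features


def conv2d_macs(c_in, c_out, kernel_h, kernel_w, out_h, out_w, batch=1):
    """MACs for a dense 2D convolution: one kernel window per output element."""
    return batch * c_out * out_h * out_w * c_in * kernel_h * kernel_w


def macs_to_flops(macs):
    """Each MAC is one multiply plus one add, so 2 FLOPs."""
    return 2 * macs


fc = linear_macs(4096, 1000)                 # 4,096,000 MACs
conv = conv2d_macs(3, 64, 3, 3, 224, 224)    # 86,704,128 MACs
print("linear:", fc, "MACs =", macs_to_flops(fc), "FLOPs")
print("conv:  ", conv, "MACs =", macs_to_flops(conv), "FLOPs")
```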

Approximate costs by operation type:

Operation	Relative cost
32-bit int ADD	0.1
32-bit float ADD	1
32-bit float MULT	5

AlexNet has 61M parameters. At 32 bits (4 bytes) per parameter, its weights occupy $61 \times 10^6 \times 4 \text{ bytes} = 244 \text{ MB}$.
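The same arithmetic generalises to any parameter count and bit width. The sketch below uses a hypothetical helper, `model_size_bytes`, and takes 1 MB as 10^6 bytes to match the figure above; it counts weights only and ignores file-format overhead.

```python
def model_size_bytes(num_params, bits_per_param):
    """Weight storage in bytes: parameters times bits per parameter, over 8."""
    return num_params * bits_per_param / 8


ALEXNET_PARAMS = 61_000_000  # parameter count quoted in the text

for bits in (32, 16, 8, 4):
    size_mb = model_size_bytes(ALEXNET_PARAMS, bits) / 1e6
    print(f"{bits:>2}-bit weights: {size_mb:.1f} MB")
```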

FLOPs per second (FLOP/s) is a hardware metric: how fast a processor can compute. A model's FLOP count is a workload metric: how many operations one forward pass requires.

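Memory Bottleneck in Training

During training, the memory bottleneck is usually activations, not weights. Activations must be stored for backpropagation, and their total size is proportional to the batch size times the size of each layer’s output. Layers with large outputs, such as the early high-resolution layers of a convolutional network, dominate this cost. Weight gradients are comparatively small, because they have the same size as the weights themselves.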

This is why gradient checkpointing (trading computation for memory by recomputing activations during the backward pass) is a useful technique for large models.
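A rough accounting model shows where the saving comes from. This sketch is a simplification, not how any particular framework implements checkpointing: it assumes only every `every`-th layer output is kept, and that during the backward pass one segment at a time has its dropped activations recomputed. The function names and the layer sizes are hypothetical.

```python
def activation_bytes(batch_size, layer_output_elems, bytes_per_value=4):
    """Bytes needed to keep every layer's output for backpropagation."""
    return batch_size * sum(layer_output_elems) * bytes_per_value


def checkpointed_activation_bytes(batch_size, layer_output_elems, every,
                                  bytes_per_value=4):
    """Approximate peak: kept checkpoints plus the largest recomputed segment."""
    kept = sum(layer_output_elems[::every])
    worst_recompute = max(
        sum(layer_output_elems[start + 1:start + every])
        for start in range(0, len(layer_output_elems), every)
    )
    return batch_size * (kept + worst_recompute) * bytes_per_value


# Hypothetical network: 16 layers, 1M output values per sample each, fp32.
layers = [1_000_000] * 16
full = activation_bytes(32, layers)
ckpt = checkpointed_activation_bytes(32, layers, every=4)
print(f"store everything: {full / 1e9:.3f} GB")
print(f"checkpoint every 4 layers: {ckpt / 1e9:.3f} GB")
```

The price of the smaller footprint is extra computation, since each dropped activation is recomputed during the backward pass.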

Quantisation

Quantisation reduces the number of bits used to represent weights and activations.

32-bit float: 4 bytes per value.
8-bit integer: 1 byte per value, 4x reduction in memory.
4-bit: 0.5 bytes per value, a further 2x reduction (8x relative to 32-bit float).

Benefits:

Smaller model size.
Faster inference on hardware that supports integer arithmetic.
Lower memory bandwidth requirement during inference.

The trade-off is a potential drop in accuracy, particularly for very low bit widths. Post-training quantisation (PTQ) applies quantisation after training. Quantisation-aware training (QAT) incorporates the quantisation error into the training loop.
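As an illustration of the post-training case, the sketch below applies symmetric per-tensor quantisation to a short list of weights. This is one common scheme among several, chosen here as an assumption; the function names and the example values are hypothetical, and plain Python lists stand in for real tensors.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [qi * scale for qi in q]


weights = [0.52, -1.27, 0.003, 0.9, -0.4]     # made-up example weights
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("integers:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

The rounding error per value is bounded by about half the scale, which is why accuracy degrades as the bit width shrinks and the scale grows.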

Efficiency Methods Summary

Category	Techniques
Memory	Reduce parameters, weight sharing, quantisation
Computation	Pruning, knowledge distillation, efficient architectures
Data movement	Fused operators, tiling, reduced activation storage

For deployment on edge devices, memory and data movement are often the binding constraints, not raw FLOP count. A model with fewer FLOPs but poor memory access patterns can be slower than a model with more FLOPs and good cache locality.
