Contents
  1. Three Dimensions of Efficiency
  2. Key Metrics
  3. FLOPs
  4. Memory Bottleneck in Training
  5. Quantisation
  6. Efficiency Methods Summary

Neural Network Efficiency: FLOPs, Memory, and Quantisation

Efficient neural networks are smaller, faster, and consume less energy. This post covers the key metrics (FLOPs, memory footprint, and latency) and how quantisation helps with size, speed, and energy use.

Three Dimensions of Efficiency

An efficient neural network is:

  • Small (storage): minimal memory footprint.
  • Fast (computation): minimal latency.
  • Frugal (energy): minimal power consumption.

These are related but not identical. Reducing parameter count reduces memory but does not always reduce latency if the remaining operations are memory-bandwidth bound.

Key Metrics

Memory is determined by the model size: number of parameters multiplied by bits per parameter.
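As a quick check of that formula, here is a minimal sketch that converts a parameter count and bit width into bytes. The helper name `model_size_bytes` is made up for this illustration; the 61M figure is the AlexNet parameter count quoted later in this post.

```python
def model_size_bytes(num_params: int, bits_per_param: int) -> float:
    """Model size in bytes: number of parameters x bits per parameter / 8."""
    return num_params * bits_per_param / 8


alexnet_params = 61_000_000  # AlexNet parameter count quoted in this post

size_fp32 = model_size_bytes(alexnet_params, 32)
size_int8 = model_size_bytes(alexnet_params, 8)

print(f"fp32: {size_fp32 / 1e6:.0f} MB")  # fp32: 244 MB
print(f"int8: {size_int8 / 1e6:.0f} MB")  # int8: 61 MB
```

This counts weights only; activations, optimizer state, and framework overhead are deliberately left out.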

Latency depends on two components:

$$T_\text{latency} = T_\text{comp} + T_\text{data}$$

$$T_\text{comp} = \frac{\text{Number of operations in NN}}{\text{Processor peak throughput}}$$

$$T_\text{data} = \frac{\text{Input activation size} + \text{Output activation size}}{\text{Memory bandwidth}}$$
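The latency model above can be sketched in a few lines. The function name and every hardware number below are hypothetical, chosen only to make the arithmetic easy to follow; they do not describe any real device.

```python
def estimate_latency(num_ops, peak_ops_per_s, in_act_bytes, out_act_bytes, mem_bw_bytes_per_s):
    """Return (t_comp, t_data, t_latency) in seconds, following the formulas above."""
    t_comp = num_ops / peak_ops_per_s
    t_data = (in_act_bytes + out_act_bytes) / mem_bw_bytes_per_s
    return t_comp, t_data, t_comp + t_data


# Hypothetical workload and hardware (assumptions for illustration only).
t_comp, t_data, t_latency = estimate_latency(
    num_ops=2e9,               # operations in the network
    peak_ops_per_s=1e12,       # processor peak throughput
    in_act_bytes=4e6,          # input activation size
    out_act_bytes=4e6,         # output activation size
    mem_bw_bytes_per_s=1e10,   # memory bandwidth
)

print(f"compute: {t_comp * 1e3:.2f} ms")    # compute: 2.00 ms
print(f"data:    {t_data * 1e3:.2f} ms")    # data:    0.80 ms
print(f"total:   {t_latency * 1e3:.2f} ms") # total:   2.80 ms
```

Comparing `t_comp` and `t_data` gives a rough sense of whether a workload is limited by arithmetic or by data movement.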

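Throughput is the rate at which the system processes data, for example inputs (or bytes of activations and weights) per second. It is related to latency but not the same thing: latency is the time taken for a single input.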

FLOPs

A FLOP is a single floating-point operation. A MAC (multiply-accumulate) counts as 2 FLOPs: one multiply and one add.
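To make the MAC-to-FLOP conversion concrete, here is a rough sketch that counts MACs for a fully connected layer and a 2D convolution and then doubles them. The counting formulas are the usual textbook ones (bias terms ignored), and the layer shapes are made-up examples rather than layers from any specific model.

```python
def linear_macs(in_features: int, out_features: int) -> int:
    """One MAC per (input, output) pair; bias ignored."""
    return in_features * out_features


def conv2d_macs(c_in: int, c_out: int, k_h: int, k_w: int, h_out: int, w_out: int) -> int:
    """Each output element needs c_in * k_h * k_w MACs; bias ignored."""
    return c_in * k_h * k_w * h_out * w_out * c_out


def macs_to_flops(macs: int) -> int:
    """A MAC is one multiply plus one add, so 2 FLOPs."""
    return 2 * macs


# Hypothetical layer shapes, for illustration only.
fc = linear_macs(4096, 1000)
conv = conv2d_macs(c_in=3, c_out=64, k_h=3, k_w=3, h_out=224, w_out=224)

print(f"linear: {fc:,} MACs = {macs_to_flops(fc):,} FLOPs")
print(f"conv:   {conv:,} MACs = {macs_to_flops(conv):,} FLOPs")
```

Note that the small convolution costs far more operations than the much larger fully connected layer, because its weights are reused at every spatial position.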

Approximate costs by operation type:

| Operation | Relative cost |
| --- | --- |
| 32-bit int ADD | 0.1 |
| 32-bit float ADD | 1 |
| 32-bit float MULT | 5 |

AlexNet has 61M parameters. At 32 bits per parameter: $61 \times 10^6 \times 4 \text{ bytes} = 244 \text{ MB}$.

FLOPs per second (FLOP/s) is the hardware metric. Model FLOPs tell you how many operations a forward pass requires.

Memory Bottleneck in Training

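During training, the memory bottleneck is usually activations, not weights. Activations must be stored for backpropagation, and their total size is proportional to the batch size times the size of each layer’s output. In convolutional networks, the largest activation tensors typically come from the early, high-resolution layers. Weight gradients are comparatively small, since they are the same size as the weights themselves.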

This is why gradient checkpointing (trading computation for memory by recomputing activations during the backward pass) is a useful technique for large models.
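To get a feel for the gap between activation and weight memory, the sketch below compares the two for a single convolutional layer. The layer shape and batch size are hypothetical, and the helper names are invented for this example; only the stored output activation and the weights are counted.

```python
BYTES_FP32 = 4  # 32-bit float, as used elsewhere in this post


def activation_bytes(batch: int, channels: int, height: int, width: int) -> int:
    """Bytes needed to keep one layer's output for the backward pass."""
    return batch * channels * height * width * BYTES_FP32


def conv_weight_bytes(c_in: int, c_out: int, k: int) -> int:
    """Bytes for a k x k convolution's weights (bias ignored)."""
    return c_in * c_out * k * k * BYTES_FP32


# Hypothetical layer: 3x3 conv, 64 -> 64 channels, 112x112 output, batch of 32.
act = activation_bytes(batch=32, channels=64, height=112, width=112)
wgt = conv_weight_bytes(c_in=64, c_out=64, k=3)

print(f"activations: {act / 1e6:.1f} MB")  # roughly 102.8 MB
print(f"weights:     {wgt / 1e3:.1f} KB")  # roughly 147.5 KB
print(f"ratio:       {act / wgt:.0f}x")
```

Because activation memory scales with batch size while weight memory does not, halving the batch halves the first number and leaves the second untouched.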

Quantisation

Quantisation reduces the number of bits used to represent weights and activations.

  • 32-bit float: 4 bytes per value.
  • 8-bit integer: 1 byte per value, 4x reduction in memory.
  • 4-bit integer: 0.5 bytes per value, a further 2x reduction (8x relative to 32-bit float).

Benefits:

  • Smaller model size.
  • Faster inference on hardware that supports integer arithmetic.
  • Lower memory bandwidth requirement during inference.

The trade-off is a potential drop in accuracy, particularly for very low bit widths. Post-training quantisation (PTQ) applies quantisation after training. Quantisation-aware training (QAT) incorporates the quantisation error into the training loop.
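As a minimal sketch of post-training quantisation, the code below applies symmetric per-tensor 8-bit quantisation to a small list of weights and then maps them back to floats. This is one common scheme chosen for illustration, not necessarily what any particular framework does; the function names and weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float step per integer level
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantised:", q)  # quantised: [50, -127, 3, 100]
print("max error:", max_error)
```

The rounding error per value is bounded by half the scale, which is why wider value ranges or fewer bits lead to larger accuracy loss.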

Efficiency Methods Summary

| Category | Techniques |
| --- | --- |
| Memory | Reduce parameters, weight sharing, quantisation |
| Computation | Pruning, knowledge distillation, efficient architectures |
| Data movement | Fused operators, tiling, reduced activation storage |

For deployment on edge devices, memory and data movement are often the binding constraints, not raw FLOP count. A model with fewer FLOPs but poor memory access patterns can be slower than a model with more FLOPs and good cache locality.
