Neural Architecture Search and Network Pruning

Pruning

Pruning reduces model size by removing weights, neurons, or channels that contribute little to the output. The main categories:

Method	What is removed	Notes
Magnitude-based	Weights with small absolute value	Simple, effective baseline
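Gradient/sensitivity-based	Weights whose removal is estimated (from gradients) to change the loss least	More principled than magnitude alone
Percentage-of-zero-based	Neurons or channels whose activations are mostly zero	Exploits activation sparsity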
Scale-based	Channels with small scaling factors (e.g. BN gamma)	Structured pruning
Factorisation/decomposition	Low-rank approximations of weight matrices	Structured, hardware-friendly
AutoML-based	Determined by an automated search or learned policy	AMC (reinforcement learning), NetAdapt (latency-guided greedy search)
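
As an illustration of the percentage-of-zero row, the sketch below scores each neuron by the fraction of inputs for which its post-ReLU activation is exactly zero and flags neurons above a threshold. The function names, the nested-list layout, and the threshold value are assumptions made for this example, not part of any particular library.

```python
def apoz(activations):
    """Fraction of zero activations per neuron.

    `activations` is a list of samples; each sample is a list of
    post-ReLU outputs, one value per neuron.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(1 for sample in activations if sample[j] == 0) / n_samples
        for j in range(n_neurons)
    ]


def neurons_to_prune(activations, threshold):
    """Indices of neurons whose zero fraction exceeds `threshold` (assumed cutoff)."""
    return [j for j, score in enumerate(apoz(activations)) if score > threshold]


# Four samples, three neurons (toy values).
acts = [
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.7, 0.0],
    [0.3, 0.9, 0.0],
]
print(apoz(acts))                   # [0.75, 0.25, 0.75]
print(neurons_to_prune(acts, 0.7))  # [0, 2]
```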

Structured pruning (channels, filters) produces hardware-friendly sparsity. Unstructured pruning (individual weights) produces irregular sparsity that requires sparse matrix libraries to realise speedups.
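
The difference between the two can be seen on a small weight matrix. The sketch below is a minimal illustration, assuming each row of the matrix is one output channel; the helper names are hypothetical. Unstructured pruning zeroes individual entries and keeps the shape, while structured pruning drops whole rows and yields a smaller dense matrix.

```python
def prune_unstructured(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest |w|."""
    ranked = sorted(
        (abs(w), r, c)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
    )
    k = int(len(ranked) * sparsity)  # number of individual weights to zero
    pruned = [list(row) for row in weights]
    for _, r, c in ranked[:k]:
        pruned[r][c] = 0.0
    return pruned


def prune_structured(weights, n_remove):
    """Drop the `n_remove` rows (channels) with the smallest L1 norm."""
    norms = sorted((sum(abs(w) for w in row), r) for r, row in enumerate(weights))
    drop = {r for _, r in norms[:n_remove]}
    return [list(row) for r, row in enumerate(weights) if r not in drop]


W = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.1],
]
print(prune_unstructured(W, 0.5))  # same 3 x 3 shape, four entries zeroed
print(prune_structured(W, 1))      # 2 x 3 dense matrix, row 1 removed
```

The unstructured result still occupies a full 3 x 3 matrix, which is why a dense kernel sees no speedup from it.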

NetAdapt is a channel-pruning method: it iteratively removes channels until a latency budget is met, using latency measured on the target hardware platform. It is formulated for convolutional networks and does not directly apply to attention mechanisms.
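
The iterative loop can be sketched as a greedy search. This is a heavily simplified sketch, not the published algorithm: it removes a fixed number of channels per proposal, and the latency and accuracy functions are hypothetical toy stand-ins for on-device measurement and short-term fine-tuning.

```python
def netadapt_sketch(channels, latency_fn, accuracy_fn, budget, step=1):
    """Greedy loop: propose pruning `step` channels from each layer in turn,
    keep the proposal with the best accuracy, stop once latency <= budget."""
    channels = list(channels)
    while latency_fn(channels) > budget:
        proposals = []
        for i in range(len(channels)):
            if channels[i] - step < 1:
                continue  # keep at least one channel per layer
            candidate = channels[:i] + [channels[i] - step] + channels[i + 1:]
            proposals.append((accuracy_fn(candidate), candidate))
        if not proposals:
            break  # nothing left to prune
        _, channels = max(proposals, key=lambda p: p[0])
    return channels


# Toy stand-ins (assumptions): per-channel latency cost and importance per layer.
COSTS = [1.0, 2.0, 1.0]
IMPORTANCE = [3.0, 1.0, 2.0]


def toy_latency(channels):
    return sum(c * cost for c, cost in zip(channels, COSTS))


def toy_accuracy(channels):
    return sum(c * imp for c, imp in zip(channels, IMPORTANCE))


print(netadapt_sketch([4, 4, 4], toy_latency, toy_accuracy, budget=12.0))  # [4, 2, 4]
```

With these toy functions the loop prunes the second layer twice, since it is both the most expensive and the least important.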

After pruning, the model should be fine-tuned, typically with a reduced learning rate, to recover accuracy.

Key Architectures and Design Choices

MobileNet uses depthwise separable convolutions to reduce FLOPs with little loss of accuracy. A standard $k \times k$ convolution over $C_{in}$ channels is replaced by two steps:

A depthwise convolution: one filter per input channel.
A pointwise ( $1 \times 1$ ) convolution: combines channel outputs.
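
The saving can be checked by counting weights. The sketch below assumes $C_{out}$ output channels (a symbol not used above) and ignores biases. The separable form needs $k^2 C_{in} + C_{in} C_{out}$ weights instead of $k^2 C_{in} C_{out}$, a ratio of $1/C_{out} + 1/k^2$.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes, chosen for this example only.
k, c_in, c_out = 3, 32, 64
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable, round(separable / standard, 3))  # 18432 2336 0.127
```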

SqueezeNet introduces the Fire module: a squeeze layer ( $1 \times 1$ convolutions) followed by an expand layer ( $1 \times 1$ and $3 \times 3$ convolutions). Because the squeeze layer cuts the number of channels fed to the $3 \times 3$ filters, the parameter count drops significantly.
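
A quick weight count shows where the saving comes from. The sketch below compares a Fire module with a plain $3 \times 3$ convolution producing the same number of output channels. The channel counts are illustrative values chosen for this example, and biases are ignored.

```python
def fire_params(c_in, squeeze, expand_1x1, expand_3x3):
    """Weights in a Fire module (biases ignored)."""
    squeeze_w = c_in * squeeze             # 1 x 1 squeeze layer
    expand_1 = squeeze * expand_1x1        # 1 x 1 branch of the expand layer
    expand_3 = 9 * squeeze * expand_3x3    # 3 x 3 branch of the expand layer
    return squeeze_w + expand_1 + expand_3


def plain_3x3_params(c_in, c_out):
    """Weights in a single 3 x 3 convolution (biases ignored)."""
    return 9 * c_in * c_out


fire = fire_params(c_in=96, squeeze=16, expand_1x1=64, expand_3x3=64)
plain = plain_3x3_params(c_in=96, c_out=64 + 64)
print(fire, plain)  # 11776 110592
```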

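ShuffleNet uses grouped convolutions and channel shuffling. Grouped convolutions restrict each filter to a subset of input channels, which cuts computation but isolates the groups from one another. Channel shuffling permutes channels between grouped layers to restore cross-group information flow. The design targets convolutional layers and is not suited to attention mechanisms.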
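
Channel shuffling amounts to a reshape, transpose, and flatten over the channel axis. The sketch below applies it to a flat list of channel indices. The function name is hypothetical, and a real implementation would permute the channel dimension of a tensor instead.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each new group holds channels from every old group.

    Equivalent to reshaping to (groups, n), transposing to (n, groups),
    and flattening, where n is the number of channels per group.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Two groups of three channels: [0, 1, 2] and [3, 4, 5].
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, the next grouped convolution with two groups sees channels that originated in both of the previous groups.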

ResNeXt generalises ResNet by splitting the convolution in each block into multiple parallel paths, implemented as a grouped convolution. The number of paths is the cardinality; at comparable complexity, increasing it can improve accuracy more than simply increasing depth or width.
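
The effect of grouping on cost can be made concrete. In a grouped convolution each filter sees only $C_{in}/g$ input channels, so the weight count falls by a factor of $g$. The sketch below assumes $g$ divides both channel counts and ignores biases; the sizes are illustrative.

```python
def grouped_conv_params(k, c_in, c_out, groups):
    """Weights in a k x k convolution split into `groups` groups (biases ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    return k * k * (c_in // groups) * c_out


dense = grouped_conv_params(3, 128, 128, groups=1)
grouped = grouped_conv_params(3, 128, 128, groups=32)
print(dense, grouped)  # 147456 4608
```

The weights saved this way can be spent on wider layers at the same overall budget.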

Neural Architecture Search

NAS automates the design of neural network architectures. The three components:

Search space: defines what architectures are possible (layer types, connections, operations).
Search strategy: how to explore the space (reinforcement learning, evolutionary algorithms, gradient-based).
Performance estimation: how to evaluate candidate architectures without full training (proxy tasks, weight sharing, once-for-all).
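
The three components map directly onto code. The sketch below is a toy under stated assumptions: the search space is a small dictionary, the search strategy is random search (a simple stand-in for the RL, evolutionary, or gradient-based strategies listed above), and `proxy_score` is a hypothetical formula standing in for a cheap performance estimate.

```python
import random

# Search space: the options allowed for each design choice (illustrative values).
SEARCH_SPACE = {
    "depth": [2, 4, 6],
    "width": [16, 32, 64],
    "kernel": [3, 5],
}


def sample_architecture(rng):
    """Search strategy (random search): pick one option per design choice."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def proxy_score(arch):
    """Performance estimation: hypothetical score that rewards capacity
    and penalises compute. A real system would train briefly or share weights."""
    capacity = arch["depth"] * arch["width"]
    cost = capacity * arch["kernel"] ** 2
    return capacity - 0.02 * cost


def random_search(n_trials, seed=0):
    """Evaluate `n_trials` random architectures and return the best one found."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(n_trials):
        arch = sample_architecture(rng)
        score = proxy_score(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


best_arch, best_score = random_search(n_trials=20)
print(best_arch, best_score)
```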

Once-for-All (OFA): trains a single large network from which any subnet can be extracted by subsampling depth, width, kernel size, and resolution. A specific subnet is then selected based on target platform latency constraints. This avoids retraining from scratch for each deployment target.
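
The selection step can be sketched as a constrained search over subnet configurations. Everything in the sketch below is an assumption made for illustration: the option lists, the exhaustive enumeration, and the two predictor functions, which stand in for a latency lookup on the target platform and an accuracy predictor. No training happens at this stage, because the subnets inherit weights from the large network.

```python
from itertools import product

# Illustrative option lists for the four dimensions named above.
DEPTHS = [2, 3, 4]
WIDTHS = [3, 4, 6]
KERNELS = [3, 5, 7]
RESOLUTIONS = [128, 160, 192, 224]


def predict_latency_ms(cfg):
    """Hypothetical latency model for one target platform."""
    depth, width, kernel, resolution = cfg
    return 0.05 * depth * width * kernel * (resolution / 128) ** 2


def predict_accuracy(cfg):
    """Hypothetical accuracy predictor: larger subnets score higher."""
    depth, width, kernel, resolution = cfg
    return 60 + 3 * depth + 1.5 * width + 0.5 * kernel + 0.03 * resolution


def select_subnet(latency_budget_ms):
    """Return the most accurate configuration within the budget, or None."""
    feasible = [
        cfg
        for cfg in product(DEPTHS, WIDTHS, KERNELS, RESOLUTIONS)
        if predict_latency_ms(cfg) <= latency_budget_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=predict_accuracy)


chosen = select_subnet(latency_budget_ms=5.0)
print(chosen, predict_latency_ms(chosen), predict_accuracy(chosen))
```

Changing the budget, or swapping in a latency model for a different device, selects a different subnet from the same trained network.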

NAS variants:

Name	Approach
NASNet	RL-based cell search
AmoebaNet	Evolutionary search
ProxylessNAS	Gradient-based, latency-aware
PC-DARTS	Gradient-based, partial channel connections
MobileNetV3	NAS-designed with NetAdapt

Applications of NAS

Object detection: search for efficient backbone and neck architectures.
Mobile applications: latency-constrained search for edge deployment.
Pose estimation: architecture search for keypoint detection.
GANs: generator and discriminator architecture search.

The trade-off in all NAS methods is search cost versus the quality of the discovered architecture. Gradient-based methods are significantly cheaper than RL or evolutionary approaches but may get stuck in local optima.