A ResNet-50 checkpoint is roughly 100 MB of weights and inference can demand several GB of RAM once framework overhead and activations are counted. None of that is acceptable on edge hardware. Today you'll quantize and prune models for deployment.
Quantization reduces precision: float32 (4 bytes) → float16 (2 bytes) → int8 (1 byte) → int4 (0.5 bytes). An int8 model is 4× smaller and typically 2–4× faster on hardware with int8 SIMD units (ARM Cortex-A with NEON, x86 with AVX2). Quality loss: typically 0.5–2% accuracy on classification, larger on detection. Process: collect a calibration dataset (100–1000 representative samples), run it through the model to collect activation statistics, compute per-layer scale factors, then quantize weights and activations.
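The scale-factor step above can be sketched for a single tensor. This is a minimal sketch of per-tensor affine int8 quantization; `quantize_int8` and `dequantize` are hypothetical helper names, not a library API:

```python
import torch

# Per-tensor affine quantization: map the float range [min, max]
# onto the int8 range [-128, 127] with a scale and zero point.
def quantize_int8(x: torch.Tensor):
    lo, hi = x.min().item(), x.max().item()
    scale = (hi - lo) / 255.0 or 1.0          # avoid div-by-zero for constant tensors
    zero_point = round(-128 - lo / scale)      # integer offset so lo maps near -128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale: float, zero_point: int):
    return (q.float() - zero_point) * scale

w = torch.randn(64, 32)
q, s, zp = quantize_int8(w)
err = (w - dequantize(q, s, zp)).abs().max().item()
print(f"max reconstruction error: {err:.5f}")  # bounded by roughly one scale step
```

In a real calibration pass, `lo`/`hi` would come from activation statistics over the calibration set (often a clipped percentile rather than the raw min/max) instead of a single tensor.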
Pruning: set weights below a threshold to zero, then remove zero channels/filters (structured pruning). A VGG-16 can be pruned to 1/10 the original parameters with <2% accuracy loss. Iterative: prune 10%, retrain, prune 10%, retrain... Knowledge distillation: train a small 'student' model to mimic a large 'teacher' model's output logits (not just the ground truth labels). The student learns the teacher's 'soft knowledge' — typically achieves 90%+ of teacher accuracy at 10–50× fewer parameters.
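The threshold step of magnitude pruning is built into PyTorch. A minimal sketch using `torch.nn.utils.prune` on a single layer (the 90% amount here is illustrative, matching the 1/10-parameters figure above):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)
# Zero out the 90% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.9)
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")
# Fold the pruning mask into the weight tensor permanently.
prune.remove(layer, "weight")
```

In the iterative scheme from the text, you would prune a small amount, retrain, and repeat; note that unstructured zeros only shrink the model after structured removal or sparse storage.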
NAS (Neural Architecture Search) automatically finds efficient architectures. EfficientNet, MobileNetV3, and NASNet were found via NAS. MobileNetV3-Large achieves 75.2% ImageNet top-1 accuracy with only 5.4M parameters and ~0.22 GFLOPs — designed for mobile inference. For edge deployment, start with a proven mobile architecture (MobileNet, EfficientNet-Lite, YOLO-nano) rather than optimizing a full model from scratch.
# Model quantization: full precision vs int8
# pip install torch torchvision
import torch
import torch.nn as nn
import time, os

# Simple CNN for demonstration
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64*4*4, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
model.eval()

# ── Float32 baseline ─────────────────────────────────────
x = torch.randn(1, 1, 28, 28)
t0 = time.perf_counter()
with torch.no_grad():
    for _ in range(1000):
        _ = model(x)
fp32_sec = time.perf_counter() - t0
# Save to disk so both sizes are measured the same way
torch.save(model.state_dict(), '/tmp/fp32.pt')
fp32_size = os.path.getsize('/tmp/fp32.pt') / 1024
print(f"FP32: {fp32_size:.1f} KB, {fp32_sec*1000:.1f}ms for 1000 inferences")

# ── Dynamic quantization (int8 weights) ─────────────────
# Note: dynamic quantization only covers nn.Linear (and RNN) layers;
# Conv2d needs static quantization with a calibration pass, so only
# the classifier shrinks here.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
t0 = time.perf_counter()
with torch.no_grad():
    for _ in range(1000):
        _ = quantized(x)
int8_sec = time.perf_counter() - t0
torch.save(quantized.state_dict(), '/tmp/int8.pt')
int8_size = os.path.getsize('/tmp/int8.pt') / 1024
print(f"INT8: {int8_size:.1f} KB, {int8_sec*1000:.1f}ms for 1000 inferences")
print(f"Size reduction: {fp32_size/int8_size:.1f}x")
print(f"Speed improvement: {fp32_sec/int8_sec:.1f}x")

# Check that outputs are close to the float baseline
with torch.no_grad():
    fp32_out = model(x)
    int8_out = quantized(x)
max_diff = (fp32_out - int8_out).abs().max().item()
print(f"Max output difference: {max_diff:.4f}")
Exercises:
1. Load model = torchvision.models.mobilenet_v2(pretrained=True). Measure its size and inference time.
2. Export it with torch.onnx.export(model, x, 'model.onnx'). Open the file in Netron (netron.app) to visualize the graph.
3. Run it with ort.InferenceSession('model.onnx'). Compare ONNX Runtime vs PyTorch inference speed.
4. Implement knowledge distillation. Train a 'teacher' (ResNet-18, pretrained on CIFAR-10) and a 'student' (3-layer CNN, 10× fewer parameters). Train the student with: 0.7 × cross_entropy(student_logits, labels) + 0.3 × KL_divergence(student_logits/T, teacher_logits/T) where T=4 (temperature). Compare the student trained with distillation vs. without. How much accuracy does distillation recover?
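As a starting point for the distillation exercise, the combined objective can be sketched as a loss function. This is a minimal sketch: `distillation_loss` is a hypothetical helper name, and the T² factor on the KL term is an assumption following the common Hinton et al. convention (it keeps soft-target gradients comparable across temperatures):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Hard-label term: ordinary cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl

# Shape check with random logits (hypothetical batch of 8, 10 classes):
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
print(f"loss: {loss.item():.3f}")
```

In the exercise, `teacher_logits` would come from the frozen ResNet-18 (run under `torch.no_grad()`), and only the student's parameters receive gradients.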