Day 2 of 5
⏱ ~60 minutes
Edge Computing in 5 Days — Day 2

Model Optimization

A ResNet-50 checkpoint is ~100 MB in float32, and running it inside a full inference framework can consume gigabytes of RAM. Neither fits comfortably on edge hardware. Today you'll quantize and prune models for deployment.

Quantization

Quantization reduces precision: float32 (4 bytes) → float16 (2 bytes) → int8 (1 byte) → int4 (0.5 bytes). An int8 model is 4× smaller and 4× faster on hardware with int8 SIMD units (ARM Cortex-A with NEON, x86 with AVX2). Quality loss: typically 0.5–2% accuracy on classification, larger on detection. Process: collect a calibration dataset (100–1000 representative samples), run through the model to collect activation statistics, compute per-layer scale factors, quantize weights and activations.
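The scale-factor math above can be sketched for a single tensor. This is a minimal symmetric per-tensor scheme (an assumption for illustration; production toolchains also handle zero-points and per-channel scales):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: scale maps max |x| to 127."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)   # stand-in weight tensor
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"scale={scale:.5f}, max round-trip error={err:.5f}")  # error ≤ scale/2
```

Each float is stored as one signed byte plus a shared scale, which is where the 4× size reduction comes from; calibration data is what picks a good `scale` per layer for the activations.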

Pruning and Knowledge Distillation

Pruning: set weights below a threshold to zero, then remove zero channels/filters (structured pruning). A VGG-16 can be pruned to 1/10 the original parameters with <2% accuracy loss. Iterative: prune 10%, retrain, prune 10%, retrain... Knowledge distillation: train a small 'student' model to mimic a large 'teacher' model's output logits (not just the ground truth labels). The student learns the teacher's 'soft knowledge' — typically achieves 90%+ of teacher accuracy at 10–50× fewer parameters.
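A minimal sketch of one pruning step using PyTorch's built-in utilities (unstructured magnitude pruning here for brevity; structured pruning of whole channels uses `prune.ln_structured` instead):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)

# Zero out the 30% of weights with the smallest magnitude. This applies a
# mask, so the layer still trains normally during the retraining phase.
prune.l1_unstructured(layer, name="weight", amount=0.3)
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after pruning: {sparsity:.0%}")  # ~30%

# After retraining converges, fold the mask into the weight permanently.
prune.remove(layer, "weight")
```

The iterative schedule from the text is just this step in a loop: prune a small amount, retrain until accuracy recovers, repeat.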

Neural Architecture Search

NAS (Neural Architecture Search) automatically finds efficient architectures. EfficientNet, MobileNetV3, NASNet were found via NAS. MobileNetV3 achieves 75.2% ImageNet accuracy with only 5.4M parameters and 0.22 GFLOPs — designed for mobile inference. For edge deployment, start with a proven mobile architecture (MobileNet, EfficientNet-Lite, YOLO-nano) rather than optimizing a full model from scratch.

```python
# Model quantization: full precision vs int8
# pip install torch torchvision

import torch
import torch.nn as nn
import time, os

# Simple CNN for demonstration
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64*4*4, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = TinyCNN()
model.eval()

# ── Float32 baseline ─────────────────────────────────────
x = torch.randn(1, 1, 28, 28)
with torch.no_grad():
    t0 = time.perf_counter()
    for _ in range(1000): _ = model(x)
    fp32_s = time.perf_counter() - t0

torch.save(model.state_dict(), '/tmp/fp32.pt')
fp32_size = os.path.getsize('/tmp/fp32.pt') / 1024
print(f"FP32: {fp32_size:.1f} KB, {fp32_s*1000:.1f}ms for 1000 inferences")

# ── Dynamic quantization (int8 weights) ─────────────────
# Note: dynamic quantization only covers Linear/RNN layers — the Conv2d
# layers stay float32. Quantizing convolutions requires static quantization.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
with torch.no_grad():
    t0 = time.perf_counter()
    for _ in range(1000): _ = quantized(x)
    int8_s = time.perf_counter() - t0

# Compare serialized sizes (same format on both sides for a fair comparison)
torch.save(quantized.state_dict(), '/tmp/q.pt')
int8_size = os.path.getsize('/tmp/q.pt') / 1024
print(f"INT8: {int8_size:.1f} KB, {int8_s*1000:.1f}ms for 1000 inferences")

print(f"Size reduction:  {fp32_size/int8_size:.1f}x")
print(f"Speed improvement: {fp32_s/int8_s:.1f}x")

# Check how close the quantized outputs stay to full precision
with torch.no_grad():
    fp32_out = model(x)
    int8_out = quantized(x)
    max_diff = (fp32_out - int8_out).abs().max().item()
    print(f"Max output difference: {max_diff:.4f}")
```
💡
INT8 quantization gives the best results when you calibrate with representative data. Dynamic quantization (quantize weights only, activations at runtime) is the easiest to apply. Static quantization (calibrate both) gives better accuracy. Post-training quantization (PTQ) works for most cases; quantization-aware training (QAT) is needed when PTQ causes >2% accuracy drop.
📝 Day 2 Exercise
Quantize a Real Model
  1. Download a pretrained MobileNetV2: model = torchvision.models.mobilenet_v2(weights='IMAGENET1K_V1') (the older pretrained=True argument is deprecated). Measure size and inference time.
  2. Apply dynamic quantization. Measure new size and speed. Compare top-1 accuracy on 100 ImageNet validation images.
  3. Apply static quantization: insert QuantStub/DeQuantStub, calibrate with 100 images, then convert. Compare accuracy and speed.
  4. Export the quantized model to ONNX: torch.onnx.export(model, x, 'model.onnx'). Open in Netron (netron.app) to visualize the graph.
  5. Run the ONNX model with ONNX Runtime: ort.InferenceSession('model.onnx'). Compare ONNX Runtime vs PyTorch inference speed.

Day 2 Summary

  • INT8 quantization: 4x smaller, up to 4x faster on int8-capable hardware, 0.5-2% accuracy loss — best trade-off for edge
  • Dynamic quant is easiest; static quant is most accurate; QAT is for difficult cases
  • MobileNet/EfficientNet-Lite are designed for edge — use them instead of optimizing large models
  • Always benchmark accuracy AND latency before and after optimization on target hardware
Challenge

Implement knowledge distillation. Train a 'teacher' (ResNet-18, pretrained on CIFAR-10) and a 'student' (3-layer CNN, 10x fewer parameters). Train the student with: 0.7 × cross_entropy(student_logits, labels) + 0.3 × KL_divergence(student_logits/T, teacher_logits/T) where T=4 (temperature). Compare student trained with distillation vs without. How much accuracy does distillation recover?
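To get started, the loss described above can be sketched as a single function (a hypothetical helper; T and the 0.7/0.3 weights follow the challenge spec, and the T² rescaling is the standard Hinton trick for keeping the soft-loss gradients on the same scale as the hard loss):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """alpha * hard-label CE + (1 - alpha) * soft-target KL at temperature T."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Smoke test with random logits for a batch of 8, 10 classes
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
print(f"loss={loss.item():.3f}")
```

During training, run each batch through the frozen teacher under `torch.no_grad()` to get `teacher_logits`, then backpropagate this loss through the student only.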
