Edge Computing in 5 Days — Day 4 of 5
⏱ ~60 minutes

ONNX and Deployment

ONNX is the open standard for ML model exchange. Today you'll use ONNX Runtime for optimized inference and explore dedicated edge accelerators.

ONNX and ONNX Runtime

ONNX (Open Neural Network Exchange) is a standardized graph format for ML models. You can export from PyTorch, TensorFlow, scikit-learn, or XGBoost, then run the exported graph with ONNX Runtime — typically 2–5x faster than PyTorch CPU inference. ONNX Runtime dispatches work through execution providers: CPUExecutionProvider (the default; uses AVX2/AVX-512 on x86 and NEON on ARM), CUDAExecutionProvider, TensorRTExecutionProvider, OpenVINOExecutionProvider (Intel CPU/VPU), and CoreMLExecutionProvider (Apple Silicon).
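ONNX Runtime takes a preference-ordered provider list and falls back down it at session creation. A small helper can make that fallback explicit — this is a sketch; `pick_providers` and `PREFERRED` are names introduced here, not ONNX Runtime API, though the provider strings are the ones ONNX Runtime reports:

```python
# Our own preference order, fastest first. The strings match what
# ort.get_available_providers() returns on each platform.
PREFERRED = [
    'TensorrtExecutionProvider',
    'CUDAExecutionProvider',
    'CoreMLExecutionProvider',
    'OpenVINOExecutionProvider',
    'CPUExecutionProvider',
]

def pick_providers(available):
    """Return available providers in preference order, falling back to CPU."""
    chosen = [p for p in PREFERRED if p in available]
    return chosen or ['CPUExecutionProvider']

# In real code:
#   sess = ort.InferenceSession(path,
#       providers=pick_providers(ort.get_available_providers()))
print(pick_providers(['CPUExecutionProvider', 'CUDAExecutionProvider']))
# → ['CUDAExecutionProvider', 'CPUExecutionProvider']
```

The same script then runs unchanged on a CUDA workstation, an Apple laptop, or a Raspberry Pi.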

Edge Accelerators

  • Google Coral Edge TPU: USB or PCIe, 4 TOPS (INT8 only), 2W. Models must be compiled with the Edge TPU Compiler, and not all ops are supported (MobileNet/EfficientNet work well). Best for classification and detection at high frame rates.
  • NVIDIA Jetson Nano: 128-core Maxwell GPU, 472 GFLOPS, 5W, $99. Runs full PyTorch and TensorRT.
  • Intel Neural Compute Stick 2: USB, 4 TOPS, programmed via OpenVINO, $70.
  • Hailo-8: 26 TOPS, designed for automotive ADAS.
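The spec numbers above give a quick back-of-envelope upper bound on throughput. The sketch below assumes MobileNetV2 at 224×224 costs roughly 300M MACs ≈ 0.6 GOPs per inference (a commonly cited figure, not measured here); real throughput is far lower once memory bandwidth and USB/PCIe transfer enter the picture:

```python
def theoretical_fps(peak_gops, model_gops=0.6):
    """Upper-bound frames/sec if the accelerator were fully utilized."""
    return peak_gops / model_gops

# Peak throughput (GOPS) and power (W) from the accelerator list above.
for name, gops, watts in [('Coral Edge TPU', 4000, 2),
                          ('Jetson Nano',     472, 5)]:
    print(f'{name:15s} ~{theoretical_fps(gops):7.0f} fps max, '
          f'{gops / watts / 1000:.2f} TOPS/W')
```

The perf-per-watt column is why the Coral shows up so often in battery- and thermally-constrained designs, even though the Jetson runs a far wider range of models.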

TensorRT Optimization

TensorRT is NVIDIA's inference optimizer. It takes an ONNX model, fuses layers (Conv+BN+ReLU → a single op), selects optimal CUDA kernels, and quantizes to INT8 or FP16. A ResNet-50 that takes 8ms in PyTorch runs in about 1.5ms with TensorRT INT8 on a V100 — a roughly 5x speedup. On the Jetson Nano, TensorRT is the path to real-time inference: calibrate with representative data, build the engine (slow, but done once), serialize it to disk, and load it for inference.
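The Conv+BN fusion that TensorRT (and ONNX Runtime) performs is just the standard BatchNorm-folding algebra. The NumPy sketch below illustrates it on a linear layer rather than a convolution, purely for clarity — this is the idea, not TensorRT's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # layer weights, no bias
x = rng.standard_normal(16)

# BatchNorm parameters as frozen after training
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mu, var, eps = rng.standard_normal(8), rng.random(8) + 0.1, 1e-5

# Unfused: layer, then batch norm — two ops at inference time
z = W @ x
unfused = gamma * (z - mu) / np.sqrt(var + eps) + beta

# Folded: absorb BN's scale/shift into the weights — one op
s = gamma / np.sqrt(var + eps)
W_fold = s[:, None] * W
b_fold = beta - s * mu
fused = W_fold @ x + b_fold

print(np.allclose(unfused, fused))  # True
```

Since BN at inference time is a fixed affine transform, folding it away is exact — no accuracy loss, one less kernel launch and one less pass over memory per layer.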

python
# ONNX Runtime inference — cross-platform, optimized
# pip install onnxruntime torch torchvision

import numpy as np
import time
import torch
import torchvision.models as models
import onnxruntime as ort

# ── 1. Export PyTorch model to ONNX ──────────────────────
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.eval()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, 'mobilenetv2.onnx',
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
    verbose=False
)
print("Exported to mobilenetv2.onnx")

# ── 2. Run with ONNX Runtime ─────────────────────────────
# Show available providers
print(f"Available: {ort.get_available_providers()}")

# Use best available provider
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession('mobilenetv2.onnx', sess_options=sess_opts,
                             providers=providers)

# ── 3. Benchmark ──────────────────────────────────────────
inp = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warmup
for _ in range(5): sess.run(None, {'input': inp})

N = 100
t0 = time.perf_counter()
for _ in range(N): out = sess.run(None, {'input': inp})
ort_ms = (time.perf_counter()-t0) / N * 1000

# Compare to PyTorch
with torch.no_grad():
    for _ in range(5): model(dummy)  # warmup
    t0 = time.perf_counter()
    for _ in range(N): model(dummy)
    pt_ms  = (time.perf_counter()-t0) / N * 1000

print(f"PyTorch CPU:     {pt_ms:.1f}ms per inference")
print(f"ONNX Runtime:    {ort_ms:.1f}ms per inference")
print(f"Speedup:         {pt_ms/ort_ms:.1f}x")

# Class prediction
top5 = np.argsort(out[0][0])[-5:][::-1]
print(f"Top-5 classes: {top5}")
💡
ONNX Runtime's graph optimization (ORT_ENABLE_ALL) fuses operators automatically — Conv+BatchNorm+ReLU becomes a single fused op. On ARM CPUs, the default kernels already use NEON SIMD, and a separate XNNPACK execution provider can be enabled for further acceleration on mobile builds. You get the graph-level optimizations for free by switching from PyTorch to ONNX Runtime.
📝 Day 4 Exercise
Compare Inference Backends
  1. Export MobileNetV2 to ONNX. Benchmark PyTorch CPU vs ONNX Runtime CPU on your machine.
  2. On Raspberry Pi 4: install ONNX Runtime (pip3 install onnxruntime — aarch64 wheels are available on a 64-bit OS). Benchmark. How many FPS can you achieve?
  3. Enable graph optimization levels: ORT_DISABLE_ALL, ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED, ORT_ENABLE_ALL. Compare latency for each.
  4. Export to TFLite and compare: ONNX Runtime vs TFLite on the same Raspberry Pi. Which is faster for MobileNetV2?
  5. If available, test Google Coral USB Accelerator: install pycoral, compile MobileNetV2 with Edge TPU Compiler, run inference. How does Coral compare to CPU?

Day 4 Summary

  • ONNX Runtime: 2–5x faster than PyTorch CPU, using AVX2/AVX-512 and NEON SIMD kernels automatically
  • Execution providers let the same code use CPU, CUDA, TensorRT, CoreML, OpenVINO with one flag change
  • Coral TPU: 4 TOPS, INT8 only, 2W — highest performance-per-watt for supported architectures
  • TensorRT gives 3-5x speedup over PyTorch on NVIDIA — essential for Jetson edge deployment
Challenge

Deploy a real-time object detection pipeline on Raspberry Pi: use a lightweight detector such as NanoDet (under 1M parameters) or SSD MobileNet converted to ONNX. Connect a USB camera. Process frames with OpenCV. Run ONNX Runtime inference on each frame. Draw bounding boxes. What FPS do you achieve? What's the bottleneck: capture, preprocessing, inference, or drawing?
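To answer the bottleneck question, time each stage separately rather than only the end-to-end loop. A pure-Python harness sketch — the stage names and `timed_pipeline` are placeholders of my own, with `time.sleep` standing in for the real OpenCV capture, preprocessing, `sess.run`, and drawing calls:

```python
import time
from collections import defaultdict

def timed_pipeline(stages, n_frames=50):
    """Run the pipeline n_frames times, accumulating per-stage latency.

    Returns mean milliseconds per stage."""
    totals = defaultdict(float)
    for _ in range(n_frames):
        data = None
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)
            totals[name] += time.perf_counter() - t0
    return {name: totals[name] / n_frames * 1000 for name, _ in stages}

# Stand-in stages — swap in cv2.VideoCapture.read, resize/normalize,
# sess.run, and cv2.rectangle/imshow for the real pipeline.
stages = [
    ('capture',    lambda _: time.sleep(0.005)),   # ~5 ms
    ('preprocess', lambda _: time.sleep(0.002)),   # ~2 ms
    ('inference',  lambda _: time.sleep(0.030)),   # ~30 ms
    ('draw',       lambda _: time.sleep(0.001)),   # ~1 ms
]

ms = timed_pipeline(stages, n_frames=20)
total = sum(ms.values())
for name, t in ms.items():
    print(f'{name:10s} {t:5.1f} ms ({t/total:4.0%})')
print(f'total {total:.1f} ms → {1000/total:.1f} fps')
```

With the stand-in numbers above, inference dominates — on a Pi 4 that is the typical outcome, which is exactly when INT8 quantization or a Coral accelerator pays off. If capture dominates instead, a faster camera mode or a grab thread helps more than a faster model.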
