ONNX is the open standard for ML model exchange. Today you'll use ONNX Runtime for optimized inference and explore dedicated edge accelerators.
ONNX (Open Neural Network Exchange) is a standardized graph format. Export from PyTorch, TensorFlow, scikit-learn, XGBoost. Run with ONNX Runtime, typically 2–5x faster than eager PyTorch for CPU inference. ONNX Runtime uses execution providers: CPUExecutionProvider (default, uses AVX2/AVX512 on x86, NEON on ARM), CUDAExecutionProvider, TensorRTExecutionProvider, OpenVINOExecutionProvider (Intel CPU/VPU), CoreMLExecutionProvider (Apple Silicon).
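Execution providers are a priority list: the session uses the first provider in your list that is actually available on the machine and falls back down the list otherwise. A minimal sketch of that fallback logic (`pick_providers` is a hypothetical helper name, not an onnxruntime API; in real code the available list comes from `ort.get_available_providers()`):

```python
# Choose execution providers by priority, falling back to CPU.
def pick_providers(available, preferred=('TensorRTExecutionProvider',
                                         'CUDAExecutionProvider',
                                         'CPUExecutionProvider')):
    chosen = [p for p in preferred if p in available]
    return chosen or ['CPUExecutionProvider']

print(pick_providers(['CPUExecutionProvider']))
print(pick_providers(['CUDAExecutionProvider', 'CPUExecutionProvider']))
```

Passing the full priority list to `InferenceSession` (as the code below does) achieves the same effect: onnxruntime itself skips providers that are not available.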
Google Coral TPU: USB or PCIe, 4 TOPS, INT8 only, 2W. Compiles models with the Edge TPU Compiler — not all ops are supported (use MobileNet/EfficientNet). Best for classification and detection at high frame rates. NVIDIA Jetson Nano: 472 GFLOPS, 128-core Maxwell GPU, 5W, $99. Runs full PyTorch and TensorRT. Intel Neural Compute Stick 2: USB, 4 TOPS, OpenVINO, $70. Hailo-8: 26 TOPS, designed for automotive ADAS.
TensorRT is NVIDIA's inference optimizer. Takes an ONNX model, fuses layers (Conv+BN+ReLU → single op), selects optimal CUDA kernels, quantizes to INT8 or FP16. A ResNet-50 that takes 8ms in PyTorch takes 1.5ms in TensorRT INT8 on a V100 — 5x speedup. On Jetson Nano: TensorRT is the path to real-time inference. Calibrate with representative data, build the engine (slow, done once), serialize to disk, load for inference.
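That build-once, load-many workflow can be driven from the command line with trtexec, the benchmarking tool that ships with TensorRT/JetPack. A sketch, assuming model.onnx and a calibration cache already exist (file names are placeholders):

```shell
# Build a serialized FP16 engine once (slow), save it to disk
trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16

# INT8 needs representative calibration data; supply a calibration cache
trtexec --onnx=model.onnx --saveEngine=model_int8.engine --int8 \
        --calib=calibration.cache

# Load the serialized engine and benchmark inference latency
trtexec --loadEngine=model_fp16.engine
```

Engine files are specific to the GPU and TensorRT version they were built on, so build on the target device (e.g. the Jetson itself), not on your workstation.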
# ONNX Runtime inference — cross-platform, optimized
# pip install onnxruntime torch torchvision
import numpy as np
import time
import torch
import torchvision.models as models
import onnxruntime as ort
# ── 1. Export PyTorch model to ONNX ──────────────────────
model = models.mobilenet_v2(weights='DEFAULT')  # pretrained=True is deprecated in torchvision >= 0.13
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, 'mobilenetv2.onnx',
    opset_version=13,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
    verbose=False
)
print("Exported to mobilenetv2.onnx")
# ── 2. Run with ONNX Runtime ─────────────────────────────
# Show available providers
print(f"Available: {ort.get_available_providers()}")
# Use best available provider
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
sess_opts = ort.SessionOptions()
sess_opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess = ort.InferenceSession('mobilenetv2.onnx', sess_options=sess_opts,
                            providers=providers)
# ── 3. Benchmark ──────────────────────────────────────────
inp = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Warmup
for _ in range(5): sess.run(None, {'input': inp})
N = 100
t0 = time.perf_counter()
for _ in range(N): out = sess.run(None, {'input': inp})
ort_ms = (time.perf_counter()-t0) / N * 1000
# Compare to PyTorch
with torch.no_grad():
    for _ in range(5): model(dummy)  # warmup
    t0 = time.perf_counter()
    for _ in range(N): model(dummy)
pt_ms = (time.perf_counter()-t0) / N * 1000
print(f"PyTorch CPU: {pt_ms:.1f}ms per inference")
print(f"ONNX Runtime: {ort_ms:.1f}ms per inference")
print(f"Speedup: {pt_ms/ort_ms:.1f}x")
# Class prediction
top5 = np.argsort(out[0][0])[-5:][::-1]
print(f"Top-5 classes: {top5}")
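The top-5 above ranks raw logits, which is fine for picking classes; for probabilities, apply a softmax first. A minimal numerically stable sketch in NumPy:

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating to avoid overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities sum to 1
```

Applied to the listing above, `softmax(out[0][0])[top5]` gives the confidence of each of the five predictions.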
Install ONNX Runtime on a Raspberry Pi (pip3 install onnxruntime) and run the benchmark above. How many FPS can you achieve? Then deploy a real-time object detection pipeline on the Pi: use YOLO-nano (NanoDet, <1MB) or SSD MobileNet converted to ONNX. Connect a USB camera. Process frames with OpenCV. Run ONNX Runtime inference on each frame. Draw bounding boxes. What FPS do you achieve? What's the bottleneck: capture, preprocessing, inference, or drawing?
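To answer the bottleneck question, time each stage separately over many frames rather than the loop as a whole. A minimal per-stage profiler sketch in plain Python (`StageTimer` is a hypothetical helper, not part of OpenCV or ONNX Runtime; the dummy lambdas stand in for cv2 capture and ort inference):

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulate wall-clock time per pipeline stage
    (capture, preprocess, inference, draw) and report averages."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def measure(self, stage, fn, *args):
        t0 = time.perf_counter()
        result = fn(*args)
        self.totals[stage] += time.perf_counter() - t0
        self.counts[stage] += 1
        return result

    def report(self):
        # Average milliseconds spent in each stage
        return {s: 1000 * self.totals[s] / self.counts[s]
                for s in self.totals}

# Usage with dummy stages in place of cv2.read() / sess.run():
timer = StageTimer()
for _ in range(10):
    frame = timer.measure('capture', lambda: [0] * 100)
    out = timer.measure('inference', lambda f: sum(f), frame)
print(timer.report())
```

Wrap each real stage (cap.read, resize/normalize, sess.run, drawing) in `measure` and the report shows where the milliseconds go; on a Pi, preprocessing and drawing often cost as much as inference.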