Day 3 of 5
⏱ ~60 minutes
Edge Computing in 5 Days — Day 3

TensorFlow Lite

TFLite is the deployment format for edge ML. Today you'll convert, optimize, and run .tflite models on Raspberry Pi and Arduino Nano 33.

TFLite Architecture

TensorFlow Lite has two runtimes:

  • TFLite — the full interpreter for Android/iOS/Linux (Raspberry Pi, Coral): supports most ops, with C++ and Python APIs.
  • TFLite Micro (TFLM) — the microcontroller runtime: no dynamic memory allocation, 16–256KB footprint, runs on Arduino Nano 33 BLE Sense, STM32, and ESP32-S3.

The workflow: train in TensorFlow/Keras → convert to FlatBuffer format (.tflite) → deploy to the edge device.

Model Conversion

tf.lite.TFLiteConverter.from_keras_model(model) converts a Keras model. Setting converter.optimizations = [tf.lite.Optimize.DEFAULT] enables quantization — on its own it quantizes only the weights (dynamic-range quantization). To quantize activations too (full integer quantization), also provide a representative dataset function so the converter can calibrate activation ranges. The output is a .tflite file — a single FlatBuffer containing the model graph and quantized weights. Use xxd -i model.tflite > model_data.h to embed the model as a C array for TFLM.
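If xxd isn't available (e.g. on Windows), the embedding step is easy to reproduce in a few lines of Python. A minimal sketch — the function and variable names (tflite_to_c_array, g_model) are illustrative, not part of any TFLite API:

```python
# Pure-Python stand-in for `xxd -i model.tflite > model_data.h`
def tflite_to_c_array(tflite_path, header_path, var_name='g_model'):
    data = open(tflite_path, 'rb').read()
    lines = [f'const unsigned char {var_name}[] = {{']
    for i in range(0, len(data), 12):                       # 12 bytes per row
        row = ', '.join(f'0x{b:02x}' for b in data[i:i+12])
        lines.append(f'  {row},')
    lines.append('};')
    lines.append(f'const unsigned int {var_name}_len = {len(data)};')
    open(header_path, 'w').write('\n'.join(lines) + '\n')
```

The generated header is what the TFLM interpreter loads from flash — the model bytes become a const array linked into the firmware.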

Running Inference

Python TFLite interpreter: interpreter = tf.lite.Interpreter('model.tflite'), interpreter.allocate_tensors(), set input tensor, interpreter.invoke(), get output tensor. In C++ (TFLM): define an arena (static memory buffer), create resolver with needed ops, load model from flash, create interpreter, allocate, run. Total C++ code: ~30 lines. The model runs with zero heap allocation.

python
# TFLite: convert, quantize, and run inference
# pip install tensorflow numpy Pillow

import tensorflow as tf
import numpy as np
import time

# ── 1. Build and train a simple model ────────────────────
def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28,28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train/255.0, x_test/255.0

model = build_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_split=0.1, verbose=0)
print(f"Base accuracy: {model.evaluate(x_test, y_test, verbose=0)[1]:.3f}")

# ── 2. Convert to TFLite (float32) ───────────────────────
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model_fp32.tflite', 'wb') as f: f.write(tflite_model)
print(f"FP32 size: {len(tflite_model)/1024:.1f} KB")

# ── 3. Convert with INT8 quantization ────────────────────
converter_q = tf.lite.TFLiteConverter.from_keras_model(model)
converter_q.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_data():
    for i in range(100):
        yield [x_train[i:i+1].astype(np.float32)]

converter_q.representative_dataset = representative_data
converter_q.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter_q.inference_input_type  = tf.uint8
converter_q.inference_output_type = tf.uint8
tflite_quant = converter_q.convert()
with open('model_int8.tflite', 'wb') as f: f.write(tflite_quant)
print(f"INT8 size: {len(tflite_quant)/1024:.1f} KB")

# ── 4. Run inference with TFLite interpreter ─────────────
def make_predictor(model_path):
    """Build the interpreter once; return a closure that runs inference."""
    interp = tf.lite.Interpreter(model_path=model_path)
    interp.allocate_tensors()
    in_idx  = interp.get_input_details()[0]['index']
    out_idx = interp.get_output_details()[0]['index']
    def predict(input_data):
        interp.set_tensor(in_idx, input_data)
        interp.invoke()
        return interp.get_tensor(out_idx)
    return predict

# Benchmark — build the interpreter outside the loop so we time
# inference, not interpreter construction
predict_fp32 = make_predictor('model_fp32.tflite')
sample = x_test[:1].astype(np.float32)
t0 = time.perf_counter()
for _ in range(1000): predict_fp32(sample)
fp32_ms = (time.perf_counter() - t0) * 1000

print(f"\nFP32 inference: {fp32_ms:.1f} ms for 1000 runs")
print(f"INT8 model is {len(tflite_model)/len(tflite_quant):.1f}x smaller")
💡
For TFLM deployment, the entire model must fit in flash and the inference arena must fit in RAM. An Arduino Nano 33 BLE Sense has 1MB flash and 256KB RAM. A 40KB INT8 model with 64KB arena fits comfortably. Profile with interpreter.get_tensor_details() to find the peak activation memory during inference.
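Following the tip above, here is one way to get a rough memory number from the interpreter. A sketch, assuming the model_int8.tflite file produced earlier; note that summing all tensor sizes overestimates the arena (TFLM reuses activation buffers), so treat it as a safe upper bound:

```python
import numpy as np
import tensorflow as tf

def tensor_memory_kb(model_path):
    """Upper-bound estimate: total bytes across all model tensors."""
    interp = tf.lite.Interpreter(model_path=model_path)
    interp.allocate_tensors()
    total = 0
    for t in interp.get_tensor_details():
        total += int(np.prod(t['shape'])) * np.dtype(t['dtype']).itemsize
    return total / 1024

# e.g. tensor_memory_kb('model_int8.tflite')
```

If this bound already exceeds the board's RAM, the model won't fit no matter how the arena is tuned; if it fits, size the arena experimentally from there.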
📝 Day 3 Exercise
Deploy a Model to Raspberry Pi
  1. Train the MNIST model and convert to both FP32 and INT8 TFLite formats. Record sizes.
  2. Copy the .tflite file to a Raspberry Pi. Install: pip3 install tflite-runtime (lighter than full TF).
  3. Run inference on the Pi. Benchmark: how many images/second for FP32 vs INT8?
  4. Install TFLite with XNNPACK delegate enabled (for ARM NEON acceleration). Does speed improve?
  5. Run the same INT8 model on Google Coral USB Accelerator if available. Compare Coral vs Pi CPU inference speed.
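For steps 2–3, a benchmark script along these lines works on the Pi. A sketch: the images_per_second helper is illustrative, and it assumes the .tflite files from the lesson have been copied to the device. tflite_runtime exposes the same Interpreter API as tf.lite, so the script falls back to full TensorFlow on a dev machine:

```python
try:
    from tflite_runtime.interpreter import Interpreter   # lean install on the Pi
except ImportError:
    import tensorflow as tf                              # fallback on a dev machine
    Interpreter = tf.lite.Interpreter
import numpy as np
import time

def images_per_second(model_path, n=200):
    interp = Interpreter(model_path=model_path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    # Random data with the right shape/dtype is enough for a
    # throughput measurement (accuracy is not checked here)
    if np.issubdtype(inp['dtype'], np.integer):
        x = np.random.randint(0, 256, size=inp['shape']).astype(inp['dtype'])
    else:
        x = np.random.rand(*inp['shape']).astype(inp['dtype'])
    t0 = time.perf_counter()
    for _ in range(n):
        interp.set_tensor(inp['index'], x)
        interp.invoke()
        _ = interp.get_tensor(out['index'])
    return n / (time.perf_counter() - t0)

# e.g. images_per_second('model_fp32.tflite') vs images_per_second('model_int8.tflite')
```

Run it for both files and divide to get the FP32-vs-INT8 speedup for step 3.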

Day 3 Summary

  • TFLite: convert Keras model to .tflite FlatBuffer, deploy to Pi/Android/embedded
  • INT8 quantization: representative_dataset calibrates activations; target INT8 ops for full quantization
  • TFLite Micro: no heap allocation, 16-256KB footprint, C array model embedding for MCUs
  • XNNPACK delegate uses ARM NEON SIMD — often 3-5x faster than scalar inference on Pi
Challenge

Deploy a keyword spotting model (TensorFlow Speech Commands dataset) to an Arduino Nano 33 BLE Sense. The board has a microphone and runs TFLM. Use the pre-built micro_speech example from the Arduino_TensorFlowLite library, or build the pipeline with Edge Impulse. Train a model to recognize 'yes' and 'no'. What is the false positive rate? How does adding more keywords affect accuracy?
