A fully on-device PCB defect inspector running YOLOv8s and a 2B-param LLM on a Raspberry Pi 5.



A factory line operator doesn't want a Jupyter notebook. They want to point a camera at a board, hear "this one's fine" or "open circuit, top right", and move on. So that's what we built.
This was a two-person project with Abir Abidi. We split the work across research, training, and deployment, and shipped the full stack together.
The whole system runs on a Raspberry Pi 5 (8 GB) with a Pi Camera v3. Detection, language model, text-to-speech, and web server all live on the same board. No GPU at inference time, no cloud round-trip, no external API in the loop.
The task is to spot 6 classes of defects on a PCB (missing_hole, mouse_bite, open_circuit, short, spur, spurious_copper) in seconds, and explain the result to a non-technical operator.
Production hardware: Raspberry Pi 5 (8 GB) + Pi Camera Module v3. Training hardware: Azure VM with an NVIDIA T4 (16 GB), AMD EPYC 7V12, 54 GB RAM.
YOLOv8s for perception, Gemma-4-E2B-it (Q4_K_M, ~2B params) for narration, Piper for voice, FastAPI for transport, an Expo app over Cloudflare Tunnel for the operator. mAP@0.5 of 0.953. End-to-end inspection (capture, detect, narrate, speak) in about 21 seconds. Beats a fine-tuned VLM baseline by +91.3 points of exact-match accuracy and +0.90 of macro F1.

Everything lives on the Pi. The phone is just a remote control with a screen.
YOLOv8s, trained with 3-fold cross-validation on the public PCB defects dataset: 1386 images, a 95/5 train/val split, 180 epochs per fold.
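A minimal sketch of one fold, assuming the Ultralytics Python API; the per-fold data file and everything beyond the model variant and epoch count are illustrative, not the project's exact config.

```python
# Minimal sketch of one CV fold (assumed config; only the epoch count
# and the yolov8s variant come from the post).
from ultralytics import YOLO

model = YOLO("yolov8s.pt")       # small variant used in this project
model.train(
    data="pcb_fold1.yaml",       # hypothetical per-fold split file
    epochs=180,                  # 180 epochs per fold
    imgsz=640,                   # assumed input size
)
metrics = model.val()            # yields mAP@0.5, mAP@0.5:0.95, P, R
```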
| Metric | Value |
|---|---|
| mAP@0.5 | 0.953 |
| mAP@0.5:0.95 | 0.473 |
| Macro Precision | 0.962 |
| Macro Recall | 0.918 |
Per-class F1 stays above 0.85 across all 6 classes. The best class (short) hits 0.996. The hardest (open_circuit) lands at 0.853, which is consistent with the dataset itself: open circuits are visually the most ambiguous class.
| Class | Precision | Recall | F1 |
|---|---|---|---|
| missing_hole | 0.968 | 1.000 | 0.984 |
| mouse_bite | 0.921 | 0.950 | 0.935 |
| open_circuit | 0.860 | 0.847 | 0.853 |
| short | 0.993 | 1.000 | 0.996 |
| spur | 0.994 | 0.885 | 0.936 |
| spurious_copper | 0.914 | 0.885 | 0.899 |
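F1 here is the standard harmonic mean of precision and recall, so the table is easy to sanity-check:

```python
# Sanity check: recompute F1 from the table's precision and recall.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(round(f1(0.993, 1.000), 3))  # short: 0.996, matches the table
print(round(f1(0.860, 0.847), 3))  # open_circuit: 0.853
```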
Before settling on the hybrid pipeline, I evaluated a vision-language model doing the full task end-to-end. Same validation split, same prompt, full multi-label comparison.
| Metric | YOLOv8s | LLaVA-OneVision 0.5B + LoRA |
|---|---|---|
| Exact-match accuracy | 91.3 % | 0.0 % |
| Macro precision | 0.962 | 0.059 |
| Macro recall | 0.918 | 0.167 |
| Macro F1 | 0.939 | 0.043 |
| Mean Jaccard | 0.902 | 0.176 |
The VLM was retired after this run. A 0.5B generalist VLM hallucinated defect classes and never converged on a spatially fine PCB task at that parameter budget.
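For reference, exact match and Jaccard in the table above are computed per image over predicted and ground-truth label sets. A hedged sketch of that scoring; the set representation is my assumption, not something the evaluation code is guaranteed to share:

```python
# Multi-label scoring sketch: each image's prediction and ground truth
# are sets of defect class names (representation assumed, not stated).
def exact_match(preds: list[set], gts: list[set]) -> float:
    # Fraction of images where the predicted label set is exactly right.
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)

def mean_jaccard(preds: list[set], gts: list[set]) -> float:
    # Average intersection-over-union of the label sets.
    scores = []
    for p, g in zip(preds, gts):
        union = p | g
        scores.append(len(p & g) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```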
Measured during a real inspection session on the Pi 5.
| Stage | Duration | Share |
|---|---|---|
| Photo capture | 5 ms | 0.0 % |
| YOLOv8 inference | 1.71 s | 8.1 % |
| LLM inference (Gemma-4-E2B-it) | 14.44 s | 68.1 % |
| TTS (Piper) | 5.04 s | 23.7 % |
| End-to-end | 21.2 s | 100 % |
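A sketch of how such a per-stage breakdown can be collected; the context manager is mine, and the stage names mirror the table:

```python
# Per-stage wall-clock timing; each table share is a stage's duration
# divided by the end-to-end total.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    t0 = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - t0

# Usage, e.g.:
# with stage("yolo"):
#     results = model(frame)
```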
The LLM runs as a llama-server subprocess, loaded once at boot and never reloaded.

The API surface is three endpoints: /health, /telemetry, and /ws/start. YOLO, llama-server, and Piper are all loaded once during the FastAPI lifespan, so each request only pays inference cost. Annotated frames are served from a static /captures mount.

A 0.5B VLM trying to do both perception and language reached only 4.3 % macro F1. YOLOv8s handles the spatially fine detection it was designed for, and the 2B LLM only has to verbalize the structured detection output. That split is the single biggest reason the system works.
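A hedged sketch of that load-once pattern, built on FastAPI's lifespan hook. The endpoint and mount names come from the text above; the weights path, GGUF filename, and port are placeholders:

```python
# Load-once lifespan: models come up with the app, requests pay only
# inference cost. Paths and port are assumptions, not the real config.
import subprocess
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from ultralytics import YOLO

@asynccontextmanager
async def lifespan(app: FastAPI):
    app.state.yolo = YOLO("best.pt")              # assumed weights path
    app.state.llm = subprocess.Popen(
        ["llama-server", "-m", "gemma.gguf", "--port", "8080"]
    )                                             # loaded once, reused over HTTP
    # Piper is loaded the same way here (omitted for brevity).
    yield
    app.state.llm.terminate()                     # clean subprocess shutdown

app = FastAPI(lifespan=lifespan)
app.mount("/captures", StaticFiles(directory="captures"), name="captures")

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}
```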
The 2.5 GB GGUF is loaded once at boot and reused over HTTP. This removes a ~10 s cold start per inspection and isolates native crashes from the FastAPI process. If llama-server fails, the API stays up and surfaces the error to the operator instead of bringing the whole service down.
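A sketch of that crash-isolated call path, assuming llama-server's standard /completion endpoint on localhost; the timeout, token budget, and fallback text are illustrative:

```python
# Narration request to the long-lived llama-server; if the subprocess
# dies or hangs, the FastAPI process survives and the operator sees an
# error message instead of a dead service.
import requests

def narrate(prompt: str) -> str:
    try:
        r = requests.post(
            "http://127.0.0.1:8080/completion",
            json={"prompt": prompt, "n_predict": 64},  # assumed budget
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["content"]
    except requests.RequestException as exc:
        return f"Narration unavailable: {exc}"
```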
Google released Gemma-4 in April 2026, mid-development, so I migrated. The cost of the swap: latency went from milliseconds to about 14 s. The gain: complete defect coverage (13/13 vs 10/13 on the same prompt). For a one-shot quality-control loop where the operator is standing next to the board, correct answers are worth the extra latency.
68 % of every inspection is Gemma generating ~50 tokens on 4 Arm cores, roughly 3.5 tokens/s. The detector is not the bottleneck. The next optimization pass targets token budget and quantization, not the YOLO graph.
With no physical trigger button on the Pi kit, the mobile app became the control plane. The tunnel gave it remote reach with no extra infrastructure, which turned a missing button into a remote inspection feature.