Performance

C++ State Machine benchmarked — switch/case vs GoF unique_ptr vs C++23 variant/visit

C++ State Machine, Three Ways: Switch/Case, GoF unique_ptr, and C++23 Variant — Benchmarked

Three implementations of the same aircraft lifecycle FSM — a switch/case flat machine, the classic GoF State pattern with unique_ptr, and a C++23 compile-time variant/visit design — compiled with GCC 14 at -O2 -std=c++23, measured with nanobench, and linked to four live Godbolt sessions. The headline numbers: for a full seven-event mission cycle, the variant approach is 6× faster than OOP and switch/case is essentially free (the optimizer evaluates the constant-input trace at compile time). On the steady-state telemetry hot path, switch runs at 0.56 ns/event, variant at 1.07 ns/event, OOP at 1.20 ns/event. The culprit is not virtual dispatch — it is heap allocation. A follow-up section adds get_if chains and the [[likely]] attribute: on single-variant dispatch all four strategies land within 0.11 ns.
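
To make the allocation point concrete, here is a minimal sketch of the variant/visit style. It is not the post's code: the state names (`Parked`, `Taxiing`, `Airborne`), the `Event` enum, and the `next` function are hypothetical stand-ins for the aircraft lifecycle. The point it illustrates is that a transition returns the new state by value, whereas a GoF-style machine holding a `std::unique_ptr` to a state object typically allocates a new object on each transition.

```cpp
#include <variant>

// Hypothetical states and events, chosen only for illustration.
struct Parked {};
struct Taxiing {};
struct Airborne {};
using State = std::variant<Parked, Taxiing, Airborne>;

enum class Event { StartTaxi, TakeOff, Land };

// Standard "overloaded" helper that merges several lambdas into one visitor.
template <class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template <class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Returns the next state by value; events that do not apply leave the state unchanged.
// The state lives inside the variant, so no heap allocation happens on a transition.
inline State next(const State& s, Event e) {
    return std::visit(overloaded{
        [e](Parked) -> State {
            return e == Event::StartTaxi ? State{Taxiing{}} : State{Parked{}};
        },
        [e](Taxiing) -> State {
            return e == Event::TakeOff ? State{Airborne{}} : State{Taxiing{}};
        },
        [e](Airborne) -> State {
            return e == Event::Land ? State{Parked{}} : State{Airborne{}};
        },
    }, s);
}
```

A caller would start from `State s = Parked{};` and feed events with `s = next(s, Event::StartTaxi);`, checking the current state with `std::holds_alternative`.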

H.264 vs H.265 vs AV1 end-to-end latency on Jetson Orin — grouped bar chart

H.264, H.265, and AV1 on Jetson Orin: A Real Hardware Latency Benchmark

A rigorous per-stage latency benchmark across H.264, H.265, and AV1 hardware codecs on NVIDIA Jetson Orin (JetPack 6), measuring encode, wire, and decode separately at FHD and HD resolutions. AV1 wins end-to-end at 104 ms FHD / 86 ms HD. H.264 is the worst choice despite being the oldest: its nvv4l2decoder holds ~4 frames in an internal DPB buffer, adding 130–170 ms of hidden latency. Wire latency is governed by parse-element lookahead, not byte volume. Clock sync achieves ±234 µs via chrony. Full pipeline source, CSVs, and reproduction steps included.

Cross-process zero-copy NVMM IPC on Jetson — dma-buf fd passing, NvBufSurfaceImport, lock-free pool

Cross-Process Zero-Copy on Jetson: dma-buf fds, NvBufSurfaceImport, and a Cache-Line-Padded Pool

Two processes on a Jetson, one camera frame in NVMM (GPU memory), no copies. The kernel does the heavy lifting via dma-buf fds; SCM_RIGHTS carries the fd across the process boundary; NvBufSurfaceImport reconstructs the surface on the consumer side; a cache-line-padded ring of atomic ref-counts keeps fan-out coherent without locks. With benchmark numbers and a Godbolt-runnable demo of the SCM_RIGHTS pattern.
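
The SCM_RIGHTS step can be sketched with plain POSIX calls. This is a generic sketch, not the post's demo: `send_fd` and `recv_fd` are hypothetical helper names, and the descriptor here is any open fd. On Jetson the fd would be the dma-buf fd of the NVMM buffer, and the consumer would hand the received fd to NvBufSurfaceImport; that call is left out because its exact parameters are not given here.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <cstring>

// Send one file descriptor over a connected AF_UNIX socket.
// Returns true if the one-byte payload (and the attached fd) was sent.
bool send_fd(int sock, int fd) {
    char byte = 'F';  // one byte of ordinary data to carry the control message
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;  // "this control message carries file descriptors"
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1;
}

// Receive one file descriptor. Returns the new fd in this process, or -1 on failure.
// The kernel installs a fresh descriptor that refers to the same open file.
int recv_fd(int sock) {
    char byte = 0;
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    if (recvmsg(sock, &msg, 0) != 1) return -1;

    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    if (c == nullptr || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;

    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The received number usually differs from the one that was sent; only the underlying open file is shared, which is why the buffer contents never need to be copied.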

C++ low-latency patterns benchmarked — HFT, cache-friendly layouts, and the tricks that don't reproduce

C++ Low-Latency Patterns, Benchmarked: 15 Tricks from HFT and CppCon 2025 (and Which Claims Don't Reproduce)

Fifteen C++ performance patterns from three sources (Bilokon & Gunduz’s HFT paper at Imperial, Jonathan Müller’s Cache-Friendly C++ deck at CppCon 2025, and Okade & Baker’s C++ Performance Tips at CppCon 2025), implemented as single-file nanobench programs, run under Docker on GCC 14 with -O2 -std=c++23, every one of them linked to a working Godbolt session. The surprising half: three of the ‘textbook’ speedups do not reproduce on a modern CPU. Cache warming is 7x slower than cold. Bitmask branch reduction is slower than the cascade it replaces. This post is the full runbook, numbers included, with the nuance that explains when each pattern actually earns its keep.

C++ tools that enforce low-latency patterns — builtins, compiler flags, clang-tidy checks

C++ Low-Latency, Enforced: __builtin_*, Compiler Flags, and clang-tidy, Benchmarked

Follow-up to the 15-pattern HFT post. The first post asked ‘which patterns work?’ This post answers ‘how do you force your team to keep using them?’ Seven new nanobench programs, each with a working Godbolt link, covering __builtin_popcountll (24x faster than a bit-count loop), __builtin_unreachable in switch defaults (1.55x), __builtin_bswap64 (same speed as the portable idiom — GCC already folds it), [[likely]] and [[unlikely]] (no measurable effect on tight loops — an honest null result), and a flags matrix showing -ffast-math + -march=x86-64-v3 giving 6.9x over -O2. Then a .clang-tidy config that fails CI on every common perf regression, with the performance-for-range-copy warning demonstrated at 41x real runtime cost.
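
As a rough illustration of the first comparison, the sketch below puts a naive bit-count loop next to `__builtin_popcountll`. The function names are hypothetical and this is not the post's benchmark code. The builtin is a GCC/Clang extension; whether it compiles to a single `popcnt` instruction depends on the target flags (for example `-mpopcnt` or a suitable `-march`), otherwise the compiler falls back to a software routine.

```cpp
#include <cstdint>

// Naive reference: test one bit per iteration, up to 64 iterations.
inline int popcount_loop(std::uint64_t x) {
    int n = 0;
    while (x != 0) {
        n += static_cast<int>(x & 1u);
        x >>= 1;
    }
    return n;
}

// GCC/Clang builtin: counts the set bits of an unsigned long long.
inline int popcount_builtin(std::uint64_t x) {
    return __builtin_popcountll(x);
}
```

Both functions return the same count for every input; only the generated code differs. In C++20 and later, `std::popcount` from `<bit>` is the portable spelling of the same operation.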

Cleaning Up, Pipelining, and Bake-Testing the STM32H750 Tracker

A sequel to the first STM32H750 tracker post. After the C++ port was proven in production, I spent a week of evenings cutting dead vendor code, splitting the algorithm out to host for unit tests, wiring the LCD SPI through DMA to let the CPU run the tracker in parallel with the blit, unlocking the camera’s real frame rate, chasing a subtle BB-drift bug back to a too-wide SAD search, and finally building an offline A/B harness that compares SAD, NCC, and MOSSE on four synthetic scenarios so the next tracker port is a data decision, not a vibes one.

Building a Template-Matching Tracker on an STM32H750: What Worked, What Didn't

A long, honest retrospective on turning a WeAct STM32H750 board, a cheap OV7725 IR camera, and an Xbox controller into a live template-matching tracker. Every dead end, every wrong assumption, every fix — and what I’d do differently next time.

Pathfinding algorithms for games — A*, JPS, Theta*, flow fields, visibility graphs

Game Pathfinding Algorithms, Benchmarked: A*, JPS, Theta*, Flow Fields, Visibility Graphs

Five pathfinders implemented in C++23, each in a single Godbolt-ready file, benchmarked on the same grids. A* is the baseline. JPS expands 22x fewer nodes on open maps yet runs slower than A* in naive form. Theta* produces shorter any-angle paths at 2-8x the cost. Flow fields dominate when many agents share a goal. Visibility graphs — AoE II DE’s approach — need 5 waypoints where A* needs 600. Plus the StarCraft ‘harvesters ignore collisions’ hack and why SC2 switched to navmeshes.

O3DE multi-camera rendering performance analysis

Chasing 18 Milliseconds: A Performance Deep Dive into O3DE's Render Readback Pipeline

We spent a full session systematically profiling O3DE’s multi-camera streaming pipeline, testing eight different optimization approaches, and pinpointed the exact bottleneck: 18 ms of fixed overhead in the AttachmentReadback scope system. Here’s what we tried, what we measured, and what it means for the engine.

Three live Godot camera streams over RTP/UDP rendered by GStreamer clients

From Unity to Godot: Multi-Camera Streaming at 50 FPS with Async GPU Readback

After O3DE’s 18 ms frame-graph readback made 30 FPS streaming impossible, we tried Godot. It got us there — eventually. This is the full path from 105 FPS on nothing to 50 FPS per camera with three live RTP streams, including every wrong turn and every underdocumented Godot behavior we hit on the way.