C++ | Pavel Guzenfeld

Zero-copy GStreamer tracking pipeline on Jetson Orin

From a PyTorch Tracker to a Zero-Copy GStreamer Pipeline: Rebuilding SAM2.1/SAMURAI on Jetson, Step by Step

A long, hands-on account of turning a research-grade PyTorch visual tracker (SAM2.1 / SAMURAI) into a real-time, zero-copy GStreamer + CUDA + TensorRT pipeline on a Jetson Orin NX: decomposing the model into engines, exporting and parity-validating each one, porting the per-frame math to CUDA, packaging it as a GStreamer element, and raising throughput from 8 fps to 24 fps with queues and frame-skipping. Every stage is validated against a golden reference.

The Road to Boost.SML 1.2.0: New API and a Type-Name Heisenbug

A warts-and-all account of cutting the Boost.SML 1.2.0 release — four new public APIs, a behavior change rooted in undefined behavior, a type-name Heisenbug that only fired on GCC and MSVC, a dead “Run” button, and the surprisingly deep rabbit hole of making 30 Compiler Explorer links that actually work.

C++ State Machine benchmarked — switch/case vs GoF unique_ptr vs C++23 variant/visit

C++ State Machine, Three Ways: Switch/Case, GoF unique_ptr, and C++23 Variant — Benchmarked

Three implementations of the same aircraft lifecycle FSM — a switch/case flat machine, the classic GoF State pattern with unique_ptr, and a C++23 compile-time variant/visit design — compiled with GCC 14 at -O2 -std=c++23, measured with nanobench, and linked to four live Godbolt sessions. The headline numbers: for a full seven-event mission cycle, the variant approach is 6× faster than OOP and switch/case is essentially free (the optimizer evaluates the constant-input trace at compile time). On the steady-state telemetry hot path, switch runs at 0.56 ns/event, variant at 1.07 ns/event, OOP at 1.20 ns/event. The culprit is not virtual dispatch — it is heap allocation. A follow-up section adds get_if chains and the [[likely]] attribute: on single-variant dispatch all four strategies land within 0.11 ns.
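For readers who have not seen the variant/visit style, here is a minimal sketch of the idea. The states, events, and transitions are hypothetical stand-ins, not the post's actual aircraft FSM, and `step` and `check_trace` are illustrative names. What it shows is the general shape: the current state is held by value in a `std::variant`, so a transition never touches the heap.

```cpp
#include <cassert>
#include <variant>

// Hypothetical states and events; the post's real FSM has more of both.
struct Parked {};
struct Taxiing {};
struct Airborne {};
using State = std::variant<Parked, Taxiing, Airborne>;

struct StartTaxi {};
struct Takeoff {};
struct Land {};
using Event = std::variant<StartTaxi, Takeoff, Land>;

// One call operator per handled (state, event) pair.
struct Transition {
    State operator()(Parked, StartTaxi) const { return Taxiing{}; }
    State operator()(Taxiing, Takeoff) const { return Airborne{}; }
    State operator()(Airborne, Land) const { return Taxiing{}; }
    // Fallback: any unhandled pair leaves the state unchanged.
    template <class S, class E>
    State operator()(S s, E) const { return s; }
};

// Double dispatch over both variants: no heap allocation, no virtual calls.
inline State step(const State& s, const Event& e) {
    return std::visit(Transition{}, s, e);
}

// Usage sketch: a short, made-up event trace.
inline void check_trace() {
    State s = Parked{};
    s = step(s, StartTaxi{});
    s = step(s, Takeoff{});
    assert(std::holds_alternative<Airborne>(s));
    s = step(s, StartTaxi{});  // unhandled while airborne: state is unchanged
    assert(std::holds_alternative<Airborne>(s));
}
```

The non-template overloads win overload resolution for the pairs they name, so the template only catches combinations with no explicit transition.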

Cross-process zero-copy NVMM IPC on Jetson — dma-buf fd passing, NvBufSurfaceImport, lock-free pool

Cross-Process Zero-Copy on Jetson: dma-buf fds, NvBufSurfaceImport, and a Cache-Line-Padded Pool

Two processes on a Jetson, one camera frame in NVMM (GPU memory), no copies. The kernel does the heavy lifting via dma-buf fds; SCM_RIGHTS carries the fd across the process boundary; NvBufSurfaceImport reconstructs the surface on the consumer side; a cache-line-padded ring of atomic ref-counts keeps fan-out coherent without locks. With benchmark numbers and a Godbolt-runnable demo of the SCM_RIGHTS pattern.
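The fd-passing step can be sketched on any Linux box without Jetson hardware. This is an assumption-laden illustration, not the post's own demo: `send_fd`, `recv_fd`, and `round_trip_demo` are hypothetical helper names, a pipe descriptor stands in for the dma-buf fd, and both ends live in one process for brevity. The NvBufSurfaceImport step on the consumer side is not shown.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Send one descriptor over a connected AF_UNIX socket as SCM_RIGHTS ancillary data.
inline bool send_fd(int sock, int fd) {
    assert(sock >= 0 && fd >= 0);
    char byte = 0;  // one byte of ordinary payload accompanies the control message
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1;
}

// Receive one descriptor; returns -1 on failure. The kernel installs a new
// descriptor number in the receiver that refers to the same open file.
inline int recv_fd(int sock) {
    char byte = 0;
    iovec iov{&byte, 1};
    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsghdr* c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS) return -1;
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

// Single-process round trip: pass a pipe's read end through a socketpair,
// then read through the received copy.
inline bool round_trip_demo() {
    int sv[2], p[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return false;
    if (pipe(p) != 0) { close(sv[0]); close(sv[1]); return false; }
    bool ok = send_fd(sv[0], p[0]);
    const int got = ok ? recv_fd(sv[1]) : -1;
    char out = 0;
    ok = ok && got >= 0 && write(p[1], "x", 1) == 1 && read(got, &out, 1) == 1 && out == 'x';
    if (got >= 0) close(got);
    close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
    return ok;
}
```

In a real two-process setup the socketpair would be replaced by a named or abstract AF_UNIX socket, with the producer calling `send_fd` and the consumer calling `recv_fd`.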

C++ low-latency patterns benchmarked — HFT, cache-friendly layouts, and the tricks that don't reproduce

C++ Low-Latency Patterns, Benchmarked: 15 Tricks from HFT and CppCon 2025 (and Which Claims Don't Reproduce)

Fifteen C++ performance patterns from three sources (Bilokon & Gunduz’s HFT paper at Imperial, Jonathan Müller’s Cache-Friendly C++ deck at CppCon 2025, and Okade & Baker’s C++ Performance Tips at CppCon 2025), implemented as single-file nanobench programs, run under Docker on GCC 14 with -O2 -std=c++23, each linked to a working Godbolt session. The surprising half: three of the ‘textbook’ speedups do not reproduce on a modern CPU. Cache warming is 7x slower than cold. Bitmask branch reduction is slower than the cascade it replaces. This post is the full runbook, numbers included, with the nuance that explains when each pattern actually earns its keep.

C++ tools that enforce low-latency patterns — builtins, compiler flags, clang-tidy checks

C++ Low-Latency, Enforced: __builtin_*, Compiler Flags, and clang-tidy, Benchmarked

Follow-up to the 15-pattern HFT post. The first post asked ‘which patterns work?’ This post answers ‘how do you force your team to keep using them?’ Seven new nanobench programs, each with a working Godbolt link, covering __builtin_popcountll (24x faster than a bit-count loop), __builtin_unreachable in switch defaults (1.55x), __builtin_bswap64 (same speed as the portable idiom — GCC already folds it), [[likely]] and [[unlikely]] (no measurable effect on tight loops — an honest null result), and a flags matrix showing -ffast-math + -march=x86-64-v3 giving 6.9x over -O2. Then a .clang-tidy config that fails CI on every common perf regression, with the performance-for-range-copy warning demonstrated at 41x real runtime cost.
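As a rough illustration of the first comparison, the two contenders might look like the sketch below. The function names are hypothetical and this is not the post's benchmark code; it only shows a plain bit-count loop next to the GCC/Clang builtin, plus a cross-check one would run before timing anything.

```cpp
#include <cassert>
#include <cstdint>

// Portable baseline: shift the word right and add up the low bit each time.
inline int popcount_loop(std::uint64_t x) {
    int n = 0;
    for (; x != 0; x >>= 1) n += static_cast<int>(x & 1u);
    return n;
}

// GCC/Clang builtin. With a suitable -march it usually becomes one POPCNT instruction.
inline int popcount_builtin(std::uint64_t x) {
    return __builtin_popcountll(x);
}

// Sanity check: both versions must agree before any timing is meaningful.
inline void check_agree(std::uint64_t x) {
    assert(popcount_loop(x) == popcount_builtin(x));
}
```

Since C++20, `std::popcount` from `<bit>` is the portable spelling of the same operation.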

Cleaning Up, Pipelining, and Bake-Testing the STM32H750 Tracker

A sequel to the first STM32H750 tracker post. After the C++ port was proven in production, I spent a week of evenings cutting dead vendor code, splitting the algorithm out to host for unit tests, wiring the LCD SPI through DMA to let the CPU run the tracker in parallel with the blit, unlocking the camera’s real frame rate, chasing a subtle BB-drift bug back to a too-wide SAD search, and finally building an offline A/B harness that compares SAD, NCC, and MOSSE on four synthetic scenarios so the next tracker port is a data decision, not a vibes one.
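To make the SAD search concrete, here is a minimal host-side sketch. `Image`, `Match`, `sad_at`, and `track` are hypothetical names, not the firmware's, and the real tracker's image format and search strategy may differ. The `radius` parameter is the knob in question: the wider the window, the more candidate positions there are that can score as well as the true one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// 8-bit grayscale image, row-major.
struct Image {
    int w, h;
    std::vector<std::uint8_t> px;
    std::uint8_t at(int x, int y) const { return px[static_cast<std::size_t>(y * w + x)]; }
};

struct Match { int x; int y; std::uint32_t sad; };

// Sum of absolute differences between the template and the frame patch at (ox, oy).
inline std::uint32_t sad_at(const Image& frame, const Image& tmpl, int ox, int oy) {
    std::uint32_t sum = 0;
    for (int y = 0; y < tmpl.h; ++y)
        for (int x = 0; x < tmpl.w; ++x)
            sum += static_cast<std::uint32_t>(
                std::abs(int(frame.at(ox + x, oy + y)) - int(tmpl.at(x, y))));
    return sum;
}

// Exhaustive search in a (2*radius+1)^2 window around the previous position,
// clamped so the template always stays inside the frame.
inline Match track(const Image& frame, const Image& tmpl, int prev_x, int prev_y, int radius) {
    assert(tmpl.w <= frame.w && tmpl.h <= frame.h && radius >= 0);
    Match best{prev_x, prev_y, std::numeric_limits<std::uint32_t>::max()};
    const int x0 = std::max(0, prev_x - radius);
    const int x1 = std::min(frame.w - tmpl.w, prev_x + radius);
    const int y0 = std::max(0, prev_y - radius);
    const int y1 = std::min(frame.h - tmpl.h, prev_y + radius);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x) {
            const std::uint32_t s = sad_at(frame, tmpl, x, y);
            if (s < best.sad) best = Match{x, y, s};  // strict '<' keeps the first of any ties
        }
    return best;
}
```

Running the same function on synthetic frames with different `radius` values is one simple way to set up the kind of A/B comparison the summary mentions.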

Building a Template-Matching Tracker on an STM32H750: What Worked, What Didn't

A long, honest retrospective on turning a WeAct STM32H750 board, a cheap OV7725 IR camera, and an Xbox controller into a live template-matching tracker. Every dead end, every wrong assumption, every fix — and what I’d do differently next time.

Pathfinding algorithms for games — A*, JPS, Theta*, flow fields, visibility graphs

Game Pathfinding Algorithms, Benchmarked: A*, JPS, Theta*, Flow Fields, Visibility Graphs

Five pathfinders implemented in C++23, each in a single Godbolt-ready file, benchmarked on the same grids. A* is the baseline. JPS expands 22x fewer nodes on open maps yet runs slower than A* in naive form. Theta* produces shorter any-angle paths at 2-8x the cost. Flow fields dominate when many agents share a goal. Visibility graphs — AoE II DE’s approach — need 5 waypoints where A* needs 600. Plus the StarCraft ‘harvesters ignore collisions’ hack and why SC2 switched to navmeshes.
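For reference, a baseline grid A* can be sketched in a few dozen lines. This is an illustrative version under stated assumptions (4-connected grid, unit step cost, Manhattan heuristic, hypothetical names `astar` and `Result`), not the post's implementation. It returns the expanded-node count alongside the path cost, since node expansions are the metric the JPS comparison is quoted in.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Result { int cost; int expanded; };  // cost == -1 when no path exists

// Grid rows are strings; '#' is a wall, anything else is walkable.
inline Result astar(const std::vector<std::string>& grid, int sx, int sy, int gx, int gy) {
    const int h = static_cast<int>(grid.size());
    assert(h > 0);
    const int w = static_cast<int>(grid[0].size());
    auto heuristic = [&](int x, int y) { return std::abs(x - gx) + std::abs(y - gy); };
    std::vector<int> g(static_cast<std::size_t>(w * h), -1);  // best known cost, -1 = unseen
    std::vector<char> closed(g.size(), 0);
    using Node = std::pair<int, int>;                         // (f = g + h, cell index)
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> open;
    g[sy * w + sx] = 0;
    open.push({heuristic(sx, sy), sy * w + sx});
    int expanded = 0;
    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    while (!open.empty()) {
        const int cur = open.top().second;
        open.pop();
        if (closed[cur]) continue;  // stale queue entry for an already expanded cell
        closed[cur] = 1;
        ++expanded;
        const int cx = cur % w, cy = cur / w;
        if (cx == gx && cy == gy) return {g[cur], expanded};
        for (int i = 0; i < 4; ++i) {
            const int nx = cx + dx[i], ny = cy + dy[i];
            if (nx < 0 || ny < 0 || nx >= w || ny >= h || grid[ny][nx] == '#') continue;
            const int n = ny * w + nx, cand = g[cur] + 1;
            if (g[n] == -1 || cand < g[n]) {
                g[n] = cand;
                open.push({cand + heuristic(nx, ny), n});
            }
        }
    }
    return {-1, expanded};
}
```

Because the Manhattan heuristic is consistent on a 4-connected unit-cost grid, a cell's cost is final the first time it is popped, which is why the closed check is safe here.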

O3DE multi-camera rendering performance analysis

Chasing 18 Milliseconds: A Performance Deep Dive into O3DE's Render Readback Pipeline

We spent a full session systematically profiling O3DE’s multi-camera streaming pipeline and testing eight different optimization approaches, and we pinpointed the exact bottleneck: 18 ms of fixed overhead in the AttachmentReadback scope system. Here’s what we tried, what we measured, and what it means for the engine.