Posts

Zero-copy GStreamer tracking pipeline on Jetson Orin

From a PyTorch Tracker to a Zero-Copy GStreamer Pipeline: Rebuilding SAM2.1/SAMURAI on Jetson, Step by Step

A long, hands-on account of turning a research-grade PyTorch visual tracker (SAM2.1 / SAMURAI) into a real-time, zero-copy GStreamer + CUDA + TensorRT pipeline on a Jetson Orin NX: decomposing the model into engines, exporting and parity-validating each one, porting the per-frame math to CUDA, packaging it as a GStreamer element, and raising throughput from 8 fps to 24 fps with queues and frame-skipping. Every stage is validated against a golden reference.

A cropped sensor view handing pan authority off to a physical gimbal

Two Pans, One Stick: Blending a Digital Crop Pan with a Physical Gimbal

A long-range zoom camera pans two ways at once — by sliding a crop window across the sensor, and by physically rotating the gimbal underneath it. Here’s how to make one joystick drive both so the operator never feels the seam, the maths of the handoff, and the per-axis bug that kept the gimbal rolling after the stick let go.

The Road to Boost.SML 1.2.0: New API and a Type-Name Heisenbug

A warts-and-all account of cutting the Boost.SML 1.2.0 release — four new public APIs, a behavior change rooted in undefined behavior, a type-name Heisenbug that only fired on GCC and MSVC, a dead “Run” button, and the surprisingly deep rabbit hole of making 30 Compiler Explorer links that actually work.

Hunting an undocumented AHRS on the serial bus

Which tty Is the AHRS? Hunting an Undocumented Serial Device on a Jetson

A small AHRS was wired to a Jetson over USB, but nobody wrote down which serial port. Here’s how I tracked it down by its protocol instead of its name, fell into the classic dialout permissions trap, and decoded its orientation stream into human-readable numbers.

C++ State Machine benchmarked — switch/case vs GoF unique_ptr vs C++23 variant/visit

C++ State Machine, Three Ways: Switch/Case, GoF unique_ptr, and C++23 Variant — Benchmarked

Three implementations of the same aircraft lifecycle FSM (a switch/case flat machine, the classic GoF State pattern with unique_ptr, and a C++23 compile-time variant/visit design), compiled with GCC 14 at -O2 -std=c++23, measured with nanobench, and linked to four live Godbolt sessions. The headline numbers: for a full seven-event mission cycle, the variant approach is 6× faster than OOP and switch/case is essentially free (the optimizer evaluates the constant-input trace at compile time). On the steady-state telemetry hot path, switch runs at 0.56 ns/event, variant at 1.07 ns/event, OOP at 1.20 ns/event. The culprit behind the OOP slowdown is heap allocation, not virtual dispatch. A follow-up section adds get_if chains and the [[likely]] attribute: on single-variant dispatch all four strategies land within 0.11 ns of each other.

H.264 vs H.265 vs AV1 end-to-end latency on Jetson Orin — grouped bar chart

H.264, H.265, and AV1 on Jetson Orin: A Real Hardware Latency Benchmark

A rigorous per-stage latency benchmark across H.264, H.265, and AV1 hardware codecs on NVIDIA Jetson Orin (JetPack 6), measuring encode, wire, and decode separately at FHD and HD resolutions. AV1 wins end-to-end at 104 ms FHD / 86 ms HD. H.264 is the worst choice despite being the oldest: its nvv4l2decoder holds ~4 frames in its internal decoded picture buffer (DPB), adding 130–170 ms of hidden latency. Wire latency is governed by parse-element lookahead, not byte volume. Clocks are synchronized to within ±234 µs using chrony. Full pipeline source, CSVs, and reproduction steps included.

Cross-process zero-copy NVMM IPC on Jetson — dma-buf fd passing, NvBufSurfaceImport, lock-free pool

Cross-Process Zero-Copy on Jetson: dma-buf fds, NvBufSurfaceImport, and a Cache-Line-Padded Pool

Two processes on a Jetson, one camera frame in NVMM (GPU memory), no copies. The kernel does the heavy lifting via dma-buf fds; SCM_RIGHTS carries the fd across the process boundary; NvBufSurfaceImport reconstructs the surface on the consumer side; a cache-line-padded ring of atomic ref-counts keeps fan-out coherent without locks. With benchmark numbers and a Godbolt-runnable demo of the SCM_RIGHTS pattern.

C++ low-latency patterns benchmarked — HFT, cache-friendly layouts, and the tricks that don't reproduce

C++ Low-Latency Patterns, Benchmarked: 15 Tricks from HFT and CppCon 2025 (and Which Claims Don't Reproduce)

Fifteen C++ performance patterns from three sources (Bilokon & Gunduz’s HFT paper at Imperial, Jonathan Müller’s Cache-Friendly C++ deck at CppCon 2025, and Okade & Baker’s C++ Performance Tips at CppCon 2025), implemented as single-file nanobench programs, run under Docker on GCC 14 with -O2 -std=c++23, each linked to a working Godbolt session. The surprising part: three of the ‘textbook’ speedups do not reproduce on a modern CPU. Cache warming is 7x slower than running cold. Bitmask branch reduction is slower than the cascade it replaces. This post is the full runbook, numbers included, with the nuance that explains when each pattern actually earns its keep.

C++ tools that enforce low-latency patterns — builtins, compiler flags, clang-tidy checks

C++ Low-Latency, Enforced: __builtin_*, Compiler Flags, and clang-tidy, Benchmarked

Follow-up to the 15-pattern HFT post. The first post asked ‘which patterns work?’ This post answers ‘how do you force your team to keep using them?’ Seven new nanobench programs, each with a working Godbolt link, covering __builtin_popcountll (24x faster than a bit-count loop), __builtin_unreachable in switch defaults (1.55x), __builtin_bswap64 (same speed as the portable idiom — GCC already folds it), [[likely]] and [[unlikely]] (no measurable effect on tight loops — an honest null result), and a flags matrix showing -ffast-math + -march=x86-64-v3 giving 6.9x over -O2. Then a .clang-tidy config that fails CI on every common perf regression, with the performance-for-range-copy warning demonstrated at 41x real runtime cost.

Cleaning Up, Pipelining, and Bake-Testing the STM32H750 Tracker

A sequel to the first STM32H750 tracker post. After the C++ port was proven in production, I spent a week of evenings cutting dead vendor code, splitting the algorithm out to host for unit tests, wiring the LCD SPI through DMA so the CPU can run the tracker in parallel with the blit, unlocking the camera’s real frame rate, and chasing a subtle bounding-box drift bug back to a too-wide SAD search. The final step was an offline A/B harness that compares SAD, NCC, and MOSSE on four synthetic scenarios, so the next tracker port is a data decision, not a vibes one.