Cross-process zero-copy NVMM IPC on Jetson — dma-buf fd passing, NvBufSurfaceImport, lock-free pool

Cross-Process Zero-Copy on Jetson: dma-buf fds, NvBufSurfaceImport, and a Cache-Line-Padded Pool

Two processes on a Jetson, one camera frame in NVMM (GPU memory), no copies. The kernel does the heavy lifting via dma-buf fds; SCM_RIGHTS carries the fd across the process boundary; NvBufSurfaceImport reconstructs the surface on the consumer side; a cache-line-padded ring of atomic ref-counts keeps fan-out coherent without locks. With benchmark numbers and a Godbolt-runnable demo of the SCM_RIGHTS pattern.

April 25, 2026 · 22 min · Pavel Guzenfeld
C++ low-latency patterns benchmarked — HFT, cache-friendly layouts, and the tricks that don't reproduce

C++ Low-Latency Patterns, Benchmarked: 15 Tricks from HFT and CppCon 2025 (and Which Claims Don't Reproduce)

Fifteen C++ performance patterns from three sources (Bilokon & Gunduz’s HFT paper at Imperial, Jonathan Müller’s Cache-Friendly C++ deck at CppCon 2025, and Okade & Baker’s C++ Performance Tips at CppCon 2025), implemented as single-file nanobench programs, run under Docker on GCC 14 with -O2 -std=c++23, every one of them linked to a working Godbolt session. The surprising half: three of the ‘textbook’ speedups do not reproduce on a modern CPU. Cache warming is 7x slower than cold. Bitmask branch reduction is slower than the cascade it replaces. This post is the full runbook, numbers included, with the nuance that explains when each pattern actually earns its keep.

April 23, 2026 · 28 min · Pavel Guzenfeld
C++ tools that enforce low-latency patterns — builtins, compiler flags, clang-tidy checks

C++ Low-Latency, Enforced: __builtin_*, Compiler Flags, and clang-tidy, Benchmarked

Follow-up to the 15-pattern HFT post. The first post asked ‘which patterns work?’ This post answers ‘how do you force your team to keep using them?’ Seven new nanobench programs, each with a working Godbolt link, covering __builtin_popcountll (24x faster than a bit-count loop), __builtin_unreachable in switch defaults (1.55x), __builtin_bswap64 (same speed as the portable idiom — GCC already folds it), [[likely]] and [[unlikely]] (no measurable effect on tight loops — an honest null result), and a flags matrix showing -ffast-math + -march=x86-64-v3 giving 6.9x over -O2. Then a .clang-tidy config that fails CI on every common perf regression, with the performance-for-range-copy warning demonstrated at 41x real runtime cost.

April 23, 2026 · 21 min · Pavel Guzenfeld

Cleaning Up, Pipelining, and Bake-Testing the STM32H750 Tracker

A sequel to the first STM32H750 tracker post. After the C++ port was proven in production, I spent a week of evenings cutting dead vendor code, splitting the algorithm out to the host for unit tests, wiring the LCD SPI through DMA so the CPU can run the tracker in parallel with the blit, unlocking the camera’s real frame rate, and chasing a subtle bounding-box drift bug back to a too-wide SAD search. The last step was an offline A/B harness that compares SAD, NCC, and MOSSE on four synthetic scenarios, so the next tracker port is a data decision, not a vibes one.

April 23, 2026 · 17 min · Pavel Guzenfeld
Building a Template-Matching Tracker on an STM32H750

Building a Template-Matching Tracker on an STM32H750: What Worked, What Didn't

A long, honest retrospective on turning a WeAct STM32H750 board, a cheap OV7725 IR camera, and an Xbox controller into a live template-matching tracker. Every dead end, every wrong assumption, every fix — and what I’d do differently next time.

April 21, 2026 · 23 min · Pavel Guzenfeld
Pathfinding algorithms for games — A*, JPS, Theta*, flow fields, visibility graphs

Game Pathfinding Algorithms, Benchmarked: A*, JPS, Theta*, Flow Fields, Visibility Graphs

Five pathfinders implemented in C++23, each in a single Godbolt-ready file, benchmarked on the same grids. A* is the baseline. JPS expands 22x fewer nodes on open maps yet runs slower than A* in naive form. Theta* produces shorter any-angle paths at 2-8x the cost. Flow fields dominate when many agents share a goal. Visibility graphs — AoE II DE’s approach — need 5 waypoints where A* needs 600. Plus the StarCraft ‘harvesters ignore collisions’ hack and why SC2 switched to navmeshes.

April 18, 2026 · 15 min · Pavel Guzenfeld
O3DE multi-camera rendering performance analysis

Chasing 18 Milliseconds: A Performance Deep Dive into O3DE's Render Readback Pipeline

We spent a full session systematically profiling O3DE’s multi-camera streaming pipeline, testing eight different optimization approaches, and pinpointed the exact bottleneck: 18 ms of fixed overhead in the AttachmentReadback scope system. Here’s what we tried, what we measured, and what it means for the engine.

April 17, 2026 · 7 min · Pavel Guzenfeld
Three live Godot camera streams over RTP/UDP rendered by GStreamer clients

From Unity to Godot: Multi-Camera Streaming at 50 FPS with Async GPU Readback

After O3DE’s 18 ms frame-graph readback made 30 FPS streaming impossible, we tried Godot. It got us there — eventually. This is the full path from 105 FPS on nothing to 50 FPS per camera with three live RTP streams, including every wrong turn and every underdocumented Godot behavior we hit on the way.

April 17, 2026 · 12 min · Pavel Guzenfeld
O3DE rendering a ground plane from a camera spawned programmatically inside a headless Docker container

From Unity to O3DE: Multi-Camera Streaming at 1080p in a Headless Docker Container

Exploring whether O3DE can replace Unity as the render engine for a drone simulation that streams multiple 1080p camera feeds via GStreamer. From first scaffold to three live RenderToTexture pipelines in a single session.

April 16, 2026 · 6 min · Pavel Guzenfeld
PX4 SIH running on real flight controller hardware with Unity visualization

Running PX4 SIH on Real Hardware: Custom Firmware for In-the-Loop Flight Simulation

Getting PX4’s Simulation-In-Hardware (SIH) module running on a production flight controller: discovering that the firmware doesn’t include SIH, building a custom PX4 with flash trimming, fanning serial MAVLink out to both the simulator and the ground station, and connecting it all to the Unity visualization pipeline.

April 14, 2026 · 16 min · Pavel Guzenfeld