This is a follow-up to “From Unity to O3DE: Multi-Camera Streaming at 1080p”. That post covered the initial migration; this one covers what happened when we tried to make it fast.
We had three cameras rendering at 1920x1080 via O3DE’s RenderToTexture pipeline, each streaming H.264 over UDP through GStreamer. It worked. The problem: 20 FPS. The GPU was 85% idle. Something was burning 50 ms per frame, and it wasn’t rendering.
This post documents the systematic hunt for those milliseconds.
The setup
Three RenderToTexture pipelines in O3DE’s Atom renderer, each with its own View, rendering to separate output images. A GStreamerStreamComponent reads back each camera’s output via AttachmentReadback and pushes frames through appsrc -> nvh264enc -> rtph264pay -> udpsink. Everything runs in a headless Docker container on an RTX 3060 Laptop.
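For orientation, the element chain above can be written as a gst_parse_launch-style description string. This is a minimal sketch, not the component's actual code: the element names come from the post, while the appsrc name `camera_src`, the function name, and the host/port values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Builds a pipeline description for one camera stream.
// "camera_src" is a hypothetical element name; host and port are caller-supplied.
std::string BuildPipelineDescription(const std::string& host, std::uint16_t port)
{
    assert(!host.empty());
    return "appsrc name=camera_src ! nvh264enc ! rtph264pay ! udpsink host=" + host +
           " port=" + std::to_string(static_cast<unsigned>(port));
}
```

With three cameras, one such description per camera (each with its own port) would be passed to the GStreamer parser; caps negotiation for the raw frames is left out here on purpose.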
The baseline
First, establish what costs what:
| Configuration | Frame time | FPS |
|---|---|---|
| Engine + SwapChain only (no RTT) | 31 ms | 32 |
| 3 RTT pipelines, no readback | 32 ms | 31 |
| 3 RTT pipelines + readback | 50 ms | 20 |
The rendering itself is essentially free: three extra full MainPipeline render graphs add only 1 ms. The entire 18 ms overhead comes from the readback path.
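The arithmetic behind the table is simple elimination: measure with and without a feature and subtract. A small sketch of that bookkeeping, with helper names that are ours rather than anything from the engine:

```cpp
#include <cassert>

// Frames per second for a given frame time in milliseconds.
double FpsFromFrameTimeMs(double frameTimeMs)
{
    assert(frameTimeMs > 0.0);
    return 1000.0 / frameTimeMs;
}

// Cost attributed to a feature by elimination:
// frame time with the feature minus frame time without it.
double OverheadMs(double withFeatureMs, double withoutFeatureMs)
{
    return withFeatureMs - withoutFeatureMs;
}
```

Plugging in the table values: `OverheadMs(32.0, 31.0)` gives the 1 ms cost of the three RTT pipelines, and `OverheadMs(50.0, 32.0)` gives the 18 ms attributed to readback.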
Attempt 1: Producer-consumer encode
Hypothesis: The readback callback runs memcpy (8 MB per frame) and gst_app_src_push_buffer on the main thread. Moving that off should help.
Implementation: OnReadbackComplete enqueues a shared_ptr to the pixel data into a lock-free queue. A dedicated AZStd::thread pops items and does the memcpy + GStreamer push.
Result: Main thread CPU dropped from 70% to 21%, but frame time was unchanged at 50 ms. The Vulkan Graphics Queue thread now stood out as the busiest thread at 70% CPU.
Lesson: The encode was never on the critical path. Frame time is set by the slowest stage in the pipeline, and moving work off the main thread only exposed that graphics queue submission was the limiter.
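The shape of that producer-consumer split can be sketched with the standard library alone. Note the substitutions: the post uses a lock-free queue and AZStd::thread, whereas this sketch uses a mutex, a condition variable, and std::thread for brevity. `FrameWorker` and `Frame` are hypothetical names, and the sink stands in for the memcpy plus GStreamer push.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Shared, immutable pixel data handed from the readback callback to the encoder.
using Frame = std::shared_ptr<const std::vector<std::uint8_t>>;

class FrameWorker
{
public:
    explicit FrameWorker(std::function<void(const Frame&)> sink)
        : m_sink(std::move(sink))
        , m_thread([this] { Run(); })
    {
    }

    ~FrameWorker()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_wake.notify_one();
        m_thread.join(); // the worker drains queued frames before exiting
    }

    // Producer side: only enqueues a pointer, no pixel copy.
    void Push(Frame frame)
    {
        assert(frame != nullptr);
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back(std::move(frame));
        }
        m_wake.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            Frame frame;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_wake.wait(lock, [this] { return m_stop || !m_queue.empty(); });
                if (m_queue.empty())
                {
                    return; // stop requested and nothing left to process
                }
                frame = std::move(m_queue.front());
                m_queue.pop_front();
            }
            m_sink(frame); // expensive work runs outside the lock
        }
    }

    std::function<void(const Frame&)> m_sink;
    std::mutex m_mutex;
    std::condition_variable m_wake;
    std::deque<Frame> m_queue;
    bool m_stop = false;
    std::thread m_thread; // declared last so the other members exist before Run() starts
};
```

The queue is unbounded here; a real streaming path would likely cap it and drop stale frames rather than let latency grow.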
Attempt 2: Parallel CompileProducers
Hypothesis: FrameScheduler::CompileProducers iterates all 160 scope producers (40 passes x 4 pipelines) sequentially. Parallelizing them should help.
Implementation: Patched CompileProducers() to dispatch each scope’s CompileResources() to AZ::JobCompletion workers.
Result: No effect. Worker threads showed <1% CPU. Each scope’s CompileResources completes in microseconds – the per-job overhead exceeds the work itself.
Lesson: The bottleneck isn’t in CompileResources. The scope compile is cheap per scope.
Attempt 3: Round-robin readback
Hypothesis: 3 readback scopes per frame cost 3x. Only reading back 1 camera per frame (cycling) should cut the overhead to 1/3.
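Implementation: Guarded each camera’s RequestReadback call with (m_tickCount % m_cameras.size()) == i, where i is the camera’s index, so only one camera requests a readback per tick.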
Result: No effect. Same 50 ms.
Lesson: The overhead isn’t proportional to the number of readbacks per frame. It’s a fixed cost of having readback objects registered in the frame graph.
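The round-robin guard from this attempt, pulled out as a free function for clarity. The function name and parameter names are ours; the modulo condition is the one quoted in the implementation note.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the camera at cameraIndex should request a readback on this tick.
// Exactly one camera is selected per tick, cycling through all of them.
bool ShouldReadBack(std::uint64_t tickCount, std::size_t cameraCount, std::size_t cameraIndex)
{
    assert(cameraCount > 0 && cameraIndex < cameraCount);
    return (tickCount % cameraCount) == cameraIndex;
}
```

With three cameras, each camera is read back every third tick, so its effective stream rate is a third of the render rate.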
Attempt 4: Ephemeral readback
Hypothesis: Persistent AttachmentReadback objects keep their scopes registered in the frame graph across frames, adding overhead even when idle. Creating and destroying them per frame should help.
Implementation: Create a fresh AttachmentReadback in each OnTick, capture the shared_ptr in the callback lambda so it lives until the callback fires, then gets destroyed.
Result: No effect. 50 ms.
Lesson: The scope gets added to the frame graph the moment ReadbackOutput is called and costs its full overhead within that same frame. Object lifetime doesn’t matter.
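The lifetime trick used here (a shared_ptr captured by the completion callback keeps the object alive until the callback has fired) can be illustrated without the engine. Everything in this sketch is a stand-in: `FakeReadback` and `PendingCallbacks` are hypothetical types, not O3DE classes.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct FakeReadback // hypothetical stand-in for a per-frame readback object
{
    int cameraIndex = 0;
};

class PendingCallbacks // hypothetical stand-in for whatever fires completion callbacks
{
public:
    void Add(std::function<void()> callback)
    {
        assert(static_cast<bool>(callback));
        m_callbacks.push_back(std::move(callback));
    }

    void FireAll()
    {
        std::vector<std::function<void()>> ready;
        ready.swap(m_callbacks);
        for (auto& callback : ready)
        {
            callback();
        }
        // 'ready' is destroyed here, releasing whatever the lambdas captured.
    }

private:
    std::vector<std::function<void()>> m_callbacks;
};

// Creates a readback object whose only owner is the pending callback.
// Returns a weak_ptr so the caller can observe when it is destroyed.
std::weak_ptr<FakeReadback> RequestEphemeralReadback(
    PendingCallbacks& pending, int cameraIndex, std::vector<int>& completed)
{
    auto readback = std::make_shared<FakeReadback>();
    readback->cameraIndex = cameraIndex;
    pending.Add([readback, &completed] { completed.push_back(readback->cameraIndex); });
    return readback;
}
```

One caveat worth keeping in mind: if the readback object itself stores the callback, capturing its own shared_ptr forms a reference cycle until the callback is cleared.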
Attempt 5: Copy queue
Hypothesis: The readback scope runs on the Graphics queue. Moving it to the dedicated Transfer queue (NVIDIA RTX 3060 has a separate transfer queue family) should overlap with graphics work.
Implementation: Patched AttachmentReadback’s ScopeProducerFunctionNoData constructor to pass RHI::HardwareQueueClass::Copy instead of Graphics. Verified the Vulkan backend already supports it.
Result: No effect. Still 50 ms, confirmed after recompiling libAtom_RPI.Private.so.
Lesson: The scope is still in the same frame graph regardless of which queue it targets. The frame graph compiler processes all scopes – it doesn’t know or care about the hardware queue class when resolving dependencies and allocating resources.
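For readers unfamiliar with the "dedicated transfer queue" idea: Vulkan exposes queue families with capability flags, and a dedicated transfer family is one that advertises transfer but neither graphics nor compute. The sketch below mirrors that selection with local constants whose values match VkQueueFlagBits; in real code the flags would come from vkGetPhysicalDeviceQueueFamilyProperties. The function name and the -1 sentinel are our own choices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Local mirrors of VkQueueFlagBits values, so the sketch needs no Vulkan headers.
constexpr std::uint32_t kGraphicsBit = 0x1; // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit = 0x2;  // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kTransferBit = 0x4; // VK_QUEUE_TRANSFER_BIT

// Returns the index of the first family that supports transfer but not
// graphics or compute, or -1 if the device has no such family.
int FindDedicatedTransferFamily(const std::vector<std::uint32_t>& familyFlags)
{
    for (std::size_t i = 0; i < familyFlags.size(); ++i)
    {
        const std::uint32_t flags = familyFlags[i];
        const bool transfer = (flags & kTransferBit) != 0;
        const bool graphicsOrCompute = (flags & (kGraphicsBit | kComputeBit)) != 0;
        if (transfer && !graphicsOrCompute)
        {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

As the attempt showed, picking this family only changes where the copy executes on the GPU; it does nothing about CPU-side frame graph work.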
Attempt 6: Persistent RTT output
Hypothesis: The RTT output is a transient image that gets allocated/deallocated each frame. Making it persistent (Imported lifetime with AttachmentImage) might reduce the readback scope’s allocation overhead.
Implementation: Patched RenderToTexturePass::BuildInternal to create the output as AttachmentLifetimeType::Imported with a persistent AttachmentImage from the system attachment pool.
Result: Partial improvement for 1 camera (50 ms -> 40.7 ms, -18.6%). But with 3 cameras, back to 50 ms.
Lesson: The persistent image helps per scope (it eliminates the transient allocation), but at this point the total overhead looked like it scaled with the number of scopes: 3 cameras x 6 ms per scope compile = 18 ms, back to the same cost. Attempt 7 tests that reading directly.
Attempt 7: Batched readback scope
Hypothesis: 3 separate readback scopes = 3x compile overhead. One shared scope that round-robins across cameras should cost 1/3.
Implementation: BatchedReadback class with a single AttachmentReadback instance shared across all cameras. Only one scope registered in the frame graph. Round-robin which camera gets read back each frame.
Result: No effect. 50 ms with a single scope.
Lesson: The 18 ms is the fixed cost of having any AttachmentReadback scope in the frame graph, regardless of count. One scope costs the same as three.
Attempt 8: FrameCaptureRequestBus
Hypothesis: Maybe the public FrameCaptureRequestBus::CapturePassAttachmentWithCallback API has a more efficient internal path.
Result: Same 50 ms. It uses AttachmentReadback internally.
What we actually measured
The GPU fence wait between frames is 4-6 microseconds. The frames-in-flight system (3-frame ring buffer) is already working, so the GPU never blocks the CPU.
Per-thread CPU usage with producer-consumer:
- Main thread: 21% (was 70% before producer-consumer)
- Graphics Queue: 70% (Vulkan command submission)
- Encode thread: 5% (our consumer -- memcpy + GStreamer)
- GStreamer encoders: 7% each (NVENC hardware encoding)
The main thread is free. The GPU is free. The encode is free. The bottleneck is the CommandQueue::ProcessQueue thread doing vkQueueSubmit for all the scopes including the readback scope.
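The post does not say how the per-thread numbers were collected. One way to get them on Linux, shown here as an assumption rather than the method actually used, is to sample /proc/<pid>/task/<tid>/stat twice and compare the utime and stime fields (fields 14 and 15). The sketch parses one stat line and converts a tick delta into a percentage; the struct and function names are ours.

```cpp
#include <cassert>
#include <sstream>
#include <string>

struct ThreadCpuTicks
{
    unsigned long long user = 0;   // utime, in clock ticks
    unsigned long long system = 0; // stime, in clock ticks
    bool valid = false;
};

// Parses utime/stime from one line of /proc/<pid>/task/<tid>/stat.
// The thread name (field 2) may contain spaces, so parsing starts after the last ')'.
ThreadCpuTicks ParseThreadStat(const std::string& statLine)
{
    ThreadCpuTicks ticks;
    const std::size_t close = statLine.rfind(')');
    if (close == std::string::npos)
    {
        return ticks;
    }
    std::istringstream rest(statLine.substr(close + 1));
    std::string skipped;
    for (int i = 0; i < 11; ++i) // fields 3..13: state through cmajflt
    {
        if (!(rest >> skipped))
        {
            return ticks;
        }
    }
    if (rest >> ticks.user >> ticks.system)
    {
        ticks.valid = true;
    }
    return ticks;
}

// CPU usage of one thread between two samples, as a percentage of one core.
// ticksPerSecond would normally come from sysconf(_SC_CLK_TCK).
double CpuPercent(const ThreadCpuTicks& before, const ThreadCpuTicks& after,
                  double wallSeconds, long ticksPerSecond)
{
    assert(wallSeconds > 0.0 && ticksPerSecond > 0);
    const double busyTicks = static_cast<double>(
        (after.user + after.system) - (before.user + before.system));
    return 100.0 * (busyTicks / ticksPerSecond) / wallSeconds;
}
```

A tool such as `top -H` reports the same quantity; doing it in-process makes it easier to log alongside frame times.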
The root cause
AttachmentReadback adds a scope to O3DE’s frame graph via ImportScopeProducer. Every frame, the FrameGraphCompiler processes this scope:
1. Resolves its attachment dependencies against all other scopes
2. Allocates transient resources (staging buffer)
3. Builds the execution schedule
4. Compiles its ShaderResourceGroups
5. Records its command buffer
6. Submits via the graphics queue
Steps 1-4 happen on the main thread during FrameScheduler::Compile. Steps 5-6 happen during FrameScheduler::Execute. The total: 18 ms of fixed overhead, regardless of how many images are copied or which queue is used.
This is fundamental to O3DE’s architecture. The frame graph compiler doesn’t distinguish between a readback scope and a render scope – both get the full compile treatment.
The fix that would work
The copy needs to happen outside the frame graph, after FrameScheduler::EndFrame(). This requires:
- Making the RTT output image persistent (we proved this works – rendering is unaffected)
- After EndFrame(), recording vkCmdCopyImageToBuffer into a command buffer on the dedicated transfer queue
- Using a semaphore to synchronize: graphics queue signals when rendering is done, transfer queue waits before copying
- Mapping the staging buffer on the CPU and pushing pixels to GStreamer
This bypasses the scope system entirely. The estimated cost: DMA copy time (~1 ms for 8 MB at PCIe bandwidth) instead of 18 ms of frame graph compilation.
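The ~1 ms figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch assumes 4 bytes per pixel (which matches the 8 MB per frame quoted earlier) and takes the effective transfer bandwidth as a parameter; the 8 GB/s used in the example below is an illustrative assumption, not a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Size of one uncompressed frame in bytes.
std::uint64_t FrameBytes(std::uint32_t width, std::uint32_t height, std::uint32_t bytesPerPixel)
{
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Idealized copy time in milliseconds at a given sustained bandwidth.
// Ignores submission latency and synchronization overhead.
double EstimateCopyMs(std::uint64_t bytes, double bytesPerSecond)
{
    assert(bytesPerSecond > 0.0);
    return 1000.0 * static_cast<double>(bytes) / bytesPerSecond;
}
```

For a 1920x1080 frame, `FrameBytes(1920, 1080, 4)` is 8,294,400 bytes, and at an assumed 8 GB/s the copy comes out at roughly 1 ms per camera.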
This is an engine-level change – not something achievable from gem/plugin code. The right path is a contribution to O3DE: adding a DirectReadback mode to RenderToTexturePass that copies the output to a persistent staging buffer within the existing render scope’s command list.
What we shipped anyway
Despite not beating the 18 ms wall:
- 3-camera RTT streaming works end-to-end at 20-22 FPS via GStreamer
- Producer-consumer architecture decouples encode from the main thread
- Config-driven – all camera positions, FOVs, ports, encoder settings from YAML
- 6 engine patches documented with measurements for each
- The exact bottleneck identified with profiling data at every level
The profiling methodology – fence timing, per-thread CPU measurement, elimination testing (with/without readback), engine instrumentation – is reusable for any O3DE performance investigation.
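For the engine-instrumentation part, a scoped timer is usually enough to bracket a suspect region and compare runs with and without a feature. This is a generic sketch using std::chrono, not the instrumentation we actually patched into the engine; `ScopedTimer` is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>

// Writes the elapsed wall time of its enclosing scope, in milliseconds,
// into the referenced variable when the scope exits.
class ScopedTimer
{
public:
    explicit ScopedTimer(double& elapsedMs)
        : m_elapsedMs(elapsedMs)
        , m_start(std::chrono::steady_clock::now())
    {
    }

    ~ScopedTimer()
    {
        const auto elapsed = std::chrono::steady_clock::now() - m_start;
        m_elapsedMs = std::chrono::duration<double, std::milli>(elapsed).count();
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    double& m_elapsedMs;
    std::chrono::steady_clock::time_point m_start;
};
```

Wrapping a region such as a compile or execute call in `{ ScopedTimer t(ms); ... }` and logging `ms` per frame gives the with/without numbers needed for elimination testing.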
Numbers
| Configuration | Frame time | FPS | GPU |
|---|---|---|---|
| Baseline (SwapChain only) | 31 ms | 32 | 24% |
| 3 RTT, no readback | 32 ms | 31 | 15% |
| 3 RTT + readback (before) | 50 ms | 20 | 15% |
| 3 RTT + readback (after all opts) | 45-50 ms | 20-22 | 16% |
| Theoretical (bypass scope system) | ~33 ms | ~30 | ~20% |
The gap between 32 ms (no readback) and 50 ms (with readback) is the engineering opportunity. 18 ms of scope compilation overhead, recoverable with a targeted engine change that doesn’t touch the rendering path at all.