The performance of hardware-accelerated ray tracing on modern GPUs, even with advances in traversal hardware and BVH compression, is fundamentally limited by memory bandwidth. This makes memory traffic, rather than compute, the primary bottleneck in ray tracing and demands data structures and construction algorithms that are explicitly optimized for bandwidth efficiency. We introduce a new block-based representation for wide bounding volume hierarchies (BVHs) that directly targets this bottleneck. Our approach organizes multiple primitives and internal nodes into compact, bandwidth-efficient blocks, reducing the number and cost of memory transactions during traversal. Unlike conventional layouts, our representation enables the merging of both internal and leaf nodes, forming composite bounding volumes that amortize memory accesses across larger portions of the hierarchy. To further align BVH construction with hardware realities, we introduce a memory-centric reformulation of the surface area heuristic (SAH). Rather than modeling traversal cost in terms of compute, our formulation estimates the cost of data movement, yielding a metric that more accurately predicts performance on modern GPUs. Under this model, our merged-node representation achieves substantial reductions in memory traffic. We describe both an optimal construction algorithm and an efficient greedy variant that leverage our representation and bandwidth-driven cost model. Across a range of scenes, our approach reduces acceleration structure size and significantly lowers data movement per ray, resulting in consistently faster rendering. These results demonstrate that treating memory bandwidth as a first-class design constraint leads to more efficient ray tracing acceleration structures.
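To make the idea of a memory-centric cost concrete, the following minimal sketch weights each node by the bytes fetched when it is visited instead of by a compute constant. It is only an illustration under stated assumptions, not the formulation used in this work: the `Node` type, the `bytes_fetched` field, and the function `memory_sah_cost` are hypothetical names, and the visit probability is taken from the standard SAH assumption that a ray hitting the root hits a nested box with probability equal to the ratio of their surface areas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical BVH node used only for this illustration."""
    area: float          # surface area of the node's bounding box
    bytes_fetched: int   # bytes read from memory when the node is visited (assumed)
    children: List["Node"] = field(default_factory=list)


def memory_sah_cost(node: Node, root_area: float) -> float:
    """Expected bytes moved per ray that hits the root.

    Each node contributes P(visit) * bytes_fetched, where P(visit) is
    approximated by the surface-area ratio area / root_area.
    """
    p_visit = node.area / root_area
    own_cost = p_visit * node.bytes_fetched
    return own_cost + sum(memory_sah_cost(c, root_area) for c in node.children)


if __name__ == "__main__":
    # Tiny example tree: a root with two leaves, all stored in 64-byte blocks.
    root = Node(area=10.0, bytes_fetched=64, children=[
        Node(area=5.0, bytes_fetched=64),
        Node(area=2.5, bytes_fetched=64),
    ])
    print(memory_sah_cost(root, root.area))
```

With such a metric, two candidate layouts of the same hierarchy can be compared by their expected bytes per ray rather than by an estimated number of intersection tests.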
Heatmap visualization of total memory access per pixel, comparing a state-of-the-art wide BVH [Vaidyanathan et al. 2022; Ylitie et al. 2017] optimized for GPU ray tracing (the baseline) with our wide BVH with merged nodes. The baseline heatmap contains significantly more red and yellow pixels than ours, particularly around geometric discontinuities, where most memory accesses are generated. Our BVH yields a 15.8% reduction in total memory traffic, a 48.3% reduction in BVH size, and a 31.7% reduction in render time.
Render Times
Render times for five scenes of varying geometric complexity and for different path tracing ray types (primary, reflection, secondary, tertiary). In every case, our BVHs render faster than the state-of-the-art baseline.