nine: Optimizing dynamic systemmem buffers

Axel Davy requested to merge axeldavy/mesa:nine_dynamic_systemmem into master

There are three kinds of buffer locations (pools) in D3D9:
. DEFAULT: On the GPU.
. MANAGED: An intermediate CPU buffer is used for locks, and dirty regions are uploaded to a GPU copy at the first draw call needing the buffer.
. SYSTEMMEM: A CPU buffer.

In addition, the DYNAMIC and WRITEONLY usage flags can be set. DYNAMIC allows the use of the locking flags DISCARD and NOOVERWRITE.

Most apps use static buffers locked once per scene (DEFAULT writeonly or MANAGED) in combination with dynamic buffers filled in a circular fashion with DISCARD and NOOVERWRITE (DEFAULT dynamic writeonly).
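For illustration, the usual application-side pattern for filling a dynamic buffer in a circular fashion can be sketched as below. The two flag values are the ones from d3d9types.h; `struct ring_cursor` and `ring_reserve` are hypothetical names invented for this sketch, not Nine or D3D9 API.

```c
#include <assert.h>
#include <stdint.h>

/* Lock flag values as defined in d3d9types.h. */
#define D3DLOCK_NOOVERWRITE 0x00001000u
#define D3DLOCK_DISCARD     0x00002000u

/* Hypothetical app-side write cursor into one dynamic buffer. */
struct ring_cursor {
    uint32_t capacity; /* total buffer size in bytes */
    uint32_t offset;   /* next unused byte */
};

/* Reserve `size` bytes. Returns the flag the app would pass to Lock()
 * and stores the offset to lock at in *lock_offset. */
static uint32_t ring_reserve(struct ring_cursor *rc, uint32_t size,
                             uint32_t *lock_offset)
{
    uint32_t flag = D3DLOCK_NOOVERWRITE;

    assert(size <= rc->capacity);
    if (size > rc->capacity - rc->offset) {
        /* Not enough room left: wrap to the start and ask for fresh
         * storage, since earlier draws may still read the old content. */
        rc->offset = 0;
        flag = D3DLOCK_DISCARD;
    }
    *lock_offset = rc->offset;
    rc->offset += size;
    return flag;
}
```

With NOOVERWRITE the app promises not to touch data still in use by earlier draw calls, so no synchronization is needed. DISCARD is only issued when wrapping around.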

Some old apps, probably to avoid using GPU memory, use SYSTEMMEM buffers instead. So far Nine had not optimized these kinds of buffers: we stored them with PIPE_USAGE_STAGING, since reads are supposed to be fast, and we ignored locking flags. It works, but it is slow for these apps.

One example is Halo. The app uses SYSTEMMEM buffers with the DYNAMIC flag. The index buffers are locked in a circular fashion (with draw calls after each lock) with the NOOVERWRITE flag (it never discards). It also has various NOOVERWRITE/DISCARD behaviours for its vertex buffers. For example, for one vertex buffer it makes a lot of consecutive locks, each with the DISCARD flag, and only then begins using it for rendering.

It is unclear what DISCARD and NOOVERWRITE are REALLY supposed to mean for SYSTEMMEM dynamic buffers. In D3D7, NOOVERWRITE only means that the affected data is unused for the current frame. It is possible the same behaviour is expected here (in which case we would need to sync relative to the previous frame). DISCARD might only be a hint to reupload the whole buffer at the next draw call. Tests on Windows 10 show different SYSTEMMEM behaviours across the three main vendors, but they all indicate that an intermediate buffer is returned at lock time and that the data is uploaded for the draw call.

EDIT: The patch series is no longer RFC.

The new version implements SYSTEMMEM (non-DYNAMIC takes the DYNAMIC path) by having any lock dirty the whole buffer (to handle writes outside the locked region), and by uploading only what is needed for the draw calls in an efficient fashion (I aggressively try to generate DISCARD/UNSYNCHRONIZED).
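A minimal sketch of that idea, assuming a single tracked dirty range per buffer. All names here (`struct byte_range`, `struct sysmem_buffer`, `buffer_on_lock`, `buffer_upload_for_draw`) are hypothetical and only illustrate the scheme; the actual patches may track dirtiness differently.

```c
#include <stdint.h>

/* Hypothetical half-open byte range [start, end); empty when start >= end. */
struct byte_range {
    uint32_t start, end;
};

struct sysmem_buffer {
    uint32_t size;           /* buffer size in bytes */
    struct byte_range dirty; /* CPU data not yet uploaded to the GPU copy */
};

/* Any lock dirties the whole buffer, because the app may write
 * outside the region it asked to lock. */
static void buffer_on_lock(struct sysmem_buffer *b)
{
    b->dirty.start = 0;
    b->dirty.end = b->size;
}

/* At draw time, return the part of `needed` that is still dirty and
 * must be uploaded, and shrink the dirty range when possible. */
static struct byte_range buffer_upload_for_draw(struct sysmem_buffer *b,
                                                struct byte_range needed)
{
    struct byte_range up;

    up.start = needed.start > b->dirty.start ? needed.start : b->dirty.start;
    up.end = needed.end < b->dirty.end ? needed.end : b->dirty.end;
    if (up.start >= up.end) {
        up.start = up.end = 0; /* nothing to upload */
        return up;
    }
    /* With a single range we can only trim when the upload touches an
     * edge; an upload in the middle leaves the dirty range unchanged
     * (conservative: that part may be uploaded again later). */
    if (up.start == b->dirty.start)
        b->dirty.start = up.end;
    else if (up.end == b->dirty.end)
        b->dirty.end = up.start;
    return up;
}
```

Draw calls that walk forward through the buffer, as in the circular locking pattern, then each upload only the new slice.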

In addition, I make the DEFAULT pool use this path when software vertex processing is required.

As a result, the path is quite optimized, and performance for affected applications is significantly above what it used to be. Affected applications are usually from the start of the Direct3D 9 era, or use Direct3D 8 to Direct3D 9 wrappers.

A user even reported a case where the performance goes from 1-2 fps to 34 fps (with d3d8to9; 26 fps with the same wrapper on Windows, 36 fps on pure d3d8).


Thus this patch series makes SYSTEMMEM use the same code as MANAGED (as they seem pretty similar in behavior). In addition, it detects when locks are done in a circular fashion for DYNAMIC SYSTEMMEM buffers, and in that case uses a DISCARD/NOOVERWRITE pattern for the CPU->GPU upload.

This patch series is RFC because there are a lot of optimizations I am not sure about, and would like some feedback.

=== Flushing in EndScene ===

Draw calls are supposed to be between BeginScene() and EndScene() calls. So far we ignored them. But the EndScene() API doc indicates it is supposed to flush the GPU queue, and it is advised to do it ahead of the call to Present(). Implementing this behavior gives me a +15-20% gain with Halo, even when I disable Nine's internal threading. Maybe some Halo multithreading is involved. I get a small hit (2%) with 3DMark03/3DMark05. The hit is greater if I allow more than 1 flush in EndScene per frame (most games use 1 EndScene, but a few apps use more).
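The "at most one EndScene flush per frame" policy could look roughly like this. `struct frame_state`, `on_end_scene` and `on_present` are hypothetical names for this sketch; the real device code would issue the context flush where the helper returns true.

```c
#include <stdbool.h>

/* Hypothetical per-device bookkeeping for the current frame. */
struct frame_state {
    unsigned end_scene_flushes; /* flushes issued from EndScene this frame */
};

/* Called from EndScene(). Returns true when the caller should flush
 * the GPU queue, which happens only for the first EndScene of a frame. */
static bool on_end_scene(struct frame_state *fs)
{
    if (fs->end_scene_flushes >= 1)
        return false; /* extra flushes cost more than they gain */
    fs->end_scene_flushes++;
    return true;
}

/* Called from Present(): a new frame starts, allow one flush again. */
static void on_present(struct frame_state *fs)
{
    fs->end_scene_flushes = 0;
}
```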

RFC: Is there a way to reduce the hit? More generally, can the overhead of flushing in Present() be reduced? Also, surprisingly, I get a 5% perf boost in Halo if, instead of flushing in EndScene, I add a flush before flush_resource and the HUD call in Present(). This doesn't make sense to me.

=== Optimizing DrawPrimitiveUP ===

DrawPrimitiveUP lets the app pass a CPU pointer to vertex data and draw from it. We upload the data with u_upload. The app makes many calls to DrawPrimitiveUP. The patch series sets the alignment between the allocations to 1 so that they are consecutive, which makes it possible to adjust the starting vertex instead of changing the bound vertex buffer state. For some scenes there is a minor gain, for others none. The same holds if I try to be nice to GTT WC by requiring a 64-byte alignment in u_upload (but then I need to bind the vertex buffer again). How heavy is it to set a new vertex buffer state? Same question for being WC aligned. What do you think makes the most sense here?
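To make the start-vertex trick concrete, here is a sketch under simplifying assumptions: all uploads land in the same underlying buffer, and a binding can be reused whenever the new data sits a whole number of vertices past the bound offset. `struct up_binding` and `up_place_draw` are hypothetical; this is not the u_upload API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the currently bound user-pointer vertex buffer. */
struct up_binding {
    bool bound;
    uint32_t base_offset; /* byte offset the vertex buffer is bound at */
    uint32_t stride;      /* vertex stride of that binding */
};

/* Decide how to draw vertex data uploaded at `upload_offset`.
 * Returns true when the vertex buffer state must be set again;
 * *start_vertex receives the start vertex to use for the draw. */
static bool up_place_draw(struct up_binding *vb, uint32_t upload_offset,
                          uint32_t stride, uint32_t *start_vertex)
{
    if (vb->bound && stride != 0 && vb->stride == stride &&
        upload_offset >= vb->base_offset &&
        (upload_offset - vb->base_offset) % stride == 0) {
        /* Same stride and a whole number of vertices past the bound
         * offset: keep the binding, only move the start vertex. */
        *start_vertex = (upload_offset - vb->base_offset) / stride;
        return false;
    }
    /* Otherwise bind at the new offset and draw from vertex 0. */
    vb->bound = true;
    vb->base_offset = upload_offset;
    vb->stride = stride;
    *start_vertex = 0;
    return true;
}
```

With an alignment of 1, back-to-back uploads of the same vertex format always satisfy the divisibility check. With a 64-byte alignment, padding between uploads usually breaks it for strides that do not divide 64, which forces a rebind.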

=== Optimizing dynamic SYSTEMMEM ===

What do you think of the proposed scheme? Putting the gain of the EndScene flush aside, on a normal scene I get no gain when my GPU dpm is set to low, and 10% when it is set to high. However, when there is smoke (see bug report), the fps drops much less: there is a x2.2 (low) / x3.1 (high) perf gain when smoke appears. Instead of using the MANAGED path for SYSTEMMEM, I also tried:
. Fixing the locking flags of systemmem managed. Convert NOOVERWRITE to UNSYNCHRONIZED, but ignore DISCARD. When NOOVERWRITE is used on the start of the buffer, convert it to DISCARD. This is obviously hackish and will not work for other games, but performance is good. At the end of this patch series, I get the same performance as with this solution (and even slightly better in some scenes).
. Applying the above, but storing SYSTEMMEM in STREAM or DEFAULT instead of STAGING. This gives me a significant performance loss. So either the app is reading some buffers, or the locking pattern is not WC friendly.

This patch series tries to upload GTT WC aligned blocks, but it doesn't seem to bring any perf gain. Any idea why?
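For reference, widening an upload range to write-combining-friendly blocks can be sketched as follows. The 64-byte block size is taken from the alignment mentioned for u_upload above; `WC_BLOCK` and `wc_align_range` are hypothetical names, and the real patches may align differently.

```c
#include <stdint.h>

/* Assumed write-combining block size in bytes (must be a power of two). */
#define WC_BLOCK 64u

/* Widen the half-open byte range [*start, *end) to WC_BLOCK boundaries,
 * clamping the end to the buffer size. */
static void wc_align_range(uint32_t buf_size, uint32_t *start, uint32_t *end)
{
    /* Round up in 64 bits so a range near UINT32_MAX cannot wrap. */
    uint64_t aligned_end =
        ((uint64_t)*end + WC_BLOCK - 1) & ~(uint64_t)(WC_BLOCK - 1);

    *start &= ~(WC_BLOCK - 1); /* round the start down */
    *end = aligned_end > buf_size ? buf_size : (uint32_t)aligned_end;
}
```

The widened range may cover bytes that were not dirty, so it is only valid when the CPU copy holds the current content of the whole buffer, as it does for SYSTEMMEM.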

Edited by Axel Davy
