Gallium Nine improvement: Reduce overhead of stateblocks
Stateblocks are recordings of states in d3d9. They make it possible to record state changes and apply them later.
A stateblock can cover only a few states or the whole set of states. One particularity of d3d9 is that the content of the recorded states can be changed at any moment, which reduces the potential for optimizations.
In addition, the content of some recorded states can evolve. For example, a stateblock might record that a given texture should be bound to a given slot, but the internal resource of that texture can change afterwards.
Nine has a worker thread that actually applies the states, while the main thread keeps track of the 'advertised states', that is, the state values we need to return if the app requests them (they are also used to replicate the filtering of redundant state changes, which affects some corner-case behaviours). -- off topic: Technically, if a 'Pure' device is requested (most games do this), the main thread is allowed not to remember the states, and we could defer the filtering to the worker thread, but our tests didn't show gains when we tried to implement that. The gain may be bigger if stateblocks can work fully in the worker thread. --
The way Nine implements stateblocks is not very efficient. When a stateblock is applied, all groups of states are first checked (nine_state_copy_common), and those that have something recorded are applied to the set of advertised states. Then (nine_context_apply_stateblock) we go through all the groups of states again and apply each of them individually, appending one call per state to the command queue of the worker thread. Finally, the worker thread runs through the command queue and sets each state individually.
While the work of setting the states is unavoidable (main thread and worker thread), the process of appending the calls to the command queue can be optimized.
One way to improve the situation could be to have a copy of what to apply in the worker thread, and thus go through that only once. States which have internal parts that can change (textures, buffers) will need to be handled as is done currently, but that leaves plenty of states.
But since the states of a stateblock can change value, we have to accommodate the extreme case where a stateblock changes between every application: transmitting this information to the worker thread shouldn't be heavier than sending all the states one by one, as is done today. For example, it is not reasonable to send a copy of the NineStateBlock9 structure every time the stateblock is changed.
Keeping a copy of the NineStateBlock9 structure, or an equivalent, in the worker thread, and sending every potential update to it, is a valid solution and probably the fastest. But it will lead to a lot of code.
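A minimal sketch of this first approach, under loud assumptions: struct sb_shadow, shadow_update and shadow_apply are hypothetical names, not existing Nine code, the state set is reduced to 64 plain integers, and the two functions stand in for the two kinds of message the main thread would send to the worker thread. It is only meant to illustrate why an update costs no more than sending a single state.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES 64  /* hypothetical; the real d3d9 state set is far larger */

/* Hypothetical worker-side copy of one stateblock. */
struct sb_shadow {
    uint32_t value[NUM_STATES];
    uint64_t recorded;  /* bit i set: state i is part of the stateblock */
};

/* Stand-in for the states owned by the worker thread. */
struct worker_ctx {
    uint32_t states[NUM_STATES];
};

/* Message 1: the app changed one recorded state of the stateblock.
 * Same cost as sending that single state to the worker today. */
static void shadow_update(struct sb_shadow *sb, unsigned state, uint32_t value)
{
    assert(state < NUM_STATES);
    sb->value[state] = value;
    sb->recorded |= UINT64_C(1) << state;
}

/* Message 2: apply the stateblock. One message, one pass over the copy.
 * Returns the number of states that were set. */
static unsigned shadow_apply(struct worker_ctx *ctx, const struct sb_shadow *sb)
{
    unsigned applied = 0;
    for (unsigned i = 0; i < NUM_STATES; i++) {
        if (sb->recorded & (UINT64_C(1) << i)) {
            ctx->states[i] = sb->value[i];
            applied++;
        }
    }
    return applied;
}
```

The amount of code mentioned above comes from the fact that every kind of state (render states, sampler states, shader constants, and so on) would need its own update message and its own handling in the shadow copy.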
An alternative could be to record the list of calls to make in the worker thread. Basically, add functionality to record the commands, and then make a single call to the worker thread to replay the recording. One way to implement it would be to measure the size required to store the commands the first time the stateblock is applied. The second time, allocate a buffer of this size and record the commands into it instead of writing them to the worker command queue, then apply the recording. If the states are updated, discard the previous recording. At least this won't be more expensive than what we are doing currently, and we will get gains if the states are never, or only rarely, updated.
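The record-and-replay scheme described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: struct cmd, struct sb_recording, sb_apply and sb_invalidate are hypothetical names, commands are fixed-size (state, value) pairs so the measured size is just a count, and the worker queue is simulated by a counter with immediate execution rather than a real thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_STATES 256  /* hypothetical size of the worker's state table */

/* Hypothetical command: set one state to one value. */
struct cmd {
    uint32_t state;
    uint32_t value;
};

/* Stand-in for the worker thread; a real queue would run asynchronously. */
struct worker {
    uint32_t states[NUM_STATES];
    unsigned queue_submissions;  /* number of entries pushed to the queue */
};

/* Hypothetical per-stateblock recording. */
struct sb_recording {
    struct cmd *cmds;  /* recorded commands, NULL while no recording exists */
    size_t count;      /* number of commands in cmds */
    size_t needed;     /* size measured on the first apply, 0 if unknown */
};

static void worker_exec(struct worker *w, const struct cmd *c)
{
    w->states[c->state % NUM_STATES] = c->value;
}

/* Push every command as its own queue entry (today's behaviour). */
static void submit_one_by_one(struct worker *w, const struct cmd *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        w->queue_submissions++;
        worker_exec(w, &src[i]);
    }
}

/* Must be called whenever the stateblock content changes. */
static void sb_invalidate(struct sb_recording *rec)
{
    free(rec->cmds);
    rec->cmds = NULL;
    rec->count = 0;
    rec->needed = 0;
}

static void sb_apply(struct worker *w, struct sb_recording *rec,
                     const struct cmd *src, size_t n)
{
    if (!rec->cmds) {
        if (rec->needed == 0) {
            /* First apply: only measure the size, submit as today. */
            submit_one_by_one(w, src, n);
            rec->needed = n;
            return;
        }
        /* Second apply: record into a buffer of the measured size. */
        assert(n == rec->needed);  /* holds if sb_invalidate ran on every change */
        rec->cmds = malloc(rec->needed * sizeof(*rec->cmds));
        if (!rec->cmds) {
            submit_one_by_one(w, src, n);  /* allocation failed: fall back */
            return;
        }
        memcpy(rec->cmds, src, n * sizeof(*rec->cmds));
        rec->count = n;
    }
    /* One queue entry replays the whole recording. */
    w->queue_submissions++;
    for (size_t i = 0; i < rec->count; i++)
        worker_exec(w, &rec->cmds[i]);
}
```

In this sketch a stateblock that is modified between every application always takes the first-apply path, so it costs the same as the current code, while a stateblock applied repeatedly without changes needs a single queue entry from the second application onwards.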
Both solutions are valid, and will result in faster stateblocks.