Batched ww-mutexes, wound-wait vs wait-die etc.

Thomas Hellstrom thellstrom at vmware.com
Wed Apr 11 08:27:06 UTC 2018


Hi!

Thinking of adding ww-mutexes also for reservation of vmwgfx resources
(like surfaces), I became a bit worried that doubling the number of locks
taken during command submission wouldn't be a good thing, particularly on
ESX servers, where a high number of virtual machines running graphics on
a multi-core processor would generate a very high number of
processor-locked cycles. The method we use today is to reserve all those
resources under a single mutex. Buffer objects are still using
reservation objects, and hence ww-mutexes, though.

So I figured a "middle way" would be to add batched ww-mutexes, where
the ww-mutex locking state, instead of being manipulated atomically, is
manipulated under a single lock-class-global spinlock. We could then
condense the sometimes 200+ locking operations per command submission
into two: one for lock and one for unlock. Obvious drawbacks are that
taking the spinlock is slightly more expensive than an atomic operation,
and that we could introduce contention for the spinlock where there is
no contention for an atomic operation.
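
To make the idea concrete, here is a minimal userspace sketch (plain C
with pthreads; all names are hypothetical and this is not the kernel
API, nor the test-module code). It assumes all mutexes in a batch
belong to the same lock class, and uses a simple trylock-and-roll-back
scheme; a real implementation would also have to handle waiting and
wound-wait / wait-die resolution:

#include <pthread.h>
#include <stdbool.h>

struct batch_class {
        pthread_spinlock_t lock;   /* one spinlock per lock class */
};

struct batch_mutex {
        struct batch_class *class;
        unsigned long owner;       /* 0 == unlocked; protected by class->lock */
};

/* Take every mutex in the array in one critical section, or none at all. */
static bool batch_trylock_all(struct batch_mutex **m, int n, unsigned long ctx)
{
        int i;

        pthread_spin_lock(&m[0]->class->lock);
        for (i = 0; i < n; i++) {
                if (m[i]->owner) {         /* contended: undo and bail out */
                        while (i--)
                                m[i]->owner = 0;
                        pthread_spin_unlock(&m[0]->class->lock);
                        return false;
                }
                m[i]->owner = ctx;
        }
        pthread_spin_unlock(&m[0]->class->lock);
        return true;
}

static void batch_unlock_all(struct batch_mutex **m, int n)
{
        int i;

        pthread_spin_lock(&m[0]->class->lock);
        for (i = 0; i < n; i++)
                m[i]->owner = 0;
        pthread_spin_unlock(&m[0]->class->lock);
}

With this scheme the number of contended cache-line operations per
command submission is tied to the two spinlock transitions rather than
to the number of objects reserved, which is the effect batching is
after.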

So I set out to test this in practice. After reading up a bit on the
theory, it turned out that the current in-kernel wound-wait
implementation, like TTM once (unknowingly) was, is actually not
wound-wait but wait-die. The correct name would thus be "wait-die
mutexes". Some sources on the net claim that wound-wait is the better
algorithm due to a reduced number of backoffs:

http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
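
For reference, the difference between the two schemes boils down to
which side of a contended lock rolls back. The following is only an
illustrative sketch of the two decision rules, using a transaction
stamp where a lower value means an older transaction (names are made
up, not the kernel API):

enum action { WAIT, BACK_OFF, EVICT_HOLDER };

/*
 * Wait-die: an older waiter is allowed to wait for the lock; a
 * younger waiter "dies", i.e. releases its locks and restarts,
 * keeping its original stamp so it eventually gets through.
 */
static enum action wait_die(unsigned long waiter_stamp,
                            unsigned long holder_stamp)
{
        return waiter_stamp < holder_stamp ? WAIT : BACK_OFF;
}

/*
 * Wound-wait: an older waiter "wounds" the younger holder, forcing
 * it to roll back and hand over the lock; a younger waiter simply
 * waits for the older holder to finish.
 */
static enum action wound_wait(unsigned long waiter_stamp,
                              unsigned long holder_stamp)
{
        return waiter_stamp < holder_stamp ? EVICT_HOLDER : WAIT;
}

In both cases the older transaction never rolls back, which is what
guarantees forward progress; the schemes differ in whether the rollback
is taken voluntarily by the younger waiter (wait-die) or forced onto
the younger holder (wound-wait).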

So I implemented both algorithms in a standalone testing module:

git+ssh://people.freedesktop.org/~thomash/ww_mutex_test

Some preliminary test trends:

1) Testing uncontended sequential command submissions: Batching
ww-mutexes seems to be between 50% and 100% faster than the current
kernel implementation. Still, the kernel implementation performs much
better than I thought.

2) Testing uncontended parallel command submission: Batching ww-mutexes
is slower (by up to 50%) than the current kernel implementation, since
the in-kernel implementation can make use of multi-core parallelism
where the batching implementation sees spinlock contention. This effect
should, however, probably be reduced by a longer command submission
time, which lowers the spinlock contention.

3) Testing contended parallel command submission: Batching is generally
superior, usually by around 50% and sometimes by up to 100%. One of the
reasons could be that batching appears to result in a significantly
lower number of rollbacks.

4) Taking batching locks without actually batching can result in poor
performance.

5) Wound-wait vs wait-die: As predicted, particularly with a low number
of parallel cs threads, wound-wait appears to give a lower number of
rollbacks, but there seem to be no overall locking-time benefits. On
the contrary, as the number of threads exceeds the number of cores,
wound-wait appears to become increasingly more time-consuming than
wait-die. One of the reasons for this might be that wound-wait may see
an increased number of unlocks per rollback. Another is that it is not
trivial to find a good lock to wait for with wound-wait. With wait-die,
the thread rolling back just waits for the contended lock. With
wound-wait, the wounded thread is preempted, and in my implementation I
chose to lazy-preempt at the next blocking lock, so that at least we
have a lock to wait on, even if it's not a relevant lock to trigger a
rollback.
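
To illustrate what I mean by lazy preemption, here is a rough sketch
under made-up names (not the actual test-module code): wounding only
sets a flag on the victim's acquire context, and the victim acts on the
flag the next time it is about to block on a lock, so the subsequent
rollback always has a concrete lock to sleep on before retrying:

#include <errno.h>
#include <stdbool.h>

struct acquire_ctx {
        unsigned long stamp;   /* age; a lower value means an older transaction */
        bool wounded;          /* set by an older, contending waiter */
};

/*
 * Called by an older waiter that finds a younger transaction holding a
 * lock it needs: the holder is only marked, not evicted synchronously.
 * (A real implementation needs proper memory ordering here.)
 */
static void wound(struct acquire_ctx *victim)
{
        victim->wounded = true;
}

/*
 * Called on the slow path, just before the caller would sleep on a
 * contended lock. A nonzero return tells the caller to drop all locks
 * it holds, wait on that contended lock, and then restart the whole
 * acquire sequence with its original stamp.
 */
static int check_lazy_preempt(const struct acquire_ctx *ctx)
{
        return ctx->wounded ? -EDEADLK : 0;
}

The drawback, as noted above, is that the lock we end up sleeping on is
simply the next one that happens to block, which is not necessarily the
lock whose contention caused the wound.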

So this raises a couple of questions:

1) Should we implement an upstream version of batching locks, perhaps as 
a choice on a per-lock-class basis?
2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
Otherwise, perhaps rename them or document that they're actually doing
wait-die.

/Thomas
