Batched ww-mutexes, wound-wait vs wait-die etc.
Thomas Hellstrom
thellstrom at vmware.com
Wed Apr 11 08:27:06 UTC 2018
Hi!
While thinking about adding ww-mutexes for reservation of vmwgfx resources
as well (surfaces, for example), I became a bit worried that doubling the
number of locks taken during command submission wouldn't be a good thing,
particularly on ESX servers, where a high number of virtual machines running
graphics on a multi-core processor would initiate a very high number of
processor locked cycles. The method we use today is to reserve all those
resources under a single mutex. Buffer objects still use reservation objects,
and hence ww-mutexes, though.
So I figured a "middle way" would be to add batched ww-mutexes, where
the ww-mutex locking state, instead of being manipulated atomically,
was manipulated under a single lock-class global spinlock. We could then
condense the sometimes 200+ locking operations per command submission to
two, one for lock and one for unlock. Obvious drawbacks are that taking
the spinlock is slightly more expensive than an atomic operation, and
that we could introduce contention for the spinlock where there is no
contention for an atomic operation.
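A minimal user-space sketch of the batching idea described above, not the code from the test module. All names (batch_class, batch_lock, batch_trylock_all, batch_unlock_all) are hypothetical, a pthread mutex stands in for the lock-class global spinlock, and the sketch only covers the try-lock case, leaving out waiting and rollback:

```c
#include <pthread.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the kernel ww_mutex API. */
struct batch_class {
    pthread_mutex_t lock;   /* stand-in for the lock-class global spinlock */
};

struct batch_lock {
    const void *owner;      /* NULL when free; plain field guarded by the class lock */
};

/*
 * Try to take all n locks for ctx under one acquisition of the class lock.
 * Returns -1 on success. On contention, returns the index of the first
 * lock held by another context and takes nothing.
 */
static int batch_trylock_all(struct batch_class *cls, struct batch_lock **locks,
                             size_t n, const void *ctx)
{
    int contended = -1;
    size_t i;

    pthread_mutex_lock(&cls->lock);
    for (i = 0; i < n; i++) {
        if (locks[i]->owner != NULL && locks[i]->owner != ctx) {
            contended = (int)i;
            break;
        }
    }
    if (contended < 0) {
        for (i = 0; i < n; i++)
            locks[i]->owner = ctx;
    }
    pthread_mutex_unlock(&cls->lock);
    return contended;
}

/* Release every lock in the batch that ctx owns, again under one class-lock acquisition. */
static void batch_unlock_all(struct batch_class *cls, struct batch_lock **locks,
                             size_t n, const void *ctx)
{
    size_t i;

    pthread_mutex_lock(&cls->lock);
    for (i = 0; i < n; i++) {
        if (locks[i]->owner == ctx)
            locks[i]->owner = NULL;
    }
    pthread_mutex_unlock(&cls->lock);
}
```

The point of the sketch is only that the per-lock state needs no atomic operations: a whole command submission costs one class-lock round trip to lock and one to unlock, at the price of serializing all contexts of the class on that one lock.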
So I set out to test this in practice. After reading up a bit on the
theory, it turned out that the current in-kernel wound-wait
implementation, like TTM's once was (unknowingly), is actually not wound-wait
but wait-die. The correct name would thus be "wait-die mutexes". Some
sources on the net claim that wound-wait is the better algorithm due to a
reduced number of backoffs:
http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
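For reference, the textbook difference between the two schemes comes down to what happens when a requester finds the lock held by another transaction. The sketch below uses hypothetical names and assumes that a lower stamp means an older transaction; it illustrates the decision rule only, not either implementation:

```c
/* Illustrative names only; a lower stamp means an older transaction. */
enum ww_action {
    WW_WAIT,          /* requester sleeps until the lock is released */
    WW_DIE,           /* requester rolls back (backs off) and retries */
    WW_WOUND_HOLDER   /* requester forces the current holder to roll back */
};

/* Wait-die: an older requester waits, a younger requester backs off. */
static enum ww_action wait_die(unsigned long requester_stamp,
                               unsigned long holder_stamp)
{
    return requester_stamp < holder_stamp ? WW_WAIT : WW_DIE;
}

/* Wound-wait: an older requester wounds the holder, a younger requester waits. */
static enum ww_action wound_wait(unsigned long requester_stamp,
                                 unsigned long holder_stamp)
{
    return requester_stamp < holder_stamp ? WW_WOUND_HOLDER : WW_WAIT;
}
```

In both schemes it is always the younger transaction that rolls back, which is what prevents deadlock; they differ in whether the rollback is triggered by the younger transaction's own request or by an older transaction's request.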
So I implemented both algorithms in a standalone testing module:
git+ssh://people.freedesktop.org/~thomash/ww_mutex_test
Some preliminary test trends:
1) Testing uncontended sequential command submissions: Batching
ww-mutexes seems to be between 50% and 100% faster than the current
kernel implementation. Still, the kernel implementation performs much
better than I thought.
2) Testing uncontended parallel command submission: Batching ww-mutexes is
slower (up to 50%) than the current kernel implementation, since the
in-kernel implementation can make use of multi-core parallelism where
the batching implementation sees spinlock contention. This effect
should, however, probably be reduced with a longer command
submission time, which lowers the spinlock contention.
3) Testing contended parallel command submission: Batching is generally
superior, usually by around 50% and sometimes by up to 100%. One of the reasons
could be that batching appears to result in a significantly lower number
of rollbacks.
4) Taking batching locks without actually batching can result in poor
performance.
5) Wound-Wait vs Wait-Die. As predicted, particularly with a low number
of parallel cs threads, Wound-Wait appears to give a lower number of
rollbacks, but there seem to be no overall locking-time benefits. On
the contrary, as the number of threads exceeds the number of cores,
Wound-Wait appears to become increasingly more time-consuming than
Wait-Die. One of the reasons for this might be that Wound-Wait may see an
increased number of unlocks per rollback. Another is that it is not
trivial to find a good lock to wait for with Wound-Wait. With Wait-Die
the thread rolling back just waits for the contended lock. With
Wound-Wait the wounded thread is preempted, and in my implementation I
chose to lazy-preempt at the next blocking lock, so that at least we
have a lock to wait on, even if it's not a relevant lock to trigger a
rollback.
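The lazy preemption mentioned above could look roughly like the following sketch. It is a guess at the general shape, not the code from the test module: the names (acquire_ctx, wound_if_younger, on_contended_lock) are hypothetical, and a lower stamp is assumed to mean an older transaction. An older contender only sets a flag on the younger holder, and the holder checks that flag when it is about to block on some lock:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical acquire context; a lower stamp means an older transaction. */
struct acquire_ctx {
    unsigned long stamp;
    atomic_bool wounded;    /* set by an older contender, read by the owner */
};

enum contended_action {
    SLEEP_ON_LOCK,              /* not wounded: just wait for the lock */
    ROLLBACK_AND_WAIT_ON_LOCK   /* wounded: drop held locks, then wait here */
};

/* An older requester marks a younger holder as wounded; it does not interrupt it. */
static void wound_if_younger(const struct acquire_ctx *requester,
                             struct acquire_ctx *holder)
{
    if (requester->stamp < holder->stamp)
        atomic_store(&holder->wounded, true);
}

/* Called by ctx only when the next lock it wants is held by someone else. */
static enum contended_action on_contended_lock(struct acquire_ctx *ctx)
{
    if (atomic_load(&ctx->wounded))
        return ROLLBACK_AND_WAIT_ON_LOCK;
    return SLEEP_ON_LOCK;
}
```

A wounded context therefore keeps running until it hits any contended lock, which gives it something to sleep on after rolling back, even though that lock need not be the one the wounding thread is waiting for.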
So this raises a couple of questions:
1) Should we implement an upstream version of batching locks, perhaps as
a choice on a per-lock-class basis?
2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
Otherwise, perhaps rename them or document that they're actually doing
wait-die.
/Thomas