Batched ww-mutexes, wound-wait vs wait-die etc.

Thomas Hellstrom thellstrom at vmware.com
Fri Apr 13 20:23:23 UTC 2018


Hi, Daniel,

On 04/13/2018 07:13 PM, Daniel Vetter wrote:
> Hi Thomas,
>
> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>> Hi!
>>
>> Thinking of adding ww-mutexes also for reservation of vmwgfx resources
>> (like surfaces), I became a bit worried that doubling the number of locks
>> taken during command submission wouldn't be a good thing, particularly on
>> ESX servers where a high number of virtual machines running graphics on a
>> multi-core processor would generate a very high number of locked processor
>> cycles. The
>> method we use today is to reserve all those resources under a single mutex.
>> Buffer objects are still using reservation objects and hence ww-mutexes,
>> though.
>>
>> So I figured a "middle way" would be to add batched ww-mutexes, where the
>> ww-mutex locking state, instead of being manipulated atomically,
>> was manipulated under a single lock-class global spinlock. We could then
>> condense the sometimes 200+ locking operations per command submission to
>> two, one for lock and one for unlock. Obvious drawbacks are that taking the
>> spinlock is slightly more expensive than an atomic operation, and that we
>> could introduce contention for the spinlock where there is no contention for
>> an atomic operation.
>>
>> So I set out to test this in practice. After reading up a bit on the theory
>> it turned out that the current in-kernel wound-wait implementation, like
>> TTM once did (unknowingly), actually implements not wound-wait but wait-die.
>> The correct name would thus be "wait-die mutexes". Some sources on the net
>> claim wound-wait is the better algorithm due to a reduced number of backoffs:
> Nice stuff.
>
>> http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
>>
>> So I implemented both algorithms in a standalone testing module:
>>
>> git+ssh://people.freedesktop.org/~thomash/ww_mutex_test
> One issue with real wound-wait is that we can't really implement it in the
> kernel: Kernel context can't be aborted at arbitrary places. It should
> still work (since as soon as it tries to grab another lock it will notice
> it has been wounded, and anyone waiting for a lock will get wounded),
So what I implemented was lazy wound-wait preemption, meaning that if a 
thread gets wounded, it aborts on the next lock it needs to wait on. This 
is safe from a deadlock-avoidance point of view and has an important 
implication: after the abort we wait on the lock we aborted on. If we 
were to just abort without that wait, it may well be that we grab a 
number of locks immediately after the abort and get wounded again, ...and 
again, ...and again by the same contending thread. It would be like 
restarting wait-die without the waiting.
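
To illustrate, here is a minimal userspace sketch of that lazy scheme
(names are made up, pthreads stand in for the kernel primitives, the
wounded flag is left unsynchronized and waking an already-sleeping
wounded transaction is omitted -- so this is an illustration, not the
test-module code):

#include <pthread.h>
#include <stdbool.h>

struct ww_tx {
	unsigned long stamp;	/* lower stamp == older transaction */
	bool wounded;		/* set by an older contender */
};

struct ww_lock {
	pthread_mutex_t ilock;	/* protects owner */
	pthread_cond_t cond;
	struct ww_tx *owner;	/* NULL when unlocked */
};

/* 0 on success, -1 when the caller must roll back all its locks. */
static int ww_lock_acquire(struct ww_lock *lock, struct ww_tx *tx)
{
	pthread_mutex_lock(&lock->ilock);
	for (;;) {
		if (!lock->owner) {
			lock->owner = tx;
			pthread_mutex_unlock(&lock->ilock);
			return 0;
		}
		/* Wound-wait: an older transaction wounds the younger
		 * holder instead of backing off itself. */
		if (tx->stamp < lock->owner->stamp)
			lock->owner->wounded = true;
		/* Lazy preemption: we honour our own wound only here,
		 * when we are about to block. After the rollback the
		 * caller re-waits on this very lock, so we don't
		 * immediately re-grab a bunch of locks and get wounded
		 * again by the same contender. */
		if (tx->wounded) {
			pthread_mutex_unlock(&lock->ilock);
			return -1;
		}
		pthread_cond_wait(&lock->cond, &lock->ilock);
	}
}

static void ww_lock_release(struct ww_lock *lock)
{
	pthread_mutex_lock(&lock->ilock);
	lock->owner = NULL;
	pthread_cond_broadcast(&lock->cond);
	pthread_mutex_unlock(&lock->ilock);
}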


>   but the balance might be
> different in real world tests (which do some real cpu work once locks are
> acquired). Or did increasing CS_UDELAY not change any of that?



>
>> Some preliminary test trends:
>>
>> 1) Testing uncontended sequential command submissions: Batching ww-mutexes
>> seems to be between 50% and 100% faster than the current kernel
>> implementation. Still, the kernel implementation performs much better than
>> I thought.
> The core mutex implementation we share is real fast in the uncontended
> case. Benefitting from all the micro-optimization work was part of the
> reasons for moving the ttm reservations into core code.
>
>> 2) Testing uncontended parallel command submission: Batching ww-mutexes is
>> slower (by up to 50%) than the current kernel implementation, since the
>> in-kernel implementation can make use of multi-core parallelism where the
>> batching implementation sees spinlock contention. This effect should,
>> however, probably be reduced by setting a longer command submission time,
>> which lowers the spinlock contention.
>>
>> 3) Testing contended parallel command submission: Batching is generally
>> superior, usually by around 50% and sometimes by up to 100%. One of the
>> reasons could be that batching appears to result in a significantly lower
>> number of rollbacks.
>>
>> 4) Taking batching locks without actually batching can result in poor
>> performance.
>>
>> 5) Wound-wait vs wait-die: As predicted, particularly with a low number of
>> parallel CS threads, wound-wait appears to give a lower number of
>> rollbacks, but there seems to be no overall locking-time benefit. On the
>> contrary, as the number of threads exceeds the number of cores, wound-wait
>> appears to become increasingly more time-consuming than wait-die. One of the
>> reasons for this might be that wound-wait may see an increased number of
>> unlocks per rollback. Another is that it is not trivial to find a good lock
>> to wait for with wound-wait. With wait-die the thread rolling back just
>> waits for the contended lock. With wound-wait the wounded thread is
>> preempted, and in my implementation I chose to lazily preempt at the next
>> blocking lock, so that we at least have a lock to wait on, even if it's not
>> a relevant lock to trigger a rollback.
>>
>> So this raises a couple of questions:
>>
>> 1) Should we implement an upstream version of batching locks, perhaps as a
>> choice on a per-lock-class basis?
> Not sure. At least for me ww-mutexes were meant to be highly efficient in
> the uncontended case (generally you don't have people trampling all over
> their feet^Wbuffers, or there's bigger problems already). But with the
> guarantee that in contention cases you can't stage a denial of service,
> i.e. there's guaranteed forward progress and later threads can't block
> older threads forever. And given multicores are here to stay, I'd say
> weighting 2) higher than 1) in your experiments makes sense.
>
> If you need arbitrary lock ordering, in parallel, and contended, not sure
> that's even possible to solve in a fast&generic way. My gut feeling would
> be to reorg the algorithm to not have that?
>
> Note: ww-mutex use in struct drm_modeset_lock is kinda different, there we
> only care about the "does not deadlock" part, because generally display
> updates are ratelimited and a little bit of overhead doesn't matter.
>
> I guess your ESX use-case is a bit different, where you care less about
> parallelism within a single guest instance and more about parallelism of
> the overall system. Honestly not sure what to do.
>
>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
>> Otherwise perhaps rename them or document that they're actually doing
>> wait-die.
> I think a doc patch would be good at least. Including all the data you
> assembled here.

Actually, further investigation appears to indicate that manipulating 
the lock state under a local spinlock is about as fast as using atomic 
operations, even in the completely uncontended case.
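
Roughly the idea, as a self-contained userspace sketch (made-up names,
a pthread spinlock standing in for the kernel one, and the contended
fallback to the wait-die/wound-wait slow path left out):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct batch_group {
	pthread_spinlock_t lock;	/* one spinlock per batch group */
};

struct batched_ww_mutex {
	struct batch_group *group;	/* NULL: traditional local locking */
	bool locked;			/* plain field, protected by group->lock */
};

/* Lock a whole submission's worth of objects in one spinlock round-trip.
 * On contention, undo and return false so the caller can fall back to
 * the deadlock-avoidance slow path (not shown). */
static bool batch_lock_all(struct batch_group *grp,
			   struct batched_ww_mutex **m, size_t n)
{
	size_t i;
	bool ok = true;

	pthread_spin_lock(&grp->lock);
	for (i = 0; i < n; i++) {
		if (m[i]->locked) {		/* contended member */
			ok = false;
			while (i--)		/* roll back what we took */
				m[i]->locked = false;
			break;
		}
		m[i]->locked = true;
	}
	pthread_spin_unlock(&grp->lock);
	return ok;
}

static void batch_unlock_all(struct batch_group *grp,
			     struct batched_ww_mutex **m, size_t n)
{
	size_t i;

	pthread_spin_lock(&grp->lock);
	for (i = 0; i < n; i++)
		m[i]->locked = false;
	pthread_spin_unlock(&grp->lock);
}

The point being that locking N uncontended objects costs one spinlock
acquisition instead of N atomic operations.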

This means that we could have a solution where you decide on a per-mutex 
or per-reservation-object basis whether you want to manipulate the lock 
state under a "batch group" spinlock, with one set of performance 
characteristics, or with traditional local locking, with another.

Like, vmwgfx could choose batching locks and radeon traditional locks, but 
the same API would work for both, and locks could be shared between drivers.
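
As a usage sketch on top of the structures above (again purely
hypothetical, not a proposed API):

struct batch_group vmw_group;	/* e.g. one group per vmwgfx device */

static void example_init(struct batched_ww_mutex *vmw_res,
			 struct batched_ww_mutex *radeon_bo)
{
	pthread_spin_init(&vmw_group.lock, PTHREAD_PROCESS_PRIVATE);

	vmw_res->group = &vmw_group;	/* vmwgfx: batched state updates */
	vmw_res->locked = false;

	radeon_bo->group = NULL;	/* radeon: traditional per-lock path,
					 * which the sketch above doesn't
					 * implement */
	radeon_bo->locked = false;
}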

I'll see if I get time to put together an RFC.

/Thomas


> Cheers, Daniel




