Batched ww-mutexes, wound-wait vs wait-die etc.

Daniel Vetter daniel at ffwll.ch
Mon Apr 16 12:22:03 UTC 2018


On Mon, Apr 16, 2018 at 2:15 PM, Thomas Hellstrom <thellstrom at vmware.com> wrote:
> On 04/16/2018 11:19 AM, Daniel Vetter wrote:
>>
>> On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom
>> <thellstrom at vmware.com> wrote:
>>>
>>> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>>>
>>>> Hi Thomas,
>>>>
>>>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>>>> <thellstrom at vmware.com> wrote:
>>>>>
>>>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>>>
>>>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes?
>>>>>>> Otherwise perhaps rename them, or document that they're actually doing
>>>>>>> wait-die.
>>>>>>
>>>>>> I think a doc patch would be good at least. Including all the data you
>>>>>> assembled here.
>>>>>
>>>>>
>>>>> Actually, further investigation appears to indicate that manipulating
>>>>> the lock state under a local spinlock is about as fast as using atomic
>>>>> operations, even for the completely uncontended cases.
>>>>>
>>>>> This means that we could have a solution where you decide on a per-mutex
>>>>> or per-reservation-object basis whether you want to manipulate lock state
>>>>> under a "batch group" spinlock, with one set of performance
>>>>> characteristics, or under traditional local locking, with another.
>>>>>
>>>>> Like, vmwgfx could choose batching locks and radeon traditional locks,
>>>>> but the same API would work for both and locks could be shared between
>>>>> drivers.
>>>>
>>>> Don't we need to make this decision at least on a per-class level?
>>>
>>>
>>> No, I was thinking more along the lines of the ww_mutex having a pointer
>>> to the spinlock. It could either be the local mutex "wait_lock" or a
>>> per-batch-group lock; the mutex code wouldn't care. We do need a special
>>> API for batched locking, though, but not for ordinary locking.
>>> Both APIs should be able to handle local or grouped spinlocks.
>>>
>>> Note that this would of course require that there would be no performance
>>> loss for users that don't use batch groups.
>>>
>>> I guess the most efficient use for GPU command submission would be
>>> per-process batch-groups. Then, when the batch encounters a ww_mutex with
>>> a different batch group (for example the display server's shared surface),
>>> it'll just switch batch lock, and this way contention for the batch
>>> spinlock will be mostly eliminated.
>>
>> But won't this force us through the spinlock case for all ww_mutexes?
>> The core mutex code goes to extreme lengths to avoid that for the
>> uncontended fast path. That's what I meant when I said the batched and
>> non-batched ww_mutex look fundamentally incompatible. Or maybe I missed
>> something somewhere.
>
>
> Yes, this would require the assumption to hold that a _local_ spinlock
> path, that is, taking a local spinlock also in the fastpath, would be as
> fast as using the atomic operations directly. And that's what I'm seeing (or
> perhaps a percent or so slower, with 20000 simulated CS'es taking 800
> uncontended locks each), both running sequentially and in parallel. Guess I
> need to verify this on my rpi as well, to make sure it's not an
> Intel-specific thing.
>
> This is of course a prerequisite for the idea to work.
>
> Basically the spinlock fastpath appears to be an inlined atomic_cmpxchg, so
> even theoretically there should not be any noticeable performance loss. I'm
> not sure why the sleeping locks insist on using atomic operations over
> spinlocks, but with the qspinlock implementation (2014-ish, it seems) the
> atomic exchange on spin_unlock was eliminated, and I guess that changed the
> playing field...
>
> With a _shared_ spinlock, like with a batch group, we would see different
> performance characteristics, though.

Hm yeah, the performance characteristics have massively converged there.
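
(For reference: both uncontended fastpaths boil down to essentially the same
single inlined cmpxchg nowadays. Paraphrasing from memory rather than quoting
the actual upstream sources, and with made-up wrapper names:)

/* roughly what __mutex_trylock_fast() does in the uncontended case */
static __always_inline bool mutex_fastpath(struct mutex *lock)
{
	unsigned long curr = (unsigned long)current;

	return atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr) == 0UL;
}

/* roughly what queued_spin_lock() does in the uncontended case */
static __always_inline void spinlock_fastpath(struct qspinlock *lock)
{
	u32 val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);

	if (likely(val == 0))
		return;
	queued_spin_lock_slowpath(lock, val);
}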

Otoh I think spinlocks are still a lot more arch-dependent than struct mutex
locking, so I'm not sure how well this holds across the gazillion other
architectures Linux supports. And since we've moved ww_mutex to be a core
locking construct, I think we should make sure things don't get worse on those
either, even if they have practically no relevance for gpu drivers. ppc might
be the only relevant one besides x86 and arm/arm64.
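
Just to check that I'm parsing the proposal correctly, you mean something
roughly like this? Sketch only; the wait_lock pointer, the batch group struct
and the _batched init/start/end API names are all invented here:

struct ww_batch_group {
	spinlock_t	lock;	/* shared by every ww_mutex in the group */
};

struct ww_mutex {
	struct mutex		base;
	struct ww_acquire_ctx	*ctx;
	spinlock_t		*wait_lock;	/* &local_lock or &group->lock */
	spinlock_t		local_lock;	/* used when there's no group */
};

/* ordinary init: lock state is manipulated under the local spinlock */
void ww_mutex_init(struct ww_mutex *lock, struct ww_class *ww_class);

/* batched init: lock state is manipulated under the group's spinlock, so a
 * whole CS can take/release many of these under a single spinlock
 * acquisition via some ww_mutex_batch_start()/ww_mutex_batch_end() API */
void ww_mutex_init_batched(struct ww_mutex *lock, struct ww_class *ww_class,
			   struct ww_batch_group *group);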

>>>>    Or
>>>> how will the spinlock/batch-lock approach interact with the normal
>>>> ww_mutex_lock path (which does require the atomics/ordered stores
>>>> we're trying to avoid)?
>>>
>>>
>>> We can use the same code with some extra if (use_ww_ctx) checks in the
>>> common locking and unlocking paths.
>>> Note that the "use_ww_ctx" parameter is defined at compile time, so the
>>> ordinary mutex path (the generated binary) shouldn't change at all after
>>> optimization, but I need to verify that, of course.
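
Right, since use_ww_ctx is a constant bool at every call site of the
always-inlined common function, the compiler should be able to drop the ww
branches entirely from the plain mutex path. Roughly this pattern (sketch only
with invented names, not the actual mutex.c code):

static __always_inline int
__lock_common(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
	      const bool use_ww_ctx)
{
	if (use_ww_ctx && ww_ctx) {
		/* ww-only work, e.g. lock-state handling under a group lock */
	}
	/* common locking work */
	return 0;
}

int plain_lock(struct mutex *lock)
{
	/* use_ww_ctx is constant-folded, the branch above disappears */
	return __lock_common(lock, NULL, false);
}

int ww_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{
	return __lock_common(&lock->base, ctx, true);
}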
>>
>> Agreed, no issue there.
>>
>>> What you can't do with such a change is to lock / unlock a ww_mutex using
>>> the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess
>>> that
>>> would be OK?
>>
>> Yeah, right now that works, but I don't care about that. The point of
>> merging ww_mutex into the core mutex was that we can fully reuse
>> __mutex_trylock_fast. Same for the unlock path.
>>
>> Once we don't share that code anymore, there's imo not that much point
>> in having ww_mutex interleaved with core mutex code.
>>
>>>> If we can't mix them I'm kinda leaning towards a
>>>> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
>>>> otherwise. We probably do need the new batch_start/end api, since
>>>> ww_acquire_done isn't quite the right place ...
>>>>
>>>>> I'll see if I get time to put together an RFC.
>>>>
>>>> Yeah I think there's definitely some use for batched ww locks, where
>>>> parallelism is generally low, or at least where the ratio between "time
>>>> spent acquiring locks" and "time spent doing stuff while holding locks"
>>>> is small enough that the reduced parallelism while acquiring isn't an
>>>> issue.
>>>
>>>
>>> Yes. At least it's worth bringing up for discussion. The reduced
>>> parallelism shouldn't be an issue if per-process batch-groups are used, or
>>> if, as for vmwgfx, the command submission itself is already serialized due
>>> to a single FIFO.
>>
>> Random aside: We do seem to already implement wound semantics (or I'm
>> terribly confused). See __ww_mutex_add_waiter plus related wakeup code
>> (in __ww_mutex_lock_check_stamp).
>> -Daniel
>
>
> From what I can see, that's only the wait-die case, that is, processes that
> don't hold the lock kill their own transaction. What's implemented is also
> multi-contender logic, so that if we can deduce that a waiter will need to
> die once another waiter with higher priority becomes the lock holder, we
> wake it up early to kill its transaction. Wound-wait is when we preempt the
> lock holder; that never happens with the current code.

Right. Somehow I'm too dense to keep straight who gets woken up and why, and
I always mix them up. The upstream code only wakes waiters; your code wakes
the current holder (but it'll only notice the next time it tries to lock
another ww_mutex). I guess we could submit at least that change as an RFC to
the ww_mutex code.
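
For my own future reference, the difference boiled down to a sketch (pseudo-C,
not the actual algorithm in kernel/locking/mutex.c; the helpers and the
->wounded flag are invented, and stamp wraparound is ignored):

/* lower stamp == older transaction == higher priority */

/* Wait-die (what ww_mutex implements today): the younger *waiter* backs
 * off, the holder is never disturbed. */
static int wait_die(struct ww_acquire_ctx *waiter, struct ww_acquire_ctx *holder)
{
	if (waiter->stamp > holder->stamp)	/* waiter is younger */
		return -EDEADLK;	/* die: drop all locks, then retry */
	return 0;			/* older waiter simply waits */
}

/* Wound-wait (your proposal): an older waiter wounds the younger *holder*,
 * which backs off, but only notices the wound the next time it goes
 * through ww_mutex_lock(). */
static int wound_wait(struct ww_acquire_ctx *waiter, struct ww_acquire_ctx *holder)
{
	if (waiter->stamp < holder->stamp)	/* waiter is older */
		holder->wounded = true;	/* invented flag */
	return 0;			/* the waiter always just waits */
}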
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

