Batched ww-mutexes, wound-wait vs wait-die etc.

Daniel Vetter daniel at ffwll.ch
Mon Apr 16 09:19:34 UTC 2018


On Mon, Apr 16, 2018 at 10:23 AM, Thomas Hellstrom
<thellstrom at vmware.com> wrote:
> On 04/14/2018 10:33 AM, Daniel Vetter wrote:
>>
>> Hi Thomas,
>>
>> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
>> <thellstrom at vmware.com> wrote:
>>>
>>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>>>
>>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>>>
>>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>>>> Otherwise perhaps rename them or document that they're actually doing
>>>>> wait-die.
>>>>
>>>> I think a doc patch would be good at least. Including all the data you
>>>> assembled here.
>>>
>>>
>>> Actually, further investigation appears to indicate that manipulating
>>> the lock state under a local spinlock is about as fast as using atomic
>>> operations, even for the completely uncontended cases.
>>>
>>> This means we could have a solution where you decide, on a per-mutex or
>>> per-reservation-object basis, whether you want to manipulate lock state
>>> under a "batch group" spinlock, with certain performance characteristics,
>>> or under traditional local locking, with other performance
>>> characteristics.
>>>
>>> Like, vmwgfx could choose batching locks and radeon traditional locks,
>>> but the same API would work for both and locks could be shared between
>>> drivers.
>>
>> Don't we need to make this decision at least on a per-class level?
>
>
> No, I was thinking more along the lines of the ww_mutex having a pointer
> to the spinlock. It could either be the local mutex "wait_lock" or a
> per-batch-group lock; the mutex code wouldn't care. We do need a special
> API for batched locking, though, but not for ordinary locking.
> Both APIs should be able to handle local or grouped spinlocks.
>
> Note that this would of course require that there is no performance loss
> for users that don't use batch groups.
>
> I guess the most efficient use for GPU command submission would be
> per-process batch-groups. Then, when the batch encounters a ww_mutex with
> a different batch group (for example a surface shared with the display
> server), it just switches batch lock, and this way contention for the
> batch spinlock will be mostly eliminated.
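
If I understand the idea correctly, the shape would be roughly this
(sketch only, all names made up, not actual kernel code):

#include <linux/spinlock.h>

/*
 * The ww_mutex carries a pointer to whichever spinlock protects its
 * wait state: either its own wait_lock, or a lock shared by a whole
 * batch group.  The lock/unlock paths only ever take *state_lock.
 */
struct ww_batch_group {
	spinlock_t lock;		/* shared by every mutex in the group */
};

struct batched_ww_mutex {
	spinlock_t wait_lock;		/* used when there is no batch group */
	spinlock_t *state_lock;		/* &wait_lock or &group->lock */
	/* owner, wait list, ww class, acquire ctx, ... */
};

static inline void batched_ww_mutex_init(struct batched_ww_mutex *lock,
					 struct ww_batch_group *group)
{
	spin_lock_init(&lock->wait_lock);
	lock->state_lock = group ? &group->lock : &lock->wait_lock;
}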

But won't this force us through the spinlock case for all ww_mutex?
The core mutex code goes to extreme lengths to avoid that for the
uncontended fast path. That's what I meant when I said the batched and
non-batched ww_mutex look fundamentally incompatible. Or maybe I missed
something somewhere.
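
For context, that uncontended fast path boils down to a single acquire
cmpxchg on the owner word, roughly like this (illustration only, not the
actual mutex code):

#include <linux/atomic.h>

/*
 * 0 == unlocked; the common case is one atomic op and no spinlock at
 * all, which is exactly what routing every ww_mutex through a
 * batch/wait spinlock would give up.
 */
static inline bool trylock_fast_sketch(atomic_long_t *owner, unsigned long me)
{
	return atomic_long_cmpxchg_acquire(owner, 0UL, me) == 0UL;
}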

>>   Or
>> how will the spinlock/batch-lock approach interact with the normal
>> ww_mutex_lock path (which does require the atomics/ordered stores
>> we're trying to avoid)?
>
>
> We can use the same code with some extra if (use_ww_ctx) checks in the
> common locking and unlocking path.
> Note that the "use_ww_ctx" parameter is defined at compile time, so the
> ordinary mutex path (binary) shouldn't change at all after optimization,
> but I need to verify that, of course.

Agreed, no issue there.
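
For reference, the reason that works is that use_ww_ctx is a
compile-time constant bool at every call site of the always-inlined
common function, so the ww branch should fold away for plain mutex
users. A rough stand-alone illustration (not the actual mutex code,
helper names made up):

#include <linux/mutex.h>
#include <linux/ww_mutex.h>

static __always_inline void lock_common_sketch(struct mutex *lock,
					       struct ww_acquire_ctx *ww_ctx,
					       const bool use_ww_ctx)
{
	if (use_ww_ctx && ww_ctx) {
		/* ww-specific bookkeeping (stamps, wait-die/wound-wait) */
	}
	/* common lock-state manipulation */
}

static void plain_lock_sketch(struct mutex *lock)
{
	/* use_ww_ctx is false at compile time: the ww branch is dead code */
	lock_common_sketch(lock, NULL, false);
}

static void ww_lock_sketch(struct mutex *lock, struct ww_acquire_ctx *ctx)
{
	lock_common_sketch(lock, ctx, true);
}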

> What you can't do with such a change is to lock / unlock a ww_mutex using
> the standard mutex API, like mutex_lock(&ww_mutex->base), but I guess that
> would be OK?

Yeah, right now that works, but I don't care about that. The point of
merging ww_mutex into the core mutex code was that we can fully reuse
__mutex_trylock_fast. Same for the unlock path.

Once we don't share that code anymore, there's imo not that much point
in having ww_mutex interleaved with core mutex code.

>> If we can't mix them, I'm kinda leaning towards a
>> ww_batch_mutex/ww_batch_acquire_ctx, but with an exactly matching API
>> otherwise. We probably do need the new batch_start/end API, since
>> ww_acquire_done isn't quite the right place ...
>>
>>> I'll see if I get time to put together an RFC.
>>
>> Yeah, I think there's definitely some use for batched ww locks, where
>> parallelism is generally low, or at least where the ratio between "time
>> spent acquiring locks" and "time spent doing stuff while holding locks"
>> is small enough that the reduced parallelism while acquiring isn't an
>> issue.
>
>
> Yes. At least it's worth bringing up for discussion. The reduced
> parallelism shouldn't be an issue if per-process batch-groups are used,
> or when, as for vmwgfx, command submission itself is serialized due to a
> single FIFO.
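
To make the ww_batch_* idea quoted above a bit more concrete, the API
shape could look something like this (purely hypothetical, none of
these symbols exist and the names are made up):

struct ww_batch_group;
struct ww_batch_acquire_ctx;
struct ww_batch_mutex;

void ww_batch_acquire_init(struct ww_batch_acquire_ctx *ctx,
			   struct ww_batch_group *group);
void ww_batch_start(struct ww_batch_group *group);
int  ww_batch_mutex_lock(struct ww_batch_mutex *lock,
			 struct ww_batch_acquire_ctx *ctx);
void ww_batch_mutex_unlock(struct ww_batch_mutex *lock);
void ww_batch_end(struct ww_batch_group *group);
void ww_batch_acquire_fini(struct ww_batch_acquire_ctx *ctx);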

Random aside: We do seem to already implement wound semantics (or I'm
terribly confused). See __ww_mutex_add_waiter plus related wakeup code
(in __ww_mutex_lock_check_stamp).
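
For reference, the difference between the two schemes is just how the
stamp comparison is resolved on contention; smaller stamp == older
context. Sketch only, not the kernel code:

enum contend_action {
	CONTEND_WAIT,		/* block until the holder releases */
	CONTEND_BACK_OFF,	/* contender drops its locks and retries */
	CONTEND_WOUND_HOLDER,	/* holder is asked to drop the lock */
};

/* wait-die: the younger contender backs off ("dies") */
static enum contend_action wait_die(unsigned long self, unsigned long holder)
{
	return self < holder ? CONTEND_WAIT : CONTEND_BACK_OFF;
}

/* wound-wait: the older contender evicts ("wounds") the younger holder */
static enum contend_action wound_wait(unsigned long self, unsigned long holder)
{
	return self < holder ? CONTEND_WOUND_HOLDER : CONTEND_WAIT;
}
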
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

