Batched ww-mutexes, wound-wait vs wait-die etc.

Thomas Hellstrom thellstrom at vmware.com
Mon Apr 16 08:23:36 UTC 2018


On 04/14/2018 10:33 AM, Daniel Vetter wrote:
> Hi Thomas,
>
> On Fri, Apr 13, 2018 at 10:23 PM, Thomas Hellstrom
> <thellstrom at vmware.com> wrote:
>> On 04/13/2018 07:13 PM, Daniel Vetter wrote:
>>> On Wed, Apr 11, 2018 at 10:27:06AM +0200, Thomas Hellstrom wrote:
>>>> 2) Should we add a *real* wound-wait choice to our wound-wait mutexes.
>>>> Otherwise perhaps rename them or document that they're actually doing
>>>> wait-die.
>>> I think a doc patch would be good at least. Including all the data you
>>> assembled here.
>>
>> Actually, further investigation appears to indicate that manipulating the
>> lock state under a local spinlock is about as fast as using atomic
>> operations even for the completely uncontended cases.
>>
>> This means that we could have a solution where you decide on a per-mutex or
>> per-reservation object basis whether you want to manipulate lock-state under
>> a "batch group" spinlock, meaning certain performance characteristics or
>> traditional local locking, meaning other performance characteristics.
>>
>> Like, vmwgfx could choose batching locks, radeon traditional locks, but the
>> same API would work for both and locks could be shared between drivers.
> Don't we need to make this decision at least on a per-class level?

No, I was thinking more along the lines of the ww_mutex having a pointer
to the spinlock. It could be either the local mutex "wait_lock" or a
per-batch-group lock. The mutex code wouldn't care. We would need a
special API for batched locking, though, but not for ordinary locking.
Both APIs should be able to handle local or grouped spinlocks.
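
To illustrate, here's a rough, purely hypothetical sketch -- none of
these names exist in the kernel today, and I keep a separate local
spinlock instead of reusing the mutex' own wait_lock just to keep the
example self-contained:

#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/ww_mutex.h>

struct ww_batch_group {
	spinlock_t lock;		/* shared lock-state spinlock */
};

struct ww_mutex_sketch {
	struct mutex base;
	struct ww_acquire_ctx *ctx;
	spinlock_t local_lock;		/* used without a batch group */
	spinlock_t *state_lock;		/* &local_lock or &group->lock */
};

/*
 * Point the mutex at a shared batch-group lock, or back at its own
 * local lock.  Must be done before the mutex is visible to others.
 */
static inline void
ww_mutex_set_batch_group(struct ww_mutex_sketch *lock,
			 struct ww_batch_group *group)
{
	lock->state_lock = group ? &group->lock : &lock->local_lock;
}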

Note that this would of course require that there be no performance
loss for users that don't use batch groups.

I guess the most efficient use for GPU command submission would be to
use per-process batch-groups. Then when the batch encounters a ww_mutex
belonging to a different batch group (for example the display server's
shared surface), it'll just switch batch locks, and this way the
contention for the batch spinlock will be mostly eliminated.
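
Roughly like this, again purely hypothetical and building on the sketch
above (the waiting / wound-wait handling is elided; if we have to sleep
we'd obviously have to drop the spinlock first):

/*
 * Acquire a ww_mutex as part of a batch.  "held" tracks which batch
 * (or local) spinlock the caller currently holds; it is switched only
 * when we cross into a mutex that uses a different one.
 */
static int ww_mutex_lock_batched(struct ww_mutex_sketch *lock,
				 struct ww_acquire_ctx *ctx,
				 spinlock_t **held)
{
	if (*held != lock->state_lock) {
		if (*held)
			spin_unlock(*held);
		spin_lock(lock->state_lock);
		*held = lock->state_lock;
	}

	/* ... manipulate the lock state under *held ... */

	return 0;
}

The command submission path would then walk its buffer list holding a
single *held spinlock that only changes when it hits a foreign buffer.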

>   Or
> how will the spinlock/batch-lock approach interact with the normal
> ww_mutex_lock path (which does require the atomics/ordered stores
> we're trying to avoid)?

We can use the same code, with some extra
if (use_ww_ctx) checks in the common locking and unlocking paths.
Note that the "use_ww_ctx" parameter is a compile-time constant, so the
ordinary mutex path (binary) shouldn't change at all after optimization,
but I need to verify that, of course.
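
Something like the following, modelled loosely on __mutex_lock_common()
(names are made up and the real signatures differ): since use_ww_ctx is
a constant at every call site and the common path is always inlined,
the compiler should be able to drop the ww-specific branches entirely
from the plain mutex build.

static __always_inline int
__lock_common_sketch(struct mutex *lock, struct ww_acquire_ctx *ww_ctx,
		     const bool use_ww_ctx)
{
	if (use_ww_ctx && ww_ctx) {
		/* ww-specific handling, e.g. take the lock-state
		 * spinlock, do the wait-die / wound-wait checks. */
	}
	/* common mutex fast / slow path ... */
	return 0;
}

static int mutex_lock_sketch(struct mutex *lock)
{
	/* The branch above folds away: use_ww_ctx is constant false. */
	return __lock_common_sketch(lock, NULL, false);
}

static int ww_mutex_lock_sketch(struct ww_mutex *lock,
				struct ww_acquire_ctx *ctx)
{
	return __lock_common_sketch(&lock->base, ctx, true);
}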

What you can't do with such a change is to lock / unlock a ww_mutex 
using the standard mutex API, like mutex_lock(&ww_mutex->base), but I 
guess that would be OK?


>
> If we can't mix them I'm kinda leaning towards a
> ww_batch_mutex/ww_batch_acquire_ctx, but exactly matching api
> otherwise. We probably do need the new batch_start/end api, since
> ww_acquire_done isn't quite the right place ...
>
>> I'll see if I get time to put together an RFC.
> Yeah I think there's definitely some use for batched ww locks, where
> parallelism is generally low, or at least the ratio between "time
> spent acquiring locks" and "time spent doing stuff while holding
> locks" small enough to not make the reduced parallelism while
> acquiring an issue.

Yes. At least it's worth bringing up for discussion. The reduced
parallelism shouldn't be an issue if per-process batch-groups are used,
or when, as for vmwgfx, the command submission itself is serialized due
to a single FIFO.

/Thomas


> -Daniel



