[Mesa-dev] [PATCH] mesa: Add new fast mtx_t mutex type for basic use cases

Fri Jan 30 02:52:26 PST 2015

On 29/01/15 17:14, Kristian Høgsberg wrote:
> On Thu, Jan 29, 2015 at 6:36 AM, Emil Velikov <emil.l.velikov at gmail.com> wrote:
>> On 28/01/15 05:08, Kristian Høgsberg wrote:
>>> While modern pthread mutexes are very fast, they still incur a call to an
>>> external DSO and overhead of the generality and features of pthread mutexes.
>>> Most mutexes in mesa only needs lock/unlock, and the idea here is that we can
>>> inline the atomic operation and make the fast case just two intructions.
>>> Mutexes are subtle and finicky to implement, so we carefully copy the
>>> implementation from Ulrich Dreppers well-written and well-reviewed paper:
>>>
>>>    "Futexes Are Tricky"
>>>    https://urldefense.proofpoint.com/v2/url?u=http-3A__www.akkadia.org_drepper_futex.pdf&d=AwIGaQ&c=Sqcl0Ez6M0X8aeM67LKIiDJAXVeAw-YihVMNtXt-uEs&r=zfmBZnnVGHeYde45pMKNnVyzeaZbdIqVLprmZCM2zzE&m=NS0xLkqIj43l--WADuy3EQa3yVe4rItSr1sBgtCZJ28&s=jUMBbUUMfsjTAo4ye4aoY9kqeuG10NtNEuSLKRsxPoc&e=
>>>
>>> We implement "mutex3", which gives us a mutex that has no syscalls on
>>> uncontended lock or unlock.  Further, the uncontended case boils down to a
>>> cmpxchg and an untaken branch and the uncontended unlock is just a locked decr
>>> and an untaken branch.  We use __builtin_expect() to indicate that contention
>>> is unlikely so that gcc will put the contention code out of the main code
>>> flow.

I don't oppose the idea of a faster mutex.  But do you have some 
performance figures with this patch?  (It doesn't need to be a real-life 
app -- an artificial demo/benchmark would suffice.

What I'd like to know is, is the performance improvement significant 
enough to at least justify the complexity of maintaining a multiple 
mutex type across our code?

Because I never had the impression that mutexes were a bottleneck. 
Atomic reference counting is probably more of an problem.

>>> A fast mutex only supports lock/unlock, can't be recursive or used with
>>> condition variables.  We keep the pthread mutex implementation around as
>>> full_mtx_t for the few places where we use condition variables or recursive
>>> locking.  For platforms or compilers where futex and atomics aren't available,
>>> mtx_t falls back to the pthread mutex.

>>>
>>> The pthread mutex lock/unlock overhead shows up on benchmarks for CPU bound
>>> applications.  Most CPU bound cases are helped and some of our internal
>>> bind_buffer_object heavy benchmarks gain up to 10%.
>>>
>> Hi Kristian,
>>
>> Can I humbly ask that you split this into two patches - one that
>> introduces the new functions/struct and another one that uses them ?
>> This way it'll be easier if/when things go crazy.
>>
>> Also the patch seems to wonder between posix and win32
>> + typedef full_mtx_t mtx_t;
>>
>> and
>> + typedef mtx_t fast_mtx_t;
>>
>> Looks like a left over from the "should I rename XX variables to fast*
>> or just one to full*" moment :)
>
> Yeah, that's how it progressed :)  At first I called it fast_mtx_t and
> planned on replacing simple uses of mtx_t one by one. Jordan suggested
> that it'd be easier to make the regular mutex fast and then rename the
> couple of places that use more feature than we provide.

I'm however strongly against having a non-standard mutex using a 
standard name like `mtx_t`.

The point of using C11 names for threading primitives was to enable us 
to implement Mesa using standard-looking C code.  The idea was that at 
one point we'd only use our C11 threads.h emulation where needed. 
Please keep in mind that if/when platforms start providing C11 threads.h 
we might be forced to use them instead of our own, as system/3rd party 
headers might start depending on them on their ABIs.

It is imperative that any non-standard mutexes use names that don't 
collide with C11 threads names.

Jose