[Mesa-dev] [PATCH] mesa: avoid pthread_mutexes when adjusting object refcounts

Sat May 20 09:21:52 UTC 2017

On 26/10/16 02:53, Daniel Vetter wrote:
> On Wed, Oct 19, 2016 at 03:34:12PM -0700, Ian Romanick wrote:
>> On 10/19/2016 12:58 PM, Matt Turner wrote:
>>> On Wed, Oct 19, 2016 at 12:06 PM, Jan Ziak <0xe2.0x9a.0x9b at gmail.com> wrote:
>>>> This patch removes locks around reference counts in favor of atomic inc/dec
>>>> operations on the refcounts. This makes the refcount adjustments much faster.
>>> I had the same idea last year and sent a patch series ([PATCH 00/13]
>>> mesa: Locking improvements and optimizations [1]).
>>>
>>> The part that corresponds to your patch are patches 5-9. 1-4, 10, and
>>> 11 were committed.
>>>
>>> I'm having a difficult time remembering why I ultimately dropped the
>>> series, but Ian's comment on 05/13 indicates that the locking is
>>> broken for bufferobj.c, so it's not appropriate to use atomics there.
>>> I remember that texture locking is totally broken, but I cannot
>>> remember the details.
>> So... I was trying to trick you into fixing the locking problems.  I see
>> that I have failed. :)
> Hm, locking, figured I'll jump in with some patterns from the kernel world
> ;-)
>
>> I looked at this quite a bit more closely today.  Here's one timeline
>> that worries me:
>>
>>      Thread 1                     Thread 2
>>      Call glDeleteBuffers
>>                                   Call glNamedBufferData
>>                                   Call _mesa_lookup_bufferobj_err
>>      Remove buffer ID from hash
>>      Decrement refcount
>>      Free BO backing store
>>      Delete buffer object
>>                                   Reference freed memory
>>
>> The entire body of glDeleteBuffers holds a different lock (the hashtable
>> lock), so all of that happens atomically.  _mesa_lookup_bufferobj_err
>> does not update the refcount, so bad things can happen.
>>
>> I think the right way to fix this is to have every _mesa_lookup_*
>> function increment the refcount of the object before returning it.  At
>> that point, every modification to the refcount should happen while the
>> hashtable mutex is held.  I think that means that neither atomics nor an
>> additional mutex are needed for reference counting.
I've been doing some work on this but I think there is still a need for 
atomics when we remove a reference right? We can only drop both the 
atomic and mutex when increment the reference.

By re-using the hash table lock we always know that we won't delete the 
buffer out from underneath a thread using it, but we still have a race 
condition between the two threads trying to delete it and could end up 
with a double free?
Does that seem right?

> Yup, that's the pattern we use in the kernel. One additional complications
> is if you have lookup tables which only have a pointer, but do not hold a
> full reference on the object (i.e. weak references for caches). In that
> case you must check (under the lookup table lock) whether the refcount
> really is greater than 0, and only increment in that case. Atomically of
> course. That's implemented by the kref_get_unless_zero helper in the
> kernel.
>
> Upshot of this: You can fully untangle locking for lookup tables and
> caches from refcounting, and you can add a given object to as many places
> with weak references as you want. The other usual pattern is to only
> destroy an object while holding the lookup lock, but that's unecessariyl
> serializing, and as soon as you have 2+ lookup tables it gets seriously
> complicated (since you need to hold all the locks).
>
>> The other timelime the concerns me is:
>>
>>      Thread 1                     Thread 2
>>      Call glNamedBufferData
>>      Call _mesa_lookup_bufferobj_err
>>      Call ctx->Driver.BufferData
>>          (buffer_data_fallback as an example)
>>      Free old buffer
>>                                   Call glNamedBufferData
>>                                   Call _mesa_lookup_bufferobj_err
>>                                   Call ctx->Driver.BufferData
>>                                   Free old buffer
>>                                       (possibly double-free)
>>                                   Allocate new buffer
>>      Allocate new buffer
>>          (possibly leak the other new buffer)
>>
>> To fix this, I think we do want a per-object mutex that is only used in
>> ctx->Driver.BufferData.  If there is only one context in the share group
>> (a very common case!), we could avoid even that locking.
> One trick we're using in the kernel for this stuff is rcu, i.e. never
> change the data structure, but if you update it create a new one, put it
> into the lookup struct and then unreference the old one. Of course this is
> still racy (but in most apis winner-takes-it-all is perfectly fine
> behaviour) for concurrent updates, but it does fix the double-free or
> anything else nasty happening pretty cheaply.
>
>> Textures and renderbuffers have similar issues.  Textures have
>> additional problems with derived state.  Calling glTexParameter can
>> cause a lot of state inside the texture to get updated.  Multiple calls
>> to glTexParameter from multiple threads can result in inconsistent
>> state.  It also means that rendering from one thread while another
>> thread is updating state can lead to problems.
>>
>> Other drivers solve this problem by having N+1 copies of each object's
>> state.  Each context gets its own copy, and the share group gets a copy.
>>   Writes to state go to both the local copy and the shared copy.  Reads
>> come only from the local copy.  glBindTexture updates the local copy
>> from the shared copy.  The local copies are basically a write-through
>> cache.  I'm not sure how this works with bindless.  DSA also causes
>> problems because every DSA command implicitly has a bind.
> This sounds similar to rcu, but coded more explicitly. I definitely don't
> know enough about gl to say what might work (or work better), and how
> consistent you need to be about concurrent writers.
>
> Cheers, Daniel
>
>>> The other thing (in 12/13 [2]) is that vertex array objects,
>>> framebuffer objects, and pipeline objects are "container objects" and
>>> are not shared between contexts, so they require no locking. It would
>>> be better to solve that and avoid locking/atomics all together. But
>>> then FBOs (and renderbuffers?) from EXT_framebuffer_object and VAOs
>>> from APPLE_vertex_array_object are actually shared between contexts,
>>> so those are out as well.
>> If my theories above are correct, we don't need atomics or mutexes for
>> any of it.
>>
>>> That seems to leave sampler objects, shader objects, and program
>>> objects. I suspect those are relatively small compared with textures
>>> or buffers.
>>>
>>> I'm not really sure what I'd suggest, to be honest. It seems like a
>>> lot of aggravation for a minimal reward if done right.
>>>
>>> [1] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090972.html
>>> [2] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090984.html
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev