[Mesa-dev] [PATCH] mesa: avoid pthread_mutexes when adjusting object refcounts

Sat May 20 10:01:29 UTC 2017

On 20/05/17 19:21, Timothy Arceri wrote:

>
>
> On 26/10/16 02:53, Daniel Vetter wrote:
>> On Wed, Oct 19, 2016 at 03:34:12PM -0700, Ian Romanick wrote:
>>> On 10/19/2016 12:58 PM, Matt Turner wrote:
>>>> On Wed, Oct 19, 2016 at 12:06 PM, Jan Ziak 
>>>> <0xe2.0x9a.0x9b at gmail.com> wrote:
>>>>> This patch removes locks around reference counts in favor of 
>>>>> atomic inc/dec
>>>>> operations on the refcounts. This makes the refcount adjustments 
>>>>> much faster.
>>>> I had the same idea last year and sent a patch series ([PATCH 00/13]
>>>> mesa: Locking improvements and optimizations [1]).
>>>>
>>>> The part that corresponds to your patch are patches 5-9. 1-4, 10, and
>>>> 11 were committed.
>>>>
>>>> I'm having a difficult time remembering why I ultimately dropped the
>>>> series, but Ian's comment on 05/13 indicates that the locking is
>>>> broken for bufferobj.c, so it's not appropriate to use atomics there.
>>>> I remember that texture locking is totally broken, but I cannot
>>>> remember the details.
>>> So... I was trying to trick you into fixing the locking problems.  I 
>>> see
>>> that I have failed. :)
>> Hm, locking, figured I'll jump in with some patterns from the kernel 
>> world
>> ;-)
>>
>>> I looked at this quite a bit more closely today.  Here's one timeline
>>> that worries me:
>>>
>>>      Thread 1                     Thread 2
>>>      Call glDeleteBuffers
>>>                                   Call glNamedBufferData
>>>                                   Call _mesa_lookup_bufferobj_err
>>>      Remove buffer ID from hash
>>>      Decrement refcount
>>>      Free BO backing store
>>>      Delete buffer object
>>>                                   Reference freed memory
>>>
>>> The entire body of glDeleteBuffers holds a different lock (the 
>>> hashtable
>>> lock), so all of that happens atomically. _mesa_lookup_bufferobj_err
>>> does not update the refcount, so bad things can happen.
>>>
>>> I think the right way to fix this is to have every _mesa_lookup_*
>>> function increment the refcount of the object before returning it.  At
>>> that point, every modification to the refcount should happen while the
>>> hashtable mutex is held.  I think that means that neither atomics 
>>> nor an
>>> additional mutex are needed for reference counting.
> I've been doing some work on this but I think there is still a need 
> for atomics when we remove a reference right? We can only drop both 
> the atomic and mutex when increment the reference.

So in other words we should really still have the atomics in place.

>
> By re-using the hash table lock we always know that we won't delete 
> the buffer out from underneath a thread using it, but we still have a 
> race condition between the two threads trying to delete it and could 
> end up with a double free?
> Does that seem right?
>
>
>> Yup, that's the pattern we use in the kernel. One additional 
>> complications
>> is if you have lookup tables which only have a pointer, but do not 
>> hold a
>> full reference on the object (i.e. weak references for caches). In that
>> case you must check (under the lookup table lock) whether the refcount
>> really is greater than 0, and only increment in that case. Atomically of
>> course. That's implemented by the kref_get_unless_zero helper in the
>> kernel.
>>
>> Upshot of this: You can fully untangle locking for lookup tables and
>> caches from refcounting, and you can add a given object to as many 
>> places
>> with weak references as you want. The other usual pattern is to only
>> destroy an object while holding the lookup lock, but that's unecessariyl
>> serializing, and as soon as you have 2+ lookup tables it gets seriously
>> complicated (since you need to hold all the locks).
>>
>>> The other timelime the concerns me is:
>>>
>>>      Thread 1                     Thread 2
>>>      Call glNamedBufferData
>>>      Call _mesa_lookup_bufferobj_err
>>>      Call ctx->Driver.BufferData
>>>          (buffer_data_fallback as an example)
>>>      Free old buffer
>>>                                   Call glNamedBufferData
>>>                                   Call _mesa_lookup_bufferobj_err
>>>                                   Call ctx->Driver.BufferData
>>>                                   Free old buffer
>>>                                       (possibly double-free)
>>>                                   Allocate new buffer
>>>      Allocate new buffer
>>>          (possibly leak the other new buffer)
>>>
>>> To fix this, I think we do want a per-object mutex that is only used in
>>> ctx->Driver.BufferData.  If there is only one context in the share 
>>> group
>>> (a very common case!), we could avoid even that locking.
>> One trick we're using in the kernel for this stuff is rcu, i.e. never
>> change the data structure, but if you update it create a new one, put it
>> into the lookup struct and then unreference the old one. Of course 
>> this is
>> still racy (but in most apis winner-takes-it-all is perfectly fine
>> behaviour) for concurrent updates, but it does fix the double-free or
>> anything else nasty happening pretty cheaply.
>>
>>> Textures and renderbuffers have similar issues.  Textures have
>>> additional problems with derived state.  Calling glTexParameter can
>>> cause a lot of state inside the texture to get updated. Multiple calls
>>> to glTexParameter from multiple threads can result in inconsistent
>>> state.  It also means that rendering from one thread while another
>>> thread is updating state can lead to problems.
>>>
>>> Other drivers solve this problem by having N+1 copies of each object's
>>> state.  Each context gets its own copy, and the share group gets a 
>>> copy.
>>>   Writes to state go to both the local copy and the shared copy.  Reads
>>> come only from the local copy.  glBindTexture updates the local copy
>>> from the shared copy.  The local copies are basically a write-through
>>> cache.  I'm not sure how this works with bindless.  DSA also causes
>>> problems because every DSA command implicitly has a bind.
>> This sounds similar to rcu, but coded more explicitly. I definitely 
>> don't
>> know enough about gl to say what might work (or work better), and how
>> consistent you need to be about concurrent writers.
>>
>> Cheers, Daniel
>>
>>>> The other thing (in 12/13 [2]) is that vertex array objects,
>>>> framebuffer objects, and pipeline objects are "container objects" and
>>>> are not shared between contexts, so they require no locking. It would
>>>> be better to solve that and avoid locking/atomics all together. But
>>>> then FBOs (and renderbuffers?) from EXT_framebuffer_object and VAOs
>>>> from APPLE_vertex_array_object are actually shared between contexts,
>>>> so those are out as well.
>>> If my theories above are correct, we don't need atomics or mutexes for
>>> any of it.
>>>
>>>> That seems to leave sampler objects, shader objects, and program
>>>> objects. I suspect those are relatively small compared with textures
>>>> or buffers.
>>>>
>>>> I'm not really sure what I'd suggest, to be honest. It seems like a
>>>> lot of aggravation for a minimal reward if done right.
>>>>
>>>> [1] 
>>>> https://lists.freedesktop.org/archives/mesa-dev/2015-August/090972.html 
>>>>
>>>> [2] 
>>>> https://lists.freedesktop.org/archives/mesa-dev/2015-August/090984.html 
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev