[Mesa-dev] [PATCH] mesa: avoid pthread_mutexes when adjusting object refcounts
Timothy Arceri
t_arceri at yahoo.com.au
Sat May 20 10:01:29 UTC 2017
On 20/05/17 19:21, Timothy Arceri wrote:
>
>
> On 26/10/16 02:53, Daniel Vetter wrote:
>> On Wed, Oct 19, 2016 at 03:34:12PM -0700, Ian Romanick wrote:
>>> On 10/19/2016 12:58 PM, Matt Turner wrote:
>>>> On Wed, Oct 19, 2016 at 12:06 PM, Jan Ziak
>>>> <0xe2.0x9a.0x9b at gmail.com> wrote:
>>>>> This patch removes locks around reference counts in favor of
>>>>> atomic inc/dec
>>>>> operations on the refcounts. This makes the refcount adjustments
>>>>> much faster.
>>>> I had the same idea last year and sent a patch series ([PATCH 00/13]
>>>> mesa: Locking improvements and optimizations [1]).
>>>>
>>>> The part that corresponds to your patch are patches 5-9. 1-4, 10, and
>>>> 11 were committed.
>>>>
>>>> I'm having a difficult time remembering why I ultimately dropped the
>>>> series, but Ian's comment on 05/13 indicates that the locking is
>>>> broken for bufferobj.c, so it's not appropriate to use atomics there.
>>>> I remember that texture locking is totally broken, but I cannot
>>>> remember the details.
>>> So... I was trying to trick you into fixing the locking problems. I
>>> see
>>> that I have failed. :)
>> Hm, locking, figured I'll jump in with some patterns from the kernel
>> world
>> ;-)
>>
>>> I looked at this quite a bit more closely today. Here's one timeline
>>> that worries me:
>>>
>>> Thread 1 Thread 2
>>> Call glDeleteBuffers
>>> Call glNamedBufferData
>>> Call _mesa_lookup_bufferobj_err
>>> Remove buffer ID from hash
>>> Decrement refcount
>>> Free BO backing store
>>> Delete buffer object
>>> Reference freed memory
>>>
>>> The entire body of glDeleteBuffers holds a different lock (the
>>> hashtable
>>> lock), so all of that happens atomically. _mesa_lookup_bufferobj_err
>>> does not update the refcount, so bad things can happen.
>>>
>>> I think the right way to fix this is to have every _mesa_lookup_*
>>> function increment the refcount of the object before returning it. At
>>> that point, every modification to the refcount should happen while the
>>> hashtable mutex is held. I think that means that neither atomics
>>> nor an
>>> additional mutex are needed for reference counting.
> I've been doing some work on this but I think there is still a need
> for atomics when we remove a reference right? We can only drop both
> the atomic and mutex when increment the reference.
So in other words we should really still have the atomics in place.
>
> By re-using the hash table lock we always know that we won't delete
> the buffer out from underneath a thread using it, but we still have a
> race condition between the two threads trying to delete it and could
> end up with a double free?
> Does that seem right?
>
>
>> Yup, that's the pattern we use in the kernel. One additional
>> complications
>> is if you have lookup tables which only have a pointer, but do not
>> hold a
>> full reference on the object (i.e. weak references for caches). In that
>> case you must check (under the lookup table lock) whether the refcount
>> really is greater than 0, and only increment in that case. Atomically of
>> course. That's implemented by the kref_get_unless_zero helper in the
>> kernel.
>>
>> Upshot of this: You can fully untangle locking for lookup tables and
>> caches from refcounting, and you can add a given object to as many
>> places
>> with weak references as you want. The other usual pattern is to only
>> destroy an object while holding the lookup lock, but that's unecessariyl
>> serializing, and as soon as you have 2+ lookup tables it gets seriously
>> complicated (since you need to hold all the locks).
>>
>>> The other timelime the concerns me is:
>>>
>>> Thread 1 Thread 2
>>> Call glNamedBufferData
>>> Call _mesa_lookup_bufferobj_err
>>> Call ctx->Driver.BufferData
>>> (buffer_data_fallback as an example)
>>> Free old buffer
>>> Call glNamedBufferData
>>> Call _mesa_lookup_bufferobj_err
>>> Call ctx->Driver.BufferData
>>> Free old buffer
>>> (possibly double-free)
>>> Allocate new buffer
>>> Allocate new buffer
>>> (possibly leak the other new buffer)
>>>
>>> To fix this, I think we do want a per-object mutex that is only used in
>>> ctx->Driver.BufferData. If there is only one context in the share
>>> group
>>> (a very common case!), we could avoid even that locking.
>> One trick we're using in the kernel for this stuff is rcu, i.e. never
>> change the data structure, but if you update it create a new one, put it
>> into the lookup struct and then unreference the old one. Of course
>> this is
>> still racy (but in most apis winner-takes-it-all is perfectly fine
>> behaviour) for concurrent updates, but it does fix the double-free or
>> anything else nasty happening pretty cheaply.
>>
>>> Textures and renderbuffers have similar issues. Textures have
>>> additional problems with derived state. Calling glTexParameter can
>>> cause a lot of state inside the texture to get updated. Multiple calls
>>> to glTexParameter from multiple threads can result in inconsistent
>>> state. It also means that rendering from one thread while another
>>> thread is updating state can lead to problems.
>>>
>>> Other drivers solve this problem by having N+1 copies of each object's
>>> state. Each context gets its own copy, and the share group gets a
>>> copy.
>>> Writes to state go to both the local copy and the shared copy. Reads
>>> come only from the local copy. glBindTexture updates the local copy
>>> from the shared copy. The local copies are basically a write-through
>>> cache. I'm not sure how this works with bindless. DSA also causes
>>> problems because every DSA command implicitly has a bind.
>> This sounds similar to rcu, but coded more explicitly. I definitely
>> don't
>> know enough about gl to say what might work (or work better), and how
>> consistent you need to be about concurrent writers.
>>
>> Cheers, Daniel
>>
>>>> The other thing (in 12/13 [2]) is that vertex array objects,
>>>> framebuffer objects, and pipeline objects are "container objects" and
>>>> are not shared between contexts, so they require no locking. It would
>>>> be better to solve that and avoid locking/atomics all together. But
>>>> then FBOs (and renderbuffers?) from EXT_framebuffer_object and VAOs
>>>> from APPLE_vertex_array_object are actually shared between contexts,
>>>> so those are out as well.
>>> If my theories above are correct, we don't need atomics or mutexes for
>>> any of it.
>>>
>>>> That seems to leave sampler objects, shader objects, and program
>>>> objects. I suspect those are relatively small compared with textures
>>>> or buffers.
>>>>
>>>> I'm not really sure what I'd suggest, to be honest. It seems like a
>>>> lot of aggravation for a minimal reward if done right.
>>>>
>>>> [1]
>>>> https://lists.freedesktop.org/archives/mesa-dev/2015-August/090972.html
>>>>
>>>> [2]
>>>> https://lists.freedesktop.org/archives/mesa-dev/2015-August/090984.html
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list