[Mesa-dev] [PATCH] mesa: avoid pthread_mutexes when adjusting object refcounts

Tue Oct 25 15:53:48 UTC 2016

On Wed, Oct 19, 2016 at 03:34:12PM -0700, Ian Romanick wrote:
> On 10/19/2016 12:58 PM, Matt Turner wrote:
> > On Wed, Oct 19, 2016 at 12:06 PM, Jan Ziak <0xe2.0x9a.0x9b at gmail.com> wrote:
> >> This patch removes locks around reference counts in favor of atomic inc/dec
> >> operations on the refcounts. This makes the refcount adjustments much faster.
> > 
> > I had the same idea last year and sent a patch series ([PATCH 00/13]
> > mesa: Locking improvements and optimizations [1]).
> > 
> > The part that corresponds to your patch are patches 5-9. 1-4, 10, and
> > 11 were committed.
> > 
> > I'm having a difficult time remembering why I ultimately dropped the
> > series, but Ian's comment on 05/13 indicates that the locking is
> > broken for bufferobj.c, so it's not appropriate to use atomics there.
> > I remember that texture locking is totally broken, but I cannot
> > remember the details.
> 
> So... I was trying to trick you into fixing the locking problems.  I see
> that I have failed. :)

Hm, locking, figured I'll jump in with some patterns from the kernel world
;-)

> I looked at this quite a bit more closely today.  Here's one timeline
> that worries me:
> 
>     Thread 1                     Thread 2
>     Call glDeleteBuffers
>                                  Call glNamedBufferData
>                                  Call _mesa_lookup_bufferobj_err
>     Remove buffer ID from hash
>     Decrement refcount
>     Free BO backing store
>     Delete buffer object
>                                  Reference freed memory
> 
> The entire body of glDeleteBuffers holds a different lock (the hashtable
> lock), so all of that happens atomically.  _mesa_lookup_bufferobj_err
> does not update the refcount, so bad things can happen.
> 
> I think the right way to fix this is to have every _mesa_lookup_*
> function increment the refcount of the object before returning it.  At
> that point, every modification to the refcount should happen while the
> hashtable mutex is held.  I think that means that neither atomics nor an
> additional mutex are needed for reference counting.

Yup, that's the pattern we use in the kernel. One additional complications
is if you have lookup tables which only have a pointer, but do not hold a
full reference on the object (i.e. weak references for caches). In that
case you must check (under the lookup table lock) whether the refcount
really is greater than 0, and only increment in that case. Atomically of
course. That's implemented by the kref_get_unless_zero helper in the
kernel.

Upshot of this: You can fully untangle locking for lookup tables and
caches from refcounting, and you can add a given object to as many places
with weak references as you want. The other usual pattern is to only
destroy an object while holding the lookup lock, but that's unecessariyl
serializing, and as soon as you have 2+ lookup tables it gets seriously
complicated (since you need to hold all the locks).

> The other timelime the concerns me is:
> 
>     Thread 1                     Thread 2
>     Call glNamedBufferData
>     Call _mesa_lookup_bufferobj_err
>     Call ctx->Driver.BufferData
>         (buffer_data_fallback as an example)
>     Free old buffer
>                                  Call glNamedBufferData
>                                  Call _mesa_lookup_bufferobj_err
>                                  Call ctx->Driver.BufferData
>                                  Free old buffer
>                                      (possibly double-free)
>                                  Allocate new buffer
>     Allocate new buffer
>         (possibly leak the other new buffer)
> 
> To fix this, I think we do want a per-object mutex that is only used in
> ctx->Driver.BufferData.  If there is only one context in the share group
> (a very common case!), we could avoid even that locking.

One trick we're using in the kernel for this stuff is rcu, i.e. never
change the data structure, but if you update it create a new one, put it
into the lookup struct and then unreference the old one. Of course this is
still racy (but in most apis winner-takes-it-all is perfectly fine
behaviour) for concurrent updates, but it does fix the double-free or
anything else nasty happening pretty cheaply.

> Textures and renderbuffers have similar issues.  Textures have
> additional problems with derived state.  Calling glTexParameter can
> cause a lot of state inside the texture to get updated.  Multiple calls
> to glTexParameter from multiple threads can result in inconsistent
> state.  It also means that rendering from one thread while another
> thread is updating state can lead to problems.
> 
> Other drivers solve this problem by having N+1 copies of each object's
> state.  Each context gets its own copy, and the share group gets a copy.
>  Writes to state go to both the local copy and the shared copy.  Reads
> come only from the local copy.  glBindTexture updates the local copy
> from the shared copy.  The local copies are basically a write-through
> cache.  I'm not sure how this works with bindless.  DSA also causes
> problems because every DSA command implicitly has a bind.

This sounds similar to rcu, but coded more explicitly. I definitely don't
know enough about gl to say what might work (or work better), and how
consistent you need to be about concurrent writers.

Cheers, Daniel

> > The other thing (in 12/13 [2]) is that vertex array objects,
> > framebuffer objects, and pipeline objects are "container objects" and
> > are not shared between contexts, so they require no locking. It would
> > be better to solve that and avoid locking/atomics all together. But
> > then FBOs (and renderbuffers?) from EXT_framebuffer_object and VAOs
> > from APPLE_vertex_array_object are actually shared between contexts,
> > so those are out as well.
> 
> If my theories above are correct, we don't need atomics or mutexes for
> any of it.
> 
> > That seems to leave sampler objects, shader objects, and program
> > objects. I suspect those are relatively small compared with textures
> > or buffers.
> > 
> > I'm not really sure what I'd suggest, to be honest. It seems like a
> > lot of aggravation for a minimal reward if done right.
> > 
> > [1] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090972.html
> > [2] https://lists.freedesktop.org/archives/mesa-dev/2015-August/090984.html
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch