[Mesa-dev] Question: st/mesa and context-shareable shaders

Jose Fonseca jfonseca at vmware.com
Mon Sep 28 14:59:49 PDT 2015


On 28/09/15 21:21, Marek Olšák wrote:
> On Mon, Sep 28, 2015 at 9:28 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
>> On 28/09/15 14:51, Marek Olšák wrote:
>>>
>>> On Mon, Sep 28, 2015 at 2:55 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
>>>>
>>>> On 27/09/15 19:14, Marek Olšák wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> For some reason, st/mesa assumes that shaders can't be shared by
>>>>> multiple contexts and therefore has a context pointer in shader keys.
>>>>> I'd like to get rid of this dependency, because there is no reason why
>>>>> shaders would need to be tied to a context. In fact, shaders don't
>>>>> even need a device, they just need a compiler.
>>>>
>>>>
>>>>>
>>>>>
>>>>> This is becoming a bigger issue with latest games that might prefer
>>>>> compiling shaders in another thread using a different GL context.
>>>>>
>>>>> As far as I know, shaders should be context-shareable on all hardware
>>>>> drivers.
>>>>
>>>>
>>>>
>>>> The thing to keep in mind is that, although in theory gallium shaders
>>>> could
>>>> be compiled once and reused by many pipe contexts, in practice many
>>>> drivers
>>>> need to produce shader variants that mix in state from other context
>>>> specific state objects, at draw time.
>>>>
>>>> llvmpipe, svga pipe driver are examples of such drivers that I know.
>>>> Maybe
>>>> amd/nvidia gallium drivers are immune from that problem, but my
>>>> understanding was that very few hardware has internal state objects that
>>>> line up perfectly with gallium state objects, hence that this problem was
>>>> widespread.
>>>>
>>>>
>>>> And because until now shaders were per-context, this means that pipe
>>>> drivers
>>>> can attach the variants to the shader object without any worries about
>>>> multi-threading, as it's guaranteed that no other thread is operating on
>>>> them.
>>>
>>>
>>> I'm well aware of that issue.
>>
>>
>> If you're aware but didn't think worth mention, then you clearly are
>> underestimating the issue.
>>
>> Multi-threading is never easy.  I think it's worth for you to open your mind
>> and think things through more thoroughly.
>
> I thought about it very thoroughly and I'm quite sure about it. See below.
>
>>
>>> You can use a mutex to prevent 2 threads
>>>
>>> from adding 2 shader variants simultaneously into a pipe shader. This
>>> should prevent corrupting the linked list of variants. Looking up the
>>> required shader variant based on states doesn't need locks if you add
>>> an atomic counter containing how many shaders are in the list,
>>> preventing the lookup function from having to read the ending pointer
>>> of the list where new shaders can be added by other threads.
>>>
>>> Drivers also shouldn't attach state pointers to shaders, but instead
>>> they should add their copies there. All in all, this is quite easily
>>> fixable and therefore not a big issue. It certainly doesn't prevent
>>> shaders from becoming screen objects.
>>
>>
>>
>>>>
>>>> So, if you want to make pipe shaders global, you need to ensure that all
>>>> shaders objects manipulations are thread-safe.  In particular, if you
>>>> still
>>>> want to attach variants to the shaders, you need to ensure access to
>>>> mutable
>>>> members are protected with mutex.  Which means you might be forced to
>>>> lock
>>>> mutexes on hot-paths, severely impacting performance (single- and multi-
>>>> thread apps alike).
>>>
>>>
>>> The mutex would only have to be locked when a new shader variant is
>>> added, which is a time-consuming operation by itself (compilation
>>> etc.), so there is no performance concern from adding a mutex.
>>
>>
>>> Other cases don't need the lock.
>>
>> I don't see how that can work.
>>
>> You need to lock, not only when adding variants, but _all_ accesses,
>> including traversing the variants to search for the right one.
>>
>> That is, _every_ time you need to search for a variant in the screen shader,
>> you'll need to lock the shader.  If you don't, a different thread might
>> write to the variant list/hashtable exactly at the same time you're
>> traversing it, invalidating the pointers you're using to iterate over the
>> variants.
>>
>> In other words, you'll be locking mutexes on pretty much every draw call...
>>
>> Personally I think that's a poor design.  I don't know how much is the mutex
>> lock overhead is in practice nowadays, but doing it per draw seems bad,
>> because it could all be avoided with a better design.
>
> No, I don't need the lock while traversing the variants. I don't use a
> hash table or a linked list from util. Our shader state has a pointer
> to the first shader variant. The variant itself has a pointer to the
> next variant. As long as I keep an atomic counter containing the
> number of variants, I will only have to traverse that number of
> variants. The variants itself are immutable. The list itself is
> immutable except for the "next" pointer in the last variant, which I
> will never access without the lock, because the atomic counter tells
> me how many I have.
>
> If the driver doesn't find the variant in the list, it will have to
> acquire the lock, search the list again (see "double-checked locking
> optimization"), if it still can't find it, then it will compile a new
> variant and add it to the list. Then unlock.
>
> This is fairly lockless except for the compilation and even during the
> compilation, it remains lockless for contexts that only use
> already-compiled variants.
>
> However, I do understand that using a more complicated list or hash
> table for the variants requires locking everywhere.

I see.  Reads would be lock-free, but at the expense that variants could 
only be added -- never removed --, right?

Maybe that's perfectly acceptable on most cases, but I wonder if it is 
acceptable in all cases.  Like embedded systems, or when the app is 
running some untrusted code like WebGL implementations.

I still feel that a design that forces drivers down this very narrow 
path is inferior.

>>>>
>>>>
>>>> One could avoid the threading primitives by attaching the variants to the
>>>> context themselves -- but then that's no different from what we do today
>>>> --
>>>> ie, you'd be compiling as often as today.
>>>
>>>
>>> Like I said above, it's easy to add shader variants to screen-based
>>> shaders with a mutex that has no impact on performance.
>>
>>
>>
>> I disagree with your option 1).  But if you're still determined, by all
>> means, do option 2).  I think you'll regret it though: down the road you'll
>> be either struggling with multi-thread correctness or performance.
>>
>>
>>>> I don't feel too strongly either way: maybe the benefits of having
>>>> shaders
>>>> shared by many pipe context are indeed worth the trouble.  But let it be
>>>> no
>>>> doubt: that implies making all operations on shader objects thread safe.
>>>> And that is not a trivial matter.  It's certainly not as easy as just
>>>> coping
>>>> with a null pipe_context argument.
>>>>
>>>>
>>>>> I see only 2 options out of this:
>>>>>
>>>>> 1) Removing the context pointer from the shader keys. If drivers need
>>>>> this, they should handle it by themselves. This will simplify st/mesa,
>>>>> because TCS, TES, GS won't need the variant-tracking machinery.
>>>>>
>>>>> 2) Adding PIPE_CAP_SHAREABLE_SHADERS and if the cap is 1, set the
>>>>> context pointer to NULL in all keys.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Marek
>>>>
>>>>
>>>>
>>>> I think there are other options worth considering:
>>>>
>>>>
>>>> a) I wonder if wouldn't be better the other way around: if drivers want
>>>> to
>>>> avoid compiling the same shader for many contexts, they should maintain a
>>>> shader cache internally that's shared by all contexts.
>>>>
>>>>      As that can be done much more efficient -- there are many ways to
>>>> implement a global cache that avoids frequent locking
>>>>
>>>>      And the way to avoid reimplementing the wheel on every driver would
>>>> be to
>>>> have an helper module that would take care of that -- ie, a generial
>>>> auxiliary module to maintain the shader cache in a thread safe maner.
>>>> That
>>>> module could potentially even learn to (de)serializing the cache from
>>>> disk.
>>>
>>>
>>> This is too complicated for now and not required for having
>>> screen-based shaders.
>>>
>>>>
>>>>
>>>> b) split pipe shader objects in two parts:
>>>>
>>>>      1) a global object, which would allow the pipe drivers to pre-compile
>>>> the
>>>> TGSI into an internal IR
>>>>
>>>>      2) per-context objects, so that pipe drivers could produce variants
>>>> of
>>>> that IR which takes in account the currently bound state
>>>>
>>>>     And for pipe drivers that can't pre-compile anything, 1) would be
>>>> nothing
>>>> more than a duplication of TGSI tokens.
>>>
>>>
>>> I assume most drivers just don't want per-context shaders, so adding
>>> an interface for it would be counterproductive.
>>
>>
>> Maybe I didn't explain.  State tracker can't choose between 1) and 2) -- it
>> needs to create both.
>>
>> That is,
>> - state tracker would create a screen-shader from TGSI tokens (one for the
>> whole share group)
>> - then create a context-shader from the screen-shader, one for each context
>> that shader is used by (so drivers can attach the variants to the
>> per-context shaders without locking)
>
> It's probably better and easier to generate TGSI in glLinkProgram,
> release the GLSL IR, and patch TGSI manually based on the shader key.
> This would eliminate most of the compilation overhead in st/mesa and
> it would be a good cleanup in general.

You're talking about the Mesa shader variants (1 GLSL -> N TGSI), while 
I'm talking about the pipe shader variants (1 TGSI -> N HW-specific 
machine code).

I already tried to explain this twice, yet you still make comments that 
show my message didn't get across.  I don't know how else to explain 
this.  Rob appears to have understood.  If you were serious when you 
asked what others thought could you please read what I proposed with an 
open mind?

Jose


More information about the mesa-dev mailing list