[Mesa-dev] Question: st/mesa and context-shareable shaders

Marek Olšák maraeo at gmail.com
Mon Sep 28 15:43:39 PDT 2015


On Mon, Sep 28, 2015 at 11:59 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
> On 28/09/15 21:21, Marek Olšák wrote:
>>
>> On Mon, Sep 28, 2015 at 9:28 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
>>>
>>> On 28/09/15 14:51, Marek Olšák wrote:
>>>>
>>>>
>>>> On Mon, Sep 28, 2015 at 2:55 PM, Jose Fonseca <jfonseca at vmware.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 27/09/15 19:14, Marek Olšák wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For some reason, st/mesa assumes that shaders can't be shared by
>>>>>> multiple contexts and therefore has a context pointer in shader keys.
>>>>>> I'd like to get rid of this dependency, because there is no reason why
>>>>>> shaders would need to be tied to a context. In fact, shaders don't
>>>>>> even need a device; they just need a compiler.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> This is becoming a bigger issue with the latest games, which may
>>>>>> prefer compiling shaders in another thread using a different GL
>>>>>> context.
>>>>>>
>>>>>> As far as I know, shaders should be context-shareable on all hardware
>>>>>> drivers.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> The thing to keep in mind is that, although in theory gallium shaders
>>>>> could be compiled once and reused by many pipe contexts, in practice
>>>>> many drivers need to produce shader variants that mix in state from
>>>>> other context-specific state objects at draw time.
>>>>>
>>>>> llvmpipe and the svga pipe driver are examples of such drivers that I
>>>>> know of. Maybe the amd/nvidia gallium drivers are immune from that
>>>>> problem, but my understanding was that very little hardware has
>>>>> internal state objects that line up perfectly with gallium state
>>>>> objects, hence this problem is widespread.
>>>>>
>>>>>
>>>>> And because shaders have been per-context until now, pipe drivers can
>>>>> attach the variants to the shader object without any worries about
>>>>> multi-threading, as it's guaranteed that no other thread is operating
>>>>> on them.
>>>>
>>>>
>>>>
>>>> I'm well aware of that issue.
>>>
>>>
>>>
>>> If you're aware of it but didn't think it worth mentioning, then you're
>>> clearly underestimating the issue.
>>>
>>> Multi-threading is never easy.  I think it's worth opening your mind and
>>> thinking things through more thoroughly.
>>
>>
>> I thought about it very thoroughly and I'm quite sure about it. See below.
>>
>>>
>>>> You can use a mutex to prevent 2 threads from adding shader variants
>>>> to a pipe shader simultaneously. This should prevent corrupting the
>>>> linked list of variants. Looking up the required shader variant based
>>>> on states doesn't need locks if you add an atomic counter containing
>>>> how many variants are in the list, preventing the lookup function from
>>>> having to read the tail pointer of the list, where new variants can be
>>>> added by other threads.
>>>>
>>>> Drivers also shouldn't attach state pointers to shaders; instead, they
>>>> should store copies of the state there. All in all, this is quite
>>>> easily fixable and therefore not a big issue. It certainly doesn't
>>>> prevent shaders from becoming screen objects.
>>>
>>>
>>>
>>>
>>>>>
>>>>> So, if you want to make pipe shaders global, you need to ensure that
>>>>> all shader object manipulations are thread-safe.  In particular, if
>>>>> you still want to attach variants to the shaders, you need to ensure
>>>>> that access to mutable members is protected with a mutex.  Which means
>>>>> you might be forced to lock mutexes on hot paths, severely impacting
>>>>> performance (for single- and multi-threaded apps alike).
>>>>
>>>>
>>>>
>>>> The mutex would only have to be locked when a new shader variant is
>>>> added, which is a time-consuming operation by itself (compilation
>>>> etc.), so there is no performance concern from adding a mutex.
>>>
>>>
>>>
>>>> Other cases don't need the lock.
>>>
>>>
>>> I don't see how that can work.
>>>
>>> You need to lock not only when adding variants but on _all_ accesses,
>>> including traversing the variants to search for the right one.
>>>
>>> That is, _every_ time you need to search for a variant in the screen
>>> shader, you'll need to lock the shader.  If you don't, a different
>>> thread might write to the variant list/hashtable at exactly the same
>>> time you're traversing it, invalidating the pointers you're using to
>>> iterate over the variants.
>>>
>>> In other words, you'll be locking mutexes on pretty much every draw
>>> call...
>>>
>>> Personally I think that's a poor design.  I don't know how much the
>>> mutex lock overhead is in practice nowadays, but taking it per draw
>>> seems bad, because it could all be avoided with a better design.
>>
>>
>> No, I don't need the lock while traversing the variants. I don't use a
>> hash table or a linked list from util. Our shader state has a pointer
>> to the first shader variant. The variant itself has a pointer to the
>> next variant. As long as I keep an atomic counter containing the
>> number of variants, I will only have to traverse that number of
>> variants. The variants themselves are immutable. The list itself is
>> immutable except for the "next" pointer in the last variant, which I
>> will never access without the lock, because the atomic counter tells
>> me how many I have.
>>
>> If the driver doesn't find the variant in the list, it will have to
>> acquire the lock and search the list again (see "double-checked locking
>> optimization"); if it still can't find it, it will compile a new
>> variant and add it to the list. Then unlock.
>>
>> This is fairly lockless except for compilation, and even during
>> compilation it remains lockless for contexts that only use
>> already-compiled variants.
>>
>> However, I do understand that using a more complicated list or hash
>> table for the variants requires locking everywhere.
>
>
> I see.  Reads would be lock-free, but at the expense that variants could
> only be added -- never removed -- right?

Yes, that's right. If drivers only keep final bytecode and shader keys
in memory, the memory footprint should be pretty low.
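
To make that concrete, here is a rough, self-contained sketch of the
lookup/append logic I have in mind. All names are made up for
illustration (this is not actual driver code) and error handling is
omitted:

/* Append-only variant list: lock-free lookups, mutex only for compilation.
 * Illustrative sketch only; names are invented and error handling is
 * omitted. */

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct variant_key {
   unsigned state_bits;          /* stands in for the real shader key */
};

struct shader_variant {
   struct variant_key key;
   void *code;                   /* compiled machine code */
   struct shader_variant *next;  /* only written while holding the mutex */
};

struct shader {
   struct shader_variant *first; /* only written while holding the mutex */
   atomic_uint num_variants;     /* count of fully published variants */
   pthread_mutex_t mutex;        /* taken only when compiling a variant */
};

/* Lock-free lookup: walk only as many nodes as the counter says are
 * published, and never read the "next" pointer of the last published
 * node, since another thread may be writing it under the mutex. */
static struct shader_variant *
find_variant(struct shader *sh, const struct variant_key *key)
{
   unsigned count = atomic_load(&sh->num_variants);
   struct shader_variant *v = sh->first;

   for (unsigned i = 0; i < count; i++) {
      if (memcmp(&v->key, key, sizeof(*key)) == 0)
         return v;
      if (i + 1 < count)
         v = v->next;
   }
   return NULL;
}

/* Stand-in for the driver's real compiler (hypothetical). */
static void *
compile_variant(const struct variant_key *key)
{
   (void)key;
   return malloc(16);            /* pretend this is machine code */
}

static struct shader_variant *
get_variant(struct shader *sh, const struct variant_key *key)
{
   struct shader_variant *v = find_variant(sh, key);
   if (v)
      return v;                  /* fast path, no lock taken */

   pthread_mutex_lock(&sh->mutex);
   /* Double-checked locking: another thread may have added the same
    * variant while we were waiting for the mutex. */
   v = find_variant(sh, key);
   if (!v) {
      v = calloc(1, sizeof(*v));
      v->key = *key;
      v->code = compile_variant(key);

      /* Link the variant at the tail first, then publish it by bumping
       * the counter; the counter update must come last. */
      if (sh->first) {
         struct shader_variant *last = sh->first;
         while (last->next)
            last = last->next;
         last->next = v;
      } else {
         sh->first = v;
      }
      atomic_fetch_add(&sh->num_variants, 1);
   }
   pthread_mutex_unlock(&sh->mutex);
   return v;
}

The key point is that the counter is only incremented after the new
variant is fully linked, so contexts that never take the mutex can never
see a half-initialized entry, and variants are never removed.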

>
> Maybe that's perfectly acceptable in most cases, but I wonder if it is
> acceptable in all cases.  Like on embedded systems, or when the app is
> running untrusted code, as WebGL implementations do.
>
> I still feel that a design that forces drivers down this very narrow path is
> inferior.

OpenGL already forces this on us if we want to compile shaders from
GLSL to native bytecode at link time, without shader variants. The
dependency on the context makes that more difficult, and Unreal Engine
4 already always uses 2 contexts for some reason.

>
>
>>>>>
>>>>>
>>>>> One could avoid the threading primitives by attaching the variants to
>>>>> the contexts themselves -- but then that's no different from what we
>>>>> do today, i.e., you'd be compiling as often as today.
>>>>
>>>>
>>>>
>>>> Like I said above, it's easy to add shader variants to screen-based
>>>> shaders with a mutex that has no impact on performance.
>>>
>>>
>>>
>>>
>>> I disagree with your option 1).  But if you're still determined, by all
>>> means, do option 2).  I think you'll regret it though: down the road
>>> you'll be struggling with either multi-threading correctness or
>>> performance.
>>>
>>>
>>>>> I don't feel too strongly either way: maybe the benefits of having
>>>>> shaders shared by many pipe contexts are indeed worth the trouble.
>>>>> But let there be no doubt: that implies making all operations on
>>>>> shader objects thread-safe.  And that is not a trivial matter.  It's
>>>>> certainly not as easy as just coping with a null pipe_context
>>>>> argument.
>>>>>
>>>>>
>>>>>> I see only 2 options out of this:
>>>>>>
>>>>>> 1) Removing the context pointer from the shader keys. If drivers need
>>>>>> this, they should handle it themselves. This will simplify st/mesa,
>>>>>> because TCS, TES, and GS won't need the variant-tracking machinery.
>>>>>>
>>>>>> 2) Adding PIPE_CAP_SHAREABLE_SHADERS and, if the cap is 1, setting the
>>>>>> context pointer to NULL in all keys.
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Marek
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I think there are other options worth considering:
>>>>>
>>>>>
>>>>> a) I wonder if it wouldn't be better the other way around: if drivers
>>>>> want to avoid compiling the same shader for many contexts, they should
>>>>> maintain a shader cache internally that's shared by all contexts.
>>>>>
>>>>>      That can be done much more efficiently -- there are many ways to
>>>>> implement a global cache that avoids frequent locking.
>>>>>
>>>>>      And the way to avoid reinventing the wheel in every driver would
>>>>> be to have a helper module that takes care of that -- i.e., a general
>>>>> auxiliary module that maintains the shader cache in a thread-safe
>>>>> manner. That module could potentially even learn to (de)serialize the
>>>>> cache to/from disk.
>>>>
>>>>
>>>>
>>>> This is too complicated for now and not required for having
>>>> screen-based shaders.
>>>>
>>>>>
>>>>>
>>>>> b) split pipe shader objects into two parts:
>>>>>
>>>>>      1) a global object, which would allow the pipe drivers to
>>>>> pre-compile the TGSI into an internal IR
>>>>>
>>>>>      2) per-context objects, so that pipe drivers could produce
>>>>> variants of that IR which take into account the currently bound state
>>>>>
>>>>>     And for pipe drivers that can't pre-compile anything, 1) would be
>>>>> nothing more than a duplication of the TGSI tokens.
>>>>
>>>>
>>>>
>>>> I assume most drivers just don't want per-context shaders, so adding
>>>> an interface for it would be counterproductive.
>>>
>>>
>>>
>>> Maybe I didn't explain this well.  The state tracker can't choose between
>>> 1) and 2) -- it needs to create both.
>>>
>>> That is,
>>> - the state tracker would create a screen-shader from the TGSI tokens
>>> (one for the whole share group)
>>> - then create a context-shader from the screen-shader, one for each
>>> context that shader is used by (so drivers can attach the variants to
>>> the per-context shaders without locking)
>>
>>
>> It's probably better and easier to generate TGSI in glLinkProgram,
>> release the GLSL IR, and patch TGSI manually based on the shader key.
>> This would eliminate most of the compilation overhead in st/mesa and
>> it would be a good cleanup in general.
>
>
> You're talking about the Mesa shader variants (1 GLSL -> N TGSI), while I'm
> talking about the pipe shader variants (1 TGSI -> N HW-specific machine
> code).
>
> I already tried to explain this twice, yet you still make comments that show
> my message didn't get across.  I don't know how else to explain this.  Rob
> appears to have understood.  If you were serious when you asked what others
> thought, could you please read what I proposed with an open mind?

Oh, I get it now. I somehow misunderstood that. Your idea is
interesting, though it's not useful for our driver, because we plan to
get rid of shader variants entirely and have only one shader for each
GLSL shader. This doesn't include the DrawPixels and Bitmap shaders or
edgeflag passthrough, which will still need separate variants and
on-demand compilation, but we plan to compile everything else at link
time.
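
For the archive, my understanding of the proposal is roughly the
following split at the interface level. These names are purely
hypothetical and don't exist in gallium; they are only meant to
illustrate the idea:

/* Hypothetical sketch of proposal b); none of these types or functions
 * exist in gallium today, they only illustrate the split. */

struct pipe_screen;
struct pipe_context;
struct pipe_shader_state;

/* 1) Screen-level shader: created once per share group from the TGSI
 *    tokens (or pre-compiled into an internal IR), immutable afterwards,
 *    so any context can use it without locking. */
struct pipe_screen_shader;

struct pipe_screen_shader *
screen_shader_create(struct pipe_screen *screen,
                     const struct pipe_shader_state *state);

/* 2) Context-level shader: wraps a screen shader; a driver would hang
 *    its state-dependent variants off this object, which is only ever
 *    touched by one context, so no locking is needed there either. */
void *
context_shader_create(struct pipe_context *ctx,
                      struct pipe_screen_shader *shader);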

This thread is just about one of the state dependencies that we need
to get rid of, for our driver at least. See my XDC talk or the slides:
http://www.x.org/wiki/Events/XDC2015/Program/olsak_handling_shader_recompile.pdf

Marek

