[Mesa-dev] TGSI and Tessellation Control Shader outputs

Ilia Mirkin imirkin at alum.mit.edu
Tue Sep 16 08:42:24 PDT 2014


OK, so just to summarize:

The approach suggested by Roland is to have the outputs be
one-dimensional and only representing the current invocation's
per-vertex outputs. Each invocation would also get access to other
invocations' per-vertex outputs via a 2d input array.

So a shader might look something like

TESSC
DCL IN[][0], POSITION /* input patch's per-vertex position */
DCL IN[][1], GENERIC /* input patch's per-vertex generic attribute */
DCL IN[][2], TCS_POSITION /* output patch's per-vertex position */
DCL IN[][3], TCS_OUTPUT /* output patch's per-vertex generic attribute */
DCL OUT[0], POSITION
DCL OUT[1], GENERIC
DCL OUT[2], PATCH

And then anything written to OUT[0] would be aliased via IN[][2].
Roland, does that sound right? It seems kinda nasty that there are
going to be 2 types of position/pointsize/clipdistance inputs -- do
you have a better suggestion for handling that?
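
A rough Python model of that layout may make the aliasing easier to see. It is a hypothetical illustration, not Mesa or TGSI code, and all names in it are made up. One backing store per output patch is written through the 1D OUT[slot] view, implicitly indexed by the invocation id, and read through the 2D input view. Running the two phases back to back stands in for BARRIER, and reading a slot nobody has written yet raises an error instead of silently returning garbage.

```python
# Hypothetical model of the layout above; none of these names exist in Mesa.
# One backing store per output patch: written through the 1D OUT[slot] view
# (vertex implicitly = invocation id), read through the 2D IN[vertex][...]
# view.  Slot 1 is the GENERIC output; in the declarations above it is
# assumed to be visible to other invocations as IN[][3] (TCS_OUTPUT).


class OutputPatch:
    def __init__(self, num_vertices, num_slots):
        self.data = [[None] * num_slots for _ in range(num_vertices)]

    def store_out(self, invocation_id, slot, value):
        """MOV OUT[slot], value: only the current invocation's vertex."""
        self.data[invocation_id][slot] = value

    def load_in(self, vertex, slot):
        """MOV TEMP, IN[vertex][...]: any vertex of the output patch."""
        value = self.data[vertex][slot]
        if value is None:
            raise ValueError(f"vertex {vertex} slot {slot} read before written")
        return value


def run_patch(num_vertices, barrier=True):
    patch = OutputPatch(num_vertices, num_slots=2)
    if barrier:
        # Phase 1: every invocation writes its own GENERIC output first.
        for inv in range(num_vertices):
            patch.store_out(inv, 1, inv * 10)
        # BARRIER: finishing phase 1 before phase 2 stands in for it.
        # Phase 2: each invocation adds its neighbour's output to its own.
        return [patch.load_in(inv, 1) + patch.load_in((inv + 1) % num_vertices, 1)
                for inv in range(num_vertices)]
    # Without the barrier (modelled as running each invocation to completion
    # in turn), invocation 0 reads vertex 1 before it has been written.
    results = []
    for inv in range(num_vertices):
        patch.store_out(inv, 1, inv * 10)
        results.append(patch.load_in(inv, 1) +
                       patch.load_in((inv + 1) % num_vertices, 1))
    return results


print(run_patch(4))  # [10, 30, 50, 30]
```

Calling run_patch(4, barrier=False) raises ValueError, since invocation 0 reads vertex 1's output before invocation 1 has stored it.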

I still sorta like my hacky idea of declaring the outputs
1-dimensionally, but then allowing them as sources and reading them as
though they were 2d... Although that still doesn't deal with a
foo[gl_InvocationID] += 1 type of situation very gracefully.
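
The foo[gl_InvocationID] += 1 cost can be sketched the same way. This is again a hypothetical Python model, not Mesa code, and the loads counter is an assumed stand-in for the output load instruction. With OUT declared 1D but read as 2D, a straightforward lowering turns the += 1 into a load of the invocation's own vertex, while a lowering that remembers the TEMP it just stored avoids the load. That forwarding is only safe because no other invocation may write this vertex's per-vertex outputs.

```python
# Hypothetical model of the "declare OUT 1D, read it as 2D" idea.
# As a destination, OUT[slot] means the current invocation's vertex;
# as a source it is read as OUT[vertex][slot].  self.loads counts the
# (presumably expensive) output loads a lowering would emit.


class OutputRegs:
    def __init__(self, num_vertices, num_slots):
        self.data = [[0] * num_slots for _ in range(num_vertices)]
        self.loads = 0

    def write(self, invocation_id, slot, value):
        self.data[invocation_id][slot] = value

    def read(self, vertex, slot):
        self.loads += 1
        return self.data[vertex][slot]


def foo_plus_one(regs, inv, value, cond, forward):
    """foo[gl_InvocationID] = value; if (cond) foo[gl_InvocationID] += 1;"""
    regs.write(inv, 0, value)
    temp = value  # the TEMP that held the stored value
    if cond:
        # Forwarding is safe: only this invocation can write this vertex.
        # Without it, += 1 becomes a 2D read of our own vertex.
        current = temp if forward else regs.read(inv, 0)
        regs.write(inv, 0, current + 1)


naive, forwarded = OutputRegs(3, 1), OutputRegs(3, 1)
for inv in range(3):
    foo_plus_one(naive, inv, inv, cond=True, forward=False)
    foo_plus_one(forwarded, inv, inv, cond=True, forward=True)
print(naive.data, naive.loads)          # [[1], [2], [3]] 3
print(forwarded.data, forwarded.loads)  # [[1], [2], [3]] 0
```

Both lowerings produce the same outputs; only the number of loads differs.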


On Mon, Sep 1, 2014 at 1:47 PM, Roland Scheidegger <sroland at vmware.com> wrote:
> Am 01.09.2014 18:53, schrieb Ilia Mirkin:
>> On Mon, Sep 1, 2014 at 12:47 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>> Am 01.09.2014 18:19, schrieb Ilia Mirkin:
>>>> On Mon, Sep 1, 2014 at 12:00 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>>> Am 29.08.2014 22:44, schrieb Ilia Mirkin:
>>>>>> Hello,
>>>>>>
>>>>>> I've been thinking a bit about how to properly implement TCS outputs
>>>>>> in TGSI. As a quick reminder, there are per-vertex (i.e. invocation)
>>>>>> and per-patch outputs in TCS. And while you can only write to the
>>>>>> current invocation's per-vertex outputs, you can read from any of
>>>>>> them. (With barrier() used to synchronize invocations.)
>>>>>>
>>>>>> Per-patch outputs map quite nicely onto the existing infrastructure,
>>>>>> so the rest of the questions will be about per-vertex outputs.
>>>>>>
>>>>>> One can represent per-vertex outputs as 2D output arrays. That means
>>>>>> support for them needs to be added all over (which I've actually done,
>>>>>> so I'm not complaining about the extra work but rather asking if it's
>>>>>> a good idea). And then you might have
>>>>>>
>>>>>> DCL OUT[][0], GENERIC
>>>>>> UARL ADDR[1].x, SV[0].xxxx /* invocation id */
>>>>>> MOV OUT[ADDR[1].x][0], TEMP[0] /* store value */
>>>>>> BARRIER
>>>>>> MOV TEMP[0], OUT[3][0] /* read output from invocation == 3 */
>>>>>>
>>>>>> The advantage here is that it's all nice and consistent. However the
>>>>>> disadvantage is that we have to add a totally useless read of the
>>>>>> invocation id and use it as a relative index for the store. At least
>>>>>> the nvidia shaders don't even have a way of writing other invocations'
>>>>>> data even if they wanted to (without resorting to global memory
>>>>>> accesses). So it's complicating all sorts of logic for apparently no
>>>>>> real benefit.
>>>>>>
>>>>>> Another approach might be to bypass the invocation id on storing the
>>>>>> output, but use it on reads. For example, code like
>>>>>>
>>>>>> DCL OUT[0], GENERIC
>>>>>> MOV OUT[0], TEMP[0]
>>>>>> BARRIER
>>>>>> MOV TEMP[0], OUT[3][0]
>>>>>>
>>>>>> This avoids having to teach tgsi about 2d outputs (esp reladdr ones).
>>>>>> This seems a lot simpler, but it ignores the gl_InvocationID indexing
>>>>>> that happens when writing the output. However I don't think that's so
>>>>>> bad. It also means that reads and writes are interpreted a little
>>>>>> differently for OUT's, but that doesn't seem so bad either.
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>
>>>>> I think in the second case though it should be required to declare the
>>>>> inputs separately. It sounds to me like at least on nv50 the access
>>>>> works differently in any case (even if the actual data accessed is the
>>>>> same). I have no idea how other hw handles this, but in any case
>>>>
>>>> On nvc0 there are load and store instructions (nv50 is a little
>>>> different, but it also doesn't support tess). When storing, there's no
>>>> way to provide it the invocation offset. When loading, there is.
>>>>
>>>>> hull shader from d3d11 uses 2d addressed inputs but 1d addressed outputs
>>>>> too -
>>>>> http://msdn.microsoft.com/en-us/library/windows/desktop/hh447211%28v%3Dvs.85%29.aspx
>>>>> (though I don't know what that looks like at the ddi level). Probably GL
>>>>
>>>> Hmmm... well from a quick read of it, they've bypassed this problem by
>>>> creating substages with inputs consuming previous stages' outputs.
>>> Doesn't exactly look like this to me. They still have this both as input
>>> and output in multiple stages.
>>>
>>>>
>>>>> used 2d outputs because it indeed looks more consistent (or perhaps some
>>>>> extension could lift the restriction that only the current invocation be
>>>>> written, though I'm not sure if that would ever make sense).
>>>>> So I think if it doesn't actually make sense to try writing to other
>>>>> invocations' outputs, option 2) makes more sense. I think, though,
>>>>> that in this case the outputs should probably be strictly write-only;
>>>>> I'd guess it would get messy otherwise if you try to read some other
>>>>> invocation's data vs. reading back the current one.
>>>>
>>>> If they were write-only, how would you read another invocation's
>>>> outputs? Or are you suggesting that some new input type be used which
>>>> maps onto the invocations' outputs?
>>>
>>> Yes that's what d3d11 seems to do (as far as I can tell they just have
>>> input control points and output control points). That's why you'd
>>> declare it both as inputs and outputs, even though it is sort of the
>>> same. Can't really tell though if this makes more sense as the gl model,
>>> but this looks cleaner to me than accessing the same var differently (1d
>>> output, 2d input).
>>
>> One thing that occurred to me, which is a problem with any approach
>> that hides any aspect of what's going on, is that you might have
>> something like
>>
>> out int foo[];
>> ...
>> foo[gl_InvocationID] = ...
>> if (...) foo[gl_InvocationID] += 1;
>>
>> Now, it would be nice if the += 1 step could be done without the
>> (presumably expensive) shader input load, instead reusing whatever
>> TEMP was used above. Not sure whether that's too important though.
>>
>
> I think you could do that easily either way (you need to recognize it's
> really reading the current invocation data anyway, and it's not really
> dependent on the tgsi representation).
> btw there is some other reason why I think separate inputs/outputs has
> some merit: outputs are usually uninitialized - but if you just declare
> outputs as a 2d array then this is obviously not really the case, except
> for the one with invocationID (and subject to the barrier stuff actually
> for the other ones though that is true even for separate
> inputs/outputs), which makes for a somewhat awkward register model I
> guess. But still it should be workable.
>
> Roland
>

