[Mesa-dev] TGSI and Tessellation Control Shader outputs

Mon Sep 1 10:47:34 PDT 2014

On 01.09.2014 18:53, Ilia Mirkin wrote:
> On Mon, Sep 1, 2014 at 12:47 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>> On 01.09.2014 18:19, Ilia Mirkin wrote:
>>> On Mon, Sep 1, 2014 at 12:00 PM, Roland Scheidegger <sroland at vmware.com> wrote:
>>>> On 29.08.2014 22:44, Ilia Mirkin wrote:
>>>>> Hello,
>>>>>
>>>>> I've been thinking a bit about how to properly implement TCS outputs
>>>>> in TGSI. As a quick reminder, there are per-vertex (i.e. invocation)
>>>>> and per-patch outputs in TCS. And while you can only write to the
>>>>> current invocation's per-vertex outputs, you can read from any of
>>>>> them. (With barrier() used to synchronize invocations.)
>>>>>
>>>>> Per-patch outputs map quite nicely onto the existing infrastructure,
>>>>> so the rest of the questions will be about per-vertex outputs.
>>>>>
>>>>> One can represent per-vertex outputs as 2D output arrays. That means
>>>>> support for them needs to be added all over (which I've actually done,
>>>>> so I'm not complaining about the extra work but rather asking if it's
>>>>> a good idea). And then you might have
>>>>>
>>>>> DCL OUT[][0], GENERIC
>>>>> MOV ADDR[1].x, SV[0] /* invocation id */
>>>>> MOV OUT[ADDR[1].x][0], TEMP[0] /* store value */
>>>>> BARRIER
>>>>> MOV TEMP[0], OUT[3][0] /* read output from invocation == 3 */
>>>>>
>>>>> The advantage here is that it's all nice and consistent. However the
>>>>> disadvantage is that we have to add a totally useless read of the
>>>>> invocation id and use it as a relative index for the store. At least
>>>>> the nvidia shaders don't even have a way of writing other invocations'
>>>>> data even if they wanted to (without resorting to global memory
>>>>> accesses). So it's complicating all sorts of logic for apparently no
>>>>> real benefit.
>>>>>
>>>>> Another approach might be to bypass the invocation id on storing the
>>>>> output, but using it on reads. For example code like
>>>>>
>>>>> DCL OUT[0], GENERIC
>>>>> MOV OUT[0], TEMP[0]
>>>>> BARRIER
>>>>> MOV TEMP[0], OUT[3][0]
>>>>>
>>>>> This avoids having to teach tgsi about 2d outputs (esp reladdr ones).
>>>>> This seems a lot simpler, but it ignores the gl_InvocationID indexing
>>>>> that happens when writing the output. However I don't think that's so
>>>>> bad. It also means that reads and writes are interpreted a little
>>>>> differently for OUT's, but that doesn't seem so bad either.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>
>>>> I think in the second case though it should be required to declare the
>>>> inputs separately. It sounds to me like at least on nv50 the access
>>>> works different in any case (even if the actual data accessed is the
>>>> same). Though I have no idea how other hw handles this, but in any case
>>>
>>> On nvc0 there are load and store instructions (nv50 is a little
>>> different, but it also doesn't support tess). When storing, there's no
>>> way to provide it the invocation offset. When loading, there is.
>>>
>>>> hull shader from d3d11 uses 2d addressed inputs but 1d addressed outputs
>>>> too -
>>>> http://msdn.microsoft.com/en-us/library/windows/desktop/hh447211%28v%3Dvs.85%29.aspx
>>>> (though I don't know what that looks like at the DDI level). Probably GL
>>>
>>> Hmmm... well from a quick read of it, they've bypassed this problem by
>>> creating substages with inputs consuming previous stages' outputs.
>> Doesn't exactly look like this to me. They still have this both as input
>> and output in multiple stages.
>>
>>>
>>>> used 2d outputs because it indeed looks more consistent (or perhaps some
>>>> extension could lift the restriction that only the current invocation be
>>>> written, though I'm not sure if that would ever make sense).
>>>> So I think if it doesn't actually make sense to try writing to other
>>>> outputs, option 2) makes more sense. I think though in this case the
>>>> outputs should probably be strictly write-only, I'd guess it would get
>>>> messy otherwise if you try to read some other invocation's data vs.
>>>> reading back the current one.
>>>
>>> If they were write-only, how would you read another invocation's
>>> outputs? Or are you suggesting that some new input type be used which
>>> maps onto the invocations' outputs?
>>
>> Yes that's what d3d11 seems to do (as far as I can tell they just have
>> input control points and output control points). That's why you'd
>> declare it both as inputs and outputs, even though it is sort of the
>> same. Can't really tell though if this makes more sense as the gl model,
>> but this looks cleaner to me than accessing the same var differently (1d
>> output, 2d input).
> 
> One thing that occurred to me, and it's a problem with any approach
> that hides any aspect of what's going on, which is that you might have
> like
> 
> out int foo[];
> ...
> foo[gl_InvocationID] = ...
> if (...) foo[gl_InvocationID] += 1;
> 
> Now, it would be nice if the += 1 step could be done without the
> (presumably expensive) shader input load, instead reusing whatever
> TEMP was used above. Not sure whether that's too important though.
> 

I think you could do that easily either way: you need to recognize that
it's really reading the current invocation's data anyway, and that isn't
really dependent on the TGSI representation.
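
A minimal sketch of that recognition step, written in Python rather than
as real TGSI or driver code. The tuple-based instruction format and the
name forward_own_output_reads are hypothetical, made up purely for
illustration, and only straight-line code is handled (no control flow).
The pass remembers which TEMP was last stored to each of the current
invocation's outputs and turns a read-back of that output into a plain
move:

```python
# Hypothetical mini-IR, NOT real TGSI. Each instruction is a tuple:
#   ("store_out", slot, temp)     write the current invocation's output slot
#   ("load_out", dst, inv, slot)  read output `slot` of invocation `inv`;
#                                 inv == "self" means the current invocation
#   ("mov", dst, src)
#   ("barrier",)

def forward_own_output_reads(instrs):
    """Replace reads of the current invocation's own outputs with moves
    from the TEMP that was last stored there (straight-line code only)."""
    last_store = {}  # output slot -> temp that holds the last stored value
    result = []
    for ins in instrs:
        op = ins[0]
        if op == "store_out":
            _, slot, temp = ins
            last_store[slot] = temp
        elif op == "load_out":
            _, dst, inv, slot = ins
            if inv == "self" and slot in last_store:
                # No output load needed: the value is still in a TEMP.
                ins = ("mov", dst, last_store[slot])
        if ins[0] in ("mov", "load_out"):
            # The destination TEMP is overwritten, so it can no longer
            # stand in for any output it was previously stored to.
            dst = ins[1]
            last_store = {s: t for s, t in last_store.items() if t != dst}
        result.append(ins)
    return result
```

A barrier does not invalidate anything here, because only the current
invocation can write its own outputs. Reads of other invocations'
outputs are left untouched and still need the real load.
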
Btw, there is another reason why I think separate inputs/outputs have
some merit: outputs are usually uninitialized. If you just declare the
outputs as a 2D array, that is obviously not really the case, except
for the one indexed by the invocation ID. (The other ones are subject to
the barrier stuff, though that is true even with separate
inputs/outputs.) That makes for a somewhat awkward register model, I
guess. But it should still be workable.
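
To make the separate inputs/outputs model concrete, here is a toy Python
model. It is an assumption about how such a split could behave, not Mesa
code and not a statement of what any hardware does; the class name
PatchOutputs and its methods are hypothetical. Each invocation writes a
private, 1D-addressed, write-only output, and a separate 2D-addressed
read view only becomes defined at a barrier:

```python
class PatchOutputs:
    """Toy model of write-only 1D outputs plus a 2D read-only view."""

    def __init__(self, num_invocations, num_slots):
        # What each invocation has written so far (not readable directly).
        self.pending = [[None] * num_slots for _ in range(num_invocations)]
        # What reads can observe; None stands for "undefined".
        self.visible = [[None] * num_slots for _ in range(num_invocations)]

    def write(self, invocation, slot, value):
        # 1D addressing: an invocation can only name its own slots.
        self.pending[invocation][slot] = value

    def barrier(self):
        # After the barrier, all writes made so far are visible to reads.
        self.visible = [row[:] for row in self.pending]

    def read(self, invocation, slot):
        # 2D addressing on the read side only.
        return self.visible[invocation][slot]
```

In this model the outputs really are uninitialized registers from the
writer's point of view, and everything readable goes through the
separate view, which is the register-model difference described above.
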

Roland