[Mesa-dev] [PATCH 02/23] tgsi: add Stream{X, Y, Z, W} fields to tgsi_declaration_semantic

Fri Dec 2 21:48:31 UTC 2016

Am 02.12.2016 um 20:44 schrieb Nicolai Hähnle:
> On 02.12.2016 19:46, Roland Scheidegger wrote:
>> Am 02.12.2016 um 18:23 schrieb Nicolai Hähnle:
>>> On 30.11.2016 21:37, Roland Scheidegger wrote:
>>>> Am 30.11.2016 um 20:19 schrieb Nicolai Hähnle:
>>>>> On 30.11.2016 19:06, Roland Scheidegger wrote:
>>>>>> Am 30.11.2016 um 14:35 schrieb Nicolai Hähnle:
>>>>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>>
>>>>>>> This is for geometry shader outputs. Without it, drivers have no way of
>>>>>>> knowing which stream each output is intended for, and have to
>>>>>>> conservatively write all outputs to all streams.
>>>>>>>
>>>>>>> Separate stream numbers for each component are required due to output
>>>>>>> packing.
>>>>>> Are you sure this is true?
>>>>>> This is an area I don't know much about, but
>>>>>> https://www.opengl.org/wiki/Layout_Qualifier_(GLSL)
>>>>>> tells me "Stream assignments for a geometry shader are required to be
>>>>>> the same for all members of a block, but offsets are not."
>>>>>>
>>>>>> Therefore I don't think output packing should ever happen across
>>>>>> multiple streams. I think it would be MUCH nicer if the semantic needed
>>>>>> just one stream member...
>>>>>
>>>>> There are two variants of that question, I guess.
>>>>>
>>>>> The answer to the first variant is: Yes, this is currently true.
>>>>> lower_packed_varyings will happily pack outputs from different vertex
>>>>> streams into the same vec4. This affects quite a lot of programs, e.g.
>>>>> you see it in piglit arb_gpu_shader5-xfb-streams.
>>>>>
>>>>> The second question is: Do we want it to be true? I agree that it would
>>>>> be convenient to be able to use a single Stream member. Also, isolating
>>>>> the stream0 components from the rest would lead to slightly more
>>>>> efficient shaders for us in some cases.
>>>>>
>>>>> I opted against it so far because I didn't want to think through the
>>>>> implications of changing lower_packed_varyings. The main question I have
>>>>> is: if you account for the size of the GS output in # of components,
>>>>> then it could happen that the number of output vec4s ends up being
>>>>> larger than (max # of output components) / 4. Will that be a problem
>>>>> somewhere?
>>>>
>>>> I don't know if that would be a problem, but if it is I'd assume this
>>>> would be fixable (since the number of actual components ultimately
>>>> doesn't change).
>>>> Having outputs belonging to multiple streams in a single output just
>>>> seems weird...
>>>> That said, I wonder if it actually would be possible to do that with
>>>> d3d11 too.
>>>> With shader model 5 you'd have:
>>>> dcl_stream 0
>>>> dcl_output o0.xy
>>>> dcl_stream 1
>>>> dcl_output o0.zw // legal or not???
>>>>
>>>> Though the shader model 4/5 rules are a bit weird for packing
>>>> inputs/outputs, I'm not even sure two dcl_output are legal for the same
>>>> reg without a dcl_stream in between them (but you can pack system values
>>>> together with ordinary inputs/outputs).
>>>>
>>>> So maybe just allowing this is the right solution...
>>>
>>> I played around with the DX shader compiler, and I have some annoying
>>> news. SM5 actually uses not just the same output register but even the
>>> same component for multiple streams -- see the output I've pasted at the
>>> end.
>>>
>>> So how to proceed? To simplify things going forward, I'm mostly
>>> convinced that the GLSL output packing should be changed to pack outputs
>>> by stream. As I mentioned previously, this has other minor advantages
>>> for us anyway.
>>>
>>> Then one possibility to accommodate SM5 would be to have a Stream
>>> bitmask, one bit per stream, as part of the output semantics. The
>>> downside of this is that I wanted to use the WriteMask as an additional
>>> optimization to avoid writing out unused components, and you'd then need
>>> separate WriteMasks for each stream.
>>>
>>> The other possibility, which I prefer, would be to have just a single
>>> Stream field indicating one stream number per output register, and
>>> aliasing is just not allowed despite what SM5 wants.
> 
> I have to go back on that unfortunately: I forgot that it's possible to
> create location aliasing across vertex streams via ARB_enhanced_layouts.
> I looked hard and found nothing in the spec that would forbid it, and
> our closed source driver also allows it.
Oh, hmm, it looks like you can assign individual locations to basically
anything, as long as a component doesn't overlap the same component in
another declaration?
So yes, I guess you're right, this is indeed needed.
It's just too bad that it's complex and still not quite enough to meet
d3d11 demands, but it looks reasonable then.

Roland
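To make the per-component idea concrete, here is a minimal sketch of how one vec4 slot could collect a stream number per component while rejecting overlapping components. It assumes the reading above is right (aliasing a location across streams is fine as long as no component is claimed twice). The names gs_output_decl, slot_streams and slot_add_decl are hypothetical and are not Mesa or TGSI API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified GS output declaration (not a Mesa structure). */
struct gs_output_decl {
    unsigned location;   /* vec4 slot; the caller uses it to pick the slot */
    unsigned writemask;  /* bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w */
    unsigned stream;     /* vertex stream, 0..3 */
};

/* One vec4 slot: a stream number per component, in the spirit of
 * Stream{X, Y, Z, W}. */
struct slot_streams {
    unsigned used_mask;  /* components already claimed by a declaration */
    unsigned stream[4];  /* stream per component, valid where used_mask is set */
};

/* Adds a declaration to its slot. Returns false, leaving the slot
 * unchanged, if any component is already claimed by another declaration. */
static bool slot_add_decl(struct slot_streams *slot,
                          const struct gs_output_decl *decl)
{
    assert(decl->stream < 4 && decl->writemask <= 0xf);

    if (slot->used_mask & decl->writemask)
        return false;

    for (unsigned c = 0; c < 4; c++) {
        if (decl->writemask & (1u << c))
            slot->stream[c] = decl->stream;
    }
    slot->used_mask |= decl->writemask;
    return true;
}
```

With this, a declaration covering .xy on stream 0 and another covering .zw on stream 1 can share a slot, while a third one touching .z is rejected.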


> So my plan now is to leave the StreamXYZW stuff as is. I will send
> around a v2 of this series to account for this use case (because there's
> still a problem in the GLSL-to-TGSI translation), plus some
> radeonsi-specific additions.
> 
> I'm also going to send a piglit test around that exercises this.
> 
> Cheers,
> Nicolai
> 
>>>
>>> TGSI -> SM5 conversion is trivial.
>>>
>>> SM5 -> TGSI conversion is also possible despite the aliasing on the DX
>>> side, because the doc says this about emit_stream: "Af[t]er the emit,
>>> all data in all output registers for all streams become uninitialized,
>>> not just the stream emitted to."
>> Oh that's pretty interesting, since emit didn't have that part about
>> outputs becoming uninitialized. Maybe that's just what was needed to
>> keep implementations sane when allowing the crazy "same output multiple
>> stream" stuff... Or I suppose it's not actually that crazy then...
>>
>>
>>> (https://msdn.microsoft.com/en-us/library/windows/desktop/hh447051(v=vs.85).aspx).
>>> So you have to look ahead to the next emit_stream for disambiguation,
>>> but it's clearly doable.
>>>
>>> Any objections to that approach?
>> Sounds good to me. I agree it would be complicated for tgsi to do what
>> sm5 wants directly - there's other stuff we already have to translate
>> anyway there (the packing of system values and ordinary inputs/outputs).
>> I suppose when we need this we could just use multiple outputs
>> instead of one when they share a stream.
>>
>> Roland
>>
>>
>>>
>>> Thanks,
>>> Nicolai
>>> ---
>>> //
>>> // Generated by Microsoft (R) HLSL Shader Compiler 10.0.10011.16384
>>> //
>>> //
>>> //
>>> // Input signature:
>>> //
>>> // Name                 Index   Mask Register SysValue  Format   Used
>>> // -------------------- ----- ------ -------- -------- ------- ------
>>> // SV_POSITION              0   xyzw        0      POS   float   xyzw
>>> // TEXCOORD                 0   xyz         1     NONE   float
>>> // TEXCOORD                 1   xy          2     NONE   float
>>> //
>>> //
>>> // Output signature:
>>> //
>>> // Name                 Index   Mask Register SysValue  Format   Used
>>> // -------------------- ----- ------ -------- -------- ------- ------
>>> // m0:TEXCOORD              0   x           0     NONE   float   x
>>> // m0:TEXCOORD              1    y          0     NONE   float    y
>>> // m1:TEXCOORD              0   x           0     NONE   float   x
>>> // m1:TEXCOORD              1    y          0     NONE   float    y
>>> //
>>> gs_5_0
>>> dcl_globalFlags refactoringAllowed
>>> dcl_input_siv v[3][0].xyzw, position
>>> dcl_input v[3][1].xyz
>>> dcl_input v[3][2].xy
>>> dcl_inputprimitive triangle
>>> dcl_stream m0
>>> dcl_outputtopology pointlist
>>> dcl_output o0.x
>>> dcl_output o0.y
>>> dcl_stream m1
>>> dcl_outputtopology pointlist
>>> dcl_output o0.x
>>> dcl_output o0.y
>>> dcl_maxout 12
>>> mov o0.xy, v[0][0].xzxx
>>> emit_stream m0
>>> mov o0.xy, v[1][0].ywyy
>>> emit_stream m1
>>> ret
>>> // Approximately 5 instruction slots used
>>
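
The look-ahead to the next emit_stream mentioned in the thread can be sketched roughly as follows. This is only an illustration under stated assumptions: it handles straight-line code like the listing above (no control flow), and the sm5_inst type and resolve_write_streams function are hypothetical, not part of any existing translator. It relies on the quoted rule that all output registers become uninitialized after an emit, so every output write can only feed the next emit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced SM5 instruction representation. */
enum sm5_op { SM5_WRITE_OUTPUT, SM5_EMIT_STREAM };

struct sm5_inst {
    enum sm5_op op;
    unsigned reg;        /* output register index (writes only) */
    unsigned writemask;  /* components written (writes only) */
    unsigned stream;     /* emit: stream emitted to; write: filled in below */
};

/* Tags every output write with the stream of the next emit_stream in
 * program order. Returns the number of writes that have no following emit. */
static size_t resolve_write_streams(struct sm5_inst *insts, size_t count)
{
    size_t unresolved = 0;
    bool have_emit = false;
    unsigned next_stream = 0;

    assert(insts != NULL || count == 0);

    /* Walk backwards so the most recently seen emit is always the next
     * emit in program order for the instruction being visited. */
    for (size_t i = count; i-- > 0; ) {
        if (insts[i].op == SM5_EMIT_STREAM) {
            next_stream = insts[i].stream;
            have_emit = true;
        } else if (have_emit) {
            insts[i].stream = next_stream;
        } else {
            unresolved++;
        }
    }
    return unresolved;
}
```

For the listing above, the first mov to o0.xy would be tagged with stream 0 and the second with stream 1, which a translator could then turn into two separate outputs.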