[Mesa-dev] [PATCH 02/23] tgsi: add Stream{X, Y, Z, W} fields to tgsi_declaration_semantic

Fri Dec 2 17:23:11 UTC 2016

On 30.11.2016 21:37, Roland Scheidegger wrote:
> Am 30.11.2016 um 20:19 schrieb Nicolai Hähnle:
>> On 30.11.2016 19:06, Roland Scheidegger wrote:
>>> Am 30.11.2016 um 14:35 schrieb Nicolai Hähnle:
>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>
>>>> This is for geometry shader outputs. Without it, drivers have no way of
>>>> knowing which stream each output is intended for, and have to
>>>> conservatively write all outputs to all streams.
>>>>
>>>> Separate stream numbers for each component are required due to output
>>>> packing.
>>> Are you sure this is true?
>>> This is an area I don't know much about, but
>>> https://www.opengl.org/wiki/Layout_Qualifier_(GLSL)
>>> tells me "Stream
>>> assignments for a geometry shader are required to be the same for all
>>> members of a block, but offsets are not."
>>>
>>> Therefore I don't think output packing should ever happen across
>>> multiple streams. I think it would be MUCH nicer if the semantic needed
>>> just one stream member...
>>
>> There are two variants of that question, I guess.
>>
>> The answer to the first variant is: Yes, this is currently true.
>> lower_packed_varyings will happily pack outputs from different vertex
>> streams into the same vec4. This affects quite a lot of programs, e.g.
>> you see it in piglit arb_gpu_shader5-xfb-streams.
>>
>> The second question is: Do we want it to be true? I agree that it would
>> be convenient to be able to use a single Stream member. Also, isolating
>> the stream0 components from the rest would lead to slightly more
>> efficient shaders for us in some cases.
>>
>> I opted against it so far because I didn't want to think through the
>> implications of changing lower_packed_varyings. The main question I have
>> is: if you account for the size of the GS output in # of components,
>> then it could happen that the number of output vec4s ends up being
>> larger than (max # of output components) / 4. Will that be a problem
>> somewhere?
>
> I don't know if that would be a problem, but if it is I'd assume this
> would be fixable (since the number of actual components ultimately
> doesn't change).
> Having outputs belonging to multiple streams in a single output just
> seems weird...
> That said, I wonder if it actually would be possible to do that with
> d3d11 too.
> With shader model 5 you'd have:
> dcl_stream 0
> dcl_output o0.xy
> dcl_stream 1
> dcl_output o0.zw // legal or not???
>
> Though the shader model 4/5 rules are a bit weird for packing
> inputs/outputs, I'm not even sure two dcl_output are legal for the same
> reg without a dcl_stream in between them (but you can pack system values
> together with ordinary inputs/outputs).
>
> So maybe just allowing this is the right solution...

I played around with the DX shader compiler, and I have some annoying 
news. SM5 actually uses not just the same output register but even the 
same component for multiple streams -- see the output I've pasted at the 
end.

So how to proceed? To simplify things going forward, I'm mostly 
convinced that the GLSL output packing should be changed to pack outputs 
by stream. As I mentioned previously, this has other minor advantages 
for us anyway.

Then one possibility to accommodate SM5 would be to have a Stream 
bitmask, one bit per stream, as part of the output semantics. The 
downside of this is that I wanted to use the WriteMask as an additional 
optimization to avoid writing out unused components, and you'd then need 
separate WriteMasks for each stream.

The other possibility, which I prefer, would be to have just a single 
Stream field indicating one stream number per output register, and 
aliasing is just not allowed despite what SM5 wants.
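
To make the single-Stream option concrete, here is a minimal sketch of the check it implies: an output register can be described by one stream number only if every written component belongs to the same stream. The struct and function names are hypothetical and do not correspond to actual TGSI or Mesa definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-output description; illustrative only, not a Mesa struct. */
struct toy_output_decl {
   unsigned write_mask;      /* bit i set => component i (x, y, z, w) is written */
   unsigned char stream[4];  /* stream number assigned to each component */
};

/* Returns true and stores the stream in *single if all written components
 * use the same stream, i.e. one Stream field would be enough for this output.
 * Returns false if streams are mixed or if no component is written. */
static bool toy_single_stream(const struct toy_output_decl *decl,
                              unsigned *single)
{
   bool found = false;

   assert(decl && single);

   for (unsigned i = 0; i < 4; i++) {
      if (!(decl->write_mask & (1u << i)))
         continue;               /* unwritten components do not matter */
      if (!found) {
         *single = decl->stream[i];
         found = true;
      } else if (decl->stream[i] != *single) {
         return false;           /* components from different streams */
      }
   }
   return found;
}
```

With packing done per stream, this check would be expected to succeed for every output; with the current cross-stream packing it can fail for a vec4 that mixes, say, stream 0 in .xy and stream 1 in .zw.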

TGSI -> SM5 conversion is trivial.

SM5 -> TGSI conversion is also possible despite the aliasing on the DX 
side, because the doc says this about emit_stream: "Af[t]er the emit, 
all data in all output registers for all streams become uninitialized, 
not just the stream emitted to." 
(https://msdn.microsoft.com/en-us/library/windows/desktop/hh447051(v=vs.85).aspx). 
So you have to look ahead to the next emit_stream for disambiguation, 
but it's clearly doable.
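
The look-ahead can be sketched roughly as follows, assuming straight-line code with no control flow between a write and its emit. The instruction model below is a hypothetical toy, not the SM5 token format or any Mesa data structure. Each output write is attributed to the stream of the next emit_stream that follows it; scanning backwards turns the look-ahead into a single pass.

```c
#include <assert.h>

/* Hypothetical, simplified instruction model for illustration only. */
enum toy_op { TOY_WRITE_OUTPUT, TOY_EMIT_STREAM, TOY_RET };

struct toy_inst {
   enum toy_op op;
   unsigned reg;     /* output register index (used by TOY_WRITE_OUTPUT) */
   unsigned stream;  /* stream index (used by TOY_EMIT_STREAM) */
};

/* For every output write, store in stream_of[i] the stream of the next
 * emit that follows it. Non-write instructions, and writes that are never
 * followed by an emit, get -1. Assumes straight-line code. */
static void toy_resolve_streams(const struct toy_inst *insts, int count,
                                int *stream_of)
{
   int next_stream = -1;

   assert(insts && stream_of && count >= 0);

   for (int i = count - 1; i >= 0; i--) {
      if (insts[i].op == TOY_EMIT_STREAM)
         next_stream = (int)insts[i].stream;

      stream_of[i] = (insts[i].op == TOY_WRITE_OUTPUT) ? next_stream : -1;
   }
}
```

Applied to a sequence shaped like the shader pasted at the end (write o0, emit m0, write o0, emit m1, ret), the first write resolves to stream 0 and the second to stream 1, even though both target the same register. Real shaders with branches or loops around the emits would need a proper data-flow analysis rather than this linear scan.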

Any objections to that approach?

Thanks,
Nicolai
---
//
// Generated by Microsoft (R) HLSL Shader Compiler 10.0.10011.16384
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// SV_POSITION              0   xyzw        0      POS   float   xyzw
// TEXCOORD                 0   xyz         1     NONE   float
// TEXCOORD                 1   xy          2     NONE   float
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// m0:TEXCOORD              0   x           0     NONE   float   x
// m0:TEXCOORD              1    y          0     NONE   float    y
// m1:TEXCOORD              0   x           0     NONE   float   x
// m1:TEXCOORD              1    y          0     NONE   float    y
//
gs_5_0
dcl_globalFlags refactoringAllowed
dcl_input_siv v[3][0].xyzw, position
dcl_input v[3][1].xyz
dcl_input v[3][2].xy
dcl_inputprimitive triangle
dcl_stream m0
dcl_outputtopology pointlist
dcl_output o0.x
dcl_output o0.y
dcl_stream m1
dcl_outputtopology pointlist
dcl_output o0.x
dcl_output o0.y
dcl_maxout 12
mov o0.xy, v[0][0].xzxx
emit_stream m0
mov o0.xy, v[1][0].ywyy
emit_stream m1
ret
// Approximately 5 instruction slots used