[Mesa-dev] [PATCH] glsl_to_tgsi: indirect array information

Wed Jan 23 04:24:12 PST 2013

On 23.01.2013 11:31, Christian König wrote:
> On 23.01.2013 03:18, Tom Stellard wrote:
>> On Wed, Jan 23, 2013 at 02:20:21AM +0100, Christoph Bumiller wrote:
>>> On 23.01.2013 02:07, Vadim Girlin wrote:
>>>> On 01/23/2013 04:42 AM, Christoph Bumiller wrote:
>>>>> On 23.01.2013 01:21, Vadim Girlin wrote:
>>>>>> On 01/23/2013 03:59 AM, Vincent Lejeune wrote:
>>>>>>>
>>>>>>> ----- Mail original -----
>>>>>>>> De : Vadim Girlin <vadimgirlin at gmail.com>
>>>>>>>> À : Christoph Bumiller <e0425955 at student.tuwien.ac.at>
>>>>>>>> Cc : mesa-dev at lists.freedesktop.org
>>>>>>>> Envoyé le : Mercredi 23 janvier 2013 0h44
>>>>>>>> Objet : Re: [Mesa-dev] [PATCH] glsl_to_tgsi: indirect array
>>>>>>>> information
>>>>>>>>
>>>>>>>> On 01/22/2013 10:59 PM, Christoph Bumiller wrote:
>>>>>>>>>     On 21.01.2013 21:10, Vadim Girlin wrote:
>>>>>>>>>>     Provide the information about indirectly addressable arrays
>>>>>>>>>> (ranges of
>>>>>>>> temps) in
>>>>>>>>>>     the shader to the drivers. TGSI representation itself isn't
>>>>>>>> modified, array
>>>>>>>>>>     information is passed as an additional data in the
>>>>>>>>>> pipe_shader_state,
>>>>>>>> so the
>>>>>>>>>>     drivers can use it as a hint for optimization.
>>>>>>>>>>     ---
>>>>>>>>>>
>>>>>>>>>>     It's far from being an ideal solution, but I saw the
>>>>>>>>>> discussions
>>>>>>>> about that
>>>>>>>>>>     problem starting from 2009 IIRC, and we still have no
>>>>>>>>>> solution
>>>>>>>>>> (neither
>>>>>>>> good
>>>>>>>>>>     nor bad) despite the years passed. I hope we can use this not
>>>>>>>>>> very
>>>>>>>> intrusive
>>>>>>>>>>     approach until we get something better.
>>>>>>>>>>
>>>>>>>>>     I'd rather not have any hacks in the interface, let alone ones
>>>>>>>>> that
>>>>>>>>>     solve the problem only partially (you still won't know which
>>>>>>>>> array is
>>>>>>>>>     accessed by a particular instruction, which is important for
>>>>>>>>>     optimization and essential in some cases for making
>>>>>>>>> INPUT/OUTPUT
>>>>>>>>> arrays
>>>>>>>>>     work), and not just because it reduces the pressure on
>>>>>>>>> people to
>>>>>>>>>     implement a proper solution.
>>>>>>>>>
>>>>>>>>>     With this, you just get to know which range of TEMPs are
>>>>>>>>> indirectly
>>>>>>>>>     addressed and which ones are not, and you can do the same by
>>>>>>>>> simply
>>>>>>>>>     creating multiple declarations of TEMPs, one for each
>>>>>>>>> array, and
>>>>>>>>> adding
>>>>>>>>>     a single bit of info to tgsi_declaration (which has 7 bits of
>>>>>>>>> padding
>>>>>>>>>     anyway, so ample space), which is a lot less ugly, and doesn't
>>>>>>>>> suffer
>>>>>>>>>     from an arbitrary limit, and doesn't require any
>>>>>>>>> modification of
>>>>>>>> drivers
>>>>>>>>>     either.
>>>>>>>>>
>>>>>>>> Array accessed by any indirect operand can be identified by the
>>>>>>>> immediate offset, e.g. TEMP[ADDR[0].x+1] implies the array starting
>>>>>>>> from
>>>>>>>> 1, thus we can find it's entry in the information provided by this
>>>>>>>> patch
>>>>>>>> to get the addressable range for every indirect operand. If I'm not
>>>>>>>> missing something, glsl_to_tgsi accumulates all other parts of the
>>>>>>>> offset in the address register before the indirect access. If I'm
>>>>>>>> wrong,
>>>>>>>> we can fix it to ensure such behavior.
>>>>>>> I'm not sure about that ; when I worked on indirect addressing of
>>>>>>> const memory,
>>>>>>> I discovered when tracking vp/fo regression that the immediate
>>>>>>> offset
>>>>>>> is the result of
>>>>>>>     glsl_to_tgsi constant propagation and not related to the
>>>>>>> underlying
>>>>>>> array.
>>>>>>> This means that the dynamic index can be negative, which is not
>>>>>>> always
>>>>>>> desirable depending on the hw. (In R600 case, const fetch
>>>>>>> instruction
>>>>>>> does not
>>>>>>> support negative index. MOVA inst does).
>>>>>>>
>>>>>>> For instance, the following pseudo code snippet is fine for an index
>>>>>>> value of -4 :
>>>>>>>
>>>>>>> uniform int index;
>>>>>>>
>>>>>>> float array[4];
>>>>>>> float data = array[6 + index];
>>>>>>>
>>>>>>> and is lowered to
>>>>>>> MOV TEMP[0] TEMP[ADDR[0].x + 6];
>>>>>>>
>>>>>> I tried the following shader:
>>>>>>
>>>>>>> uniform int index;
>>>>>>>
>>>>>>> void main()
>>>>>>> {
>>>>>>>       float array[4] = float[4](0.1, 0.2, 0.3, 0.4);
>>>>>>>       float data = array[6 + index];
>>>>>>>       gl_FragColor = vec4(data, 1.0, 0.0, 1.0);
>>>>>>> }
>>>>>> Resulting TGSI:
>>>>>>
>>>>>>> --------------------------------------------------------------
>>>>>>> FRAG
>>>>>>> PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
>>>>>>> DCL OUT[0], COLOR
>>>>>>> DCL CONST[0]
>>>>>>> DCL TEMP[0], LOCAL
>>>>>>> DCL TEMP[1], LOCAL
>>>>>>> DCL TEMP[2], LOCAL
>>>>>>> DCL TEMP[3], LOCAL
>>>>>>> DCL TEMP[4], LOCAL
>>>>>>> DCL TEMP[5], LOCAL
>>>>>>> DCL TEMP[6], LOCAL
>>>>>>> DCL TEMP[7], LOCAL
>>>>>>> DCL ADDR[0]
>>>>>>> IMM[0] FLT32 {    0.1000,     0.2000,     0.3000,     0.4000}
>>>>>>> IMM[1] FLT32 {    1.0000,     0.0000,     0.0000,     0.0000}
>>>>>>> IMM[2] INT32 {6, 0, 0, 0}
>>>>>>>     0: MOV TEMP[1].yzw, IMM[1].yxyx
>>>>>>>     1: MOV TEMP[2], IMM[0].xxxx
>>>>>>>     2: MOV TEMP[3], IMM[0].yyyy
>>>>>>>     3: MOV TEMP[4], IMM[0].zzzz
>>>>>>>     4: MOV TEMP[5], IMM[0].wwww
>>>>>>>     5: UADD TEMP[6].x, IMM[2].xxxx, CONST[0].xxxx
>>>>>>>     6: UARL ADDR[0].x, TEMP[6].xxxx
>>>>>>>     7: MOV TEMP[1].x, TEMP[ADDR[0].x+2].xxxx
>>>>>>>     8: MOV_SAT OUT[0], TEMP[1]
>>>>>>>     9: END
>>>>>>> --------------------------------------------------------------
>>>>>> Also I tried the following:
>>>>>>
>>>>>>> uniform float array[4];
>>>>>>> uniform int index;
>>>>>>>
>>>>>>> void main()
>>>>>>> {
>>>>>>>       float data = array[6 + index];
>>>>>>>       gl_FragColor = vec4(data, 1.0, 0.0, 1.0);
>>>>>>> }
>>>>>> Resulting TGSI:
>>>>>>
>>>>>>> --------------------------------------------------------------
>>>>>>> FRAG
>>>>>>> PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
>>>>>>> DCL OUT[0], COLOR
>>>>>>> DCL CONST[0..4]
>>>>>>> DCL TEMP[0], LOCAL
>>>>>>> DCL TEMP[1], LOCAL
>>>>>>> DCL ADDR[0]
>>>>>>> IMM[0] FLT32 {    1.0000,     0.0000,     0.0000,     0.0000}
>>>>>>> IMM[1] INT32 {6, 0, 0, 0}
>>>>>>>     0: MOV TEMP[0].yzw, IMM[0].yxyx
>>>>>>>     1: UADD TEMP[1].x, IMM[1].xxxx, CONST[0].xxxx
>>>>>>>     2: UARL ADDR[0].x, TEMP[1].xxxx
>>>>>>>     3: MOV TEMP[0].x, CONST[ADDR[0].x+1].xxxx
>>>>>>>     4: MOV_SAT OUT[0], TEMP[0]
>>>>>>>     5: END
>>>>>>> --------------------------------------------------------------
>>>>>> So far immediate offset in the indirect operand is always equal to
>>>>>> the
>>>>>> start offset of the array. Could you provide some more complete
>>>>>> example
>>>>>> that demonstrates the problem, please.
>>>>>>
>>>>>> Vadim
>>>>>>
>>>>> Not really, because shaders like
>>>>>
>>>>> float array[8];
>>>>>
>>>>> uniform int pos;
>>>>>
>>>>> void main()
>>>>> {
>>>>>      array[0] = 1.0;
>>>>>      array[1] = 2.0;
>>>>>      array[2] = 3.0;
>>>>>      array[3] = 4.0;
>>>>>      gl_FragColor = vec4(array[pos - 16],
>>>>>                          array[pos - 17],
>>>>>                  array[pos - 18],
>>>>>                  array[pos - 19]);
>>>>> }
>>>>>
>>>>> yield the terribly unoptimized
>>>>>
>>>>>     0: MOV TEMP[1].x, IMM[0].xxxx
>>>>>     1: MOV TEMP[2].x, IMM[0].yyyy
>>>>>     2: MOV TEMP[3].x, IMM[0].zzzz
>>>>>     3: MOV TEMP[4].x, IMM[0].wwww
>>>>>     4: UADD TEMP[9].x, CONST[0].xxxx, IMM[1].xxxx
>>>>>     5: UARL ADDR[0].x, TEMP[9].xxxx
>>>>>     6: MOV TEMP[10].x, TEMP[ADDR[0].x+1].xxxx
>>>>>     7: UADD TEMP[11].x, CONST[0].xxxx, IMM[1].yyyy
>>>>>     8: UARL ADDR[0].x, TEMP[11].xxxx
>>>>>     9: MOV TEMP[10].y, TEMP[ADDR[0].x+1].xxxx
>>>>>    10: UADD TEMP[12].x, CONST[0].xxxx, IMM[1].zzzz
>>>>>    11: UARL ADDR[0].x, TEMP[12].xxxx
>>>>>    12: MOV TEMP[10].z, TEMP[ADDR[0].x+1].xxxx
>>>>>    13: UADD TEMP[13].x, CONST[0].xxxx, IMM[1].wwww
>>>>>    14: UARL ADDR[0].x, TEMP[13].xxxx
>>>>>    15: MOV TEMP[10].w, TEMP[ADDR[0].x+1].xxxx
>>>>>    16: MOV OUT[0], TEMP[10]
>>>>>    17: END
>>>>>
>>>>> instead of simply adjusting the offset and NOT emitting tons of ARLs.
>>>>> But this is NOT guaranteed behaviour and neither should it be (I did
>>>>> suggest that in the past, but some people disagreed and they convinced
>>>>> me).
>>>>>
>>>> I agree that it's terribly unoptimized, but it doesn't change the fact
>>>> that currently we can use immediate offset to match it to the array
>>>> info. Is anybody going to optimize this tomorrow to break this patch?
>>>>
>>> If you intend to mandate this behaviour so that drivers can rely on it,
>>> you forgot to place a big fat warning somewhere shader authors can see
>>> it. Having this kind of behaviour and in addition leaving it
>>> undocumented is unacceptable.
>>>
>>>> We can discuss it forever though and probably it doesn't makes sense.
>>>> This discussion won't be any different from the previous discussions of
>>>> that problem - the result will be same - we'll have no working
>>>> solution.
>>>>
>>> I prefer no solution at all over a bad one that makes it easier for
>>> people to ignore the issue and thus increase the probability of this
>>> never being fixed.
>>>
>>> Indirect TEMPs aren't that common, you'll survive until there is a poper
>>> solution.
>>> (And even when they're present, a lot of the direct TEMP accesses will
>>> likely be saved by a pass that promotes memory access to registers.)
>>>
>>> Doesn't anyone work on gallium+compiler full time anymore who can do
>>> this properly ?
>>>
>> I think it would be helpful if we could agree on a solution and document
>> that decision so that someone who was interested in implementing this
>> correctly would know what to do and not have to worry about wasting time
>> on something that wouldn't be accepted.
>>
>> So far we have 4 proposed solutions:
>>
>> 1. Pass additional array information in pipe_shader_state
>>
>>    
>> http://lists.freedesktop.org/archives/mesa-dev/2013-January/033111.html
>>
>> 2. Store all values that may be accessed by indirect addressing in
>>     the TGSI_FILE_TEMPORARY_ARRAY file.
>>
>>    
>> http://lists.freedesktop.org/archives/mesa-dev/2012-November/029575.html
>>
>> 3. Use temporary register range declaration to distinguish between array
>>     objects and then identify the objects using a constant indirect
>>     offset.
>>
>>    
>> http://lists.freedesktop.org/archives/mesa-dev/2012-November/029764.html
>>
>> 4. Clearly define arrays in the TGSI declarations with a numeric
>>     identifier.
>>
>>    
>> http://lists.freedesktop.org/archives/mesa-dev/2012-November/030476.html
>>
>> It doesn't seem like we can come to a consensus on any one
>> implementation, so let's indicate our preferences like this:
>>
>> + Identify your preferred solution
>> + Identify all solutions that you would be acceptable to you, even if you
>>    think they have some flaws.
>>
>> Maybe this will help us choose a solution (If it doesn't then we
>> can just ask Brian to decide).
>>
>> For me, I prefer solution #4 even though it is the most work, but
>> solution #3 would be acceptable to me.
>>
>> What does everyone else think?
> 
> Definitely solution #4. Solution #3 would only be acceptable if we had
> some time pressure, but I don't see that right now.

I concur. #4 looks like the solution we'd be using if were just in the
design phase of TGSI.

> 
> Also I don't think solution #4 would be so much work to do, we only need
> to emit each array separately (something which the available patches
> should already do), and then store the resulting index into
> "DimensionIndex".
> 

Actually, DimensionIndex is used for 2D inputs and outputs in the
HS,DS,GS stages.
Of course, that doesn't affect TEMPs, but GL also allows input and
output arrays, and for hardware that encourages (or requires) you to lay
out inputs/outputs in a different order than the one declared in TGSI,
or special treatment based on semantic (clamp before writing color,
special locations for patch constants (tessellation stuff), etc.), it
would be essential to know which of the input/output ranges will be
accessed.

BUT, it looks like we could simply reuse the padding (14 bits) in
tgsi_dimension for a second index, so it's not actually much harder to
handle all the other files, too.

> That behavior should of course be driver specific until all drivers
> understand that new form of addressing.
> 

Also, I think if we kept indices running continuously like in solution
#3 (at least initially), existing driver code would not need any adjustment.

> Christian.
> 
>>
>> -Tom
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>