[Mesa-dev] [PATCH] glsl_to_tgsi: indirect array information

Tue Jan 22 18:18:08 PST 2013

On Wed, Jan 23, 2013 at 02:20:21AM +0100, Christoph Bumiller wrote:
> On 23.01.2013 02:07, Vadim Girlin wrote:
> > On 01/23/2013 04:42 AM, Christoph Bumiller wrote:
> >> On 23.01.2013 01:21, Vadim Girlin wrote:
> >>> On 01/23/2013 03:59 AM, Vincent Lejeune wrote:
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: Vadim Girlin <vadimgirlin at gmail.com>
> >>>>> To: Christoph Bumiller <e0425955 at student.tuwien.ac.at>
> >>>>> Cc: mesa-dev at lists.freedesktop.org
> >>>>> Sent: Wednesday, January 23, 2013 0:44
> >>>>> Subject: Re: [Mesa-dev] [PATCH] glsl_to_tgsi: indirect array
> >>>>> information
> >>>>>
> >>>>> On 01/22/2013 10:59 PM, Christoph Bumiller wrote:
> >>>>>>    On 21.01.2013 21:10, Vadim Girlin wrote:
> >>>>>>>    Provide the information about indirectly addressable arrays (ranges of
> >>>>>>>    temps) in the shader to the drivers. TGSI representation itself isn't
> >>>>>>>    modified, array information is passed as an additional data in the
> >>>>>>>    pipe_shader_state, so the drivers can use it as a hint for optimization.
> >>>>>>>    ---
> >>>>>>>
> >>>>>>>    It's far from being an ideal solution, but I saw the discussions about
> >>>>>>>    that problem starting from 2009 IIRC, and we still have no solution
> >>>>>>>    (neither good nor bad) despite the years passed. I hope we can use this
> >>>>>>>    not very intrusive approach until we get something better.
> >>>>>>>
> >>>>>>
> >>>>>>    I'd rather not have any hacks in the interface, let alone ones that
> >>>>>>    solve the problem only partially (you still won't know which array is
> >>>>>>    accessed by a particular instruction, which is important for
> >>>>>>    optimization and essential in some cases for making INPUT/OUTPUT arrays
> >>>>>>    work), and not just because it reduces the pressure on people to
> >>>>>>    implement a proper solution.
> >>>>>>
> >>>>>>    With this, you just get to know which range of TEMPs are indirectly
> >>>>>>    addressed and which ones are not, and you can do the same by simply
> >>>>>>    creating multiple declarations of TEMPs, one for each array, and adding
> >>>>>>    a single bit of info to tgsi_declaration (which has 7 bits of padding
> >>>>>>    anyway, so ample space), which is a lot less ugly, and doesn't suffer
> >>>>>>    from an arbitrary limit, and doesn't require any modification of
> >>>>>>    drivers either.
> >>>>>>
> >>>>>
> >>>>> The array accessed by any indirect operand can be identified by the
> >>>>> immediate offset, e.g. TEMP[ADDR[0].x+1] implies the array starting from
> >>>>> 1, thus we can find its entry in the information provided by this patch
> >>>>> to get the addressable range for every indirect operand. If I'm not
> >>>>> missing something, glsl_to_tgsi accumulates all other parts of the
> >>>>> offset in the address register before the indirect access. If I'm wrong,
> >>>>> we can fix it to ensure such behavior.
> >>>>
> >>>> I'm not sure about that; when I worked on indirect addressing of const
> >>>> memory, I discovered when tracking vp/fo regression that the immediate
> >>>> offset is the result of glsl_to_tgsi constant propagation and not related
> >>>> to the underlying array.
> >>>> This means that the dynamic index can be negative, which is not always
> >>>> desirable depending on the hw. (In the R600 case, the const fetch
> >>>> instruction does not support a negative index. The MOVA inst does.)
> >>>>
> >>>> For instance, the following pseudo code snippet is fine for an index
> >>>> value of -4:
> >>>>
> >>>> uniform int index;
> >>>>
> >>>> float array[4];
> >>>> float data = array[6 + index];
> >>>>
> >>>> and is lowered to
> >>>> MOV TEMP[0] TEMP[ADDR[0].x + 6];
> >>>>
> >>>
> >>> I tried the following shader:
> >>>
> >>>> uniform int index;
> >>>>
> >>>> void main()
> >>>> {
> >>>>      float array[4] = float[4](0.1, 0.2, 0.3, 0.4);
> >>>>      float data = array[6 + index];
> >>>>      gl_FragColor = vec4(data, 1.0, 0.0, 1.0);
> >>>> }
> >>>
> >>> Resulting TGSI:
> >>>
> >>>> --------------------------------------------------------------
> >>>> FRAG
> >>>> PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
> >>>> DCL OUT[0], COLOR
> >>>> DCL CONST[0]
> >>>> DCL TEMP[0], LOCAL
> >>>> DCL TEMP[1], LOCAL
> >>>> DCL TEMP[2], LOCAL
> >>>> DCL TEMP[3], LOCAL
> >>>> DCL TEMP[4], LOCAL
> >>>> DCL TEMP[5], LOCAL
> >>>> DCL TEMP[6], LOCAL
> >>>> DCL TEMP[7], LOCAL
> >>>> DCL ADDR[0]
> >>>> IMM[0] FLT32 {    0.1000,     0.2000,     0.3000,     0.4000}
> >>>> IMM[1] FLT32 {    1.0000,     0.0000,     0.0000,     0.0000}
> >>>> IMM[2] INT32 {6, 0, 0, 0}
> >>>>    0: MOV TEMP[1].yzw, IMM[1].yxyx
> >>>>    1: MOV TEMP[2], IMM[0].xxxx
> >>>>    2: MOV TEMP[3], IMM[0].yyyy
> >>>>    3: MOV TEMP[4], IMM[0].zzzz
> >>>>    4: MOV TEMP[5], IMM[0].wwww
> >>>>    5: UADD TEMP[6].x, IMM[2].xxxx, CONST[0].xxxx
> >>>>    6: UARL ADDR[0].x, TEMP[6].xxxx
> >>>>    7: MOV TEMP[1].x, TEMP[ADDR[0].x+2].xxxx
> >>>>    8: MOV_SAT OUT[0], TEMP[1]
> >>>>    9: END
> >>>> --------------------------------------------------------------
> >>>
> >>> Also I tried the following:
> >>>
> >>>> uniform float array[4];
> >>>> uniform int index;
> >>>>
> >>>> void main()
> >>>> {
> >>>>      float data = array[6 + index];
> >>>>      gl_FragColor = vec4(data, 1.0, 0.0, 1.0);
> >>>> }
> >>>
> >>> Resulting TGSI:
> >>>
> >>>> --------------------------------------------------------------
> >>>> FRAG
> >>>> PROPERTY FS_COLOR0_WRITES_ALL_CBUFS 1
> >>>> DCL OUT[0], COLOR
> >>>> DCL CONST[0..4]
> >>>> DCL TEMP[0], LOCAL
> >>>> DCL TEMP[1], LOCAL
> >>>> DCL ADDR[0]
> >>>> IMM[0] FLT32 {    1.0000,     0.0000,     0.0000,     0.0000}
> >>>> IMM[1] INT32 {6, 0, 0, 0}
> >>>>    0: MOV TEMP[0].yzw, IMM[0].yxyx
> >>>>    1: UADD TEMP[1].x, IMM[1].xxxx, CONST[0].xxxx
> >>>>    2: UARL ADDR[0].x, TEMP[1].xxxx
> >>>>    3: MOV TEMP[0].x, CONST[ADDR[0].x+1].xxxx
> >>>>    4: MOV_SAT OUT[0], TEMP[0]
> >>>>    5: END
> >>>> --------------------------------------------------------------
> >>>
> >>> So far the immediate offset in the indirect operand is always equal to the
> >>> start offset of the array. Could you please provide a more complete example
> >>> that demonstrates the problem?
> >>>
> >>> Vadim
> >>>
> >>
> >> Not really, because shaders like
> >>
> >> float array[8];
> >>
> >> uniform int pos;
> >>
> >> void main()
> >> {
> >>     array[0] = 1.0;
> >>     array[1] = 2.0;
> >>     array[2] = 3.0;
> >>     array[3] = 4.0;
> >>     gl_FragColor = vec4(array[pos - 16],
> >>                         array[pos - 17],
> >>                         array[pos - 18],
> >>                         array[pos - 19]);
> >> }
> >>
> >> yield the terribly unoptimized
> >>
> >>    0: MOV TEMP[1].x, IMM[0].xxxx
> >>    1: MOV TEMP[2].x, IMM[0].yyyy
> >>    2: MOV TEMP[3].x, IMM[0].zzzz
> >>    3: MOV TEMP[4].x, IMM[0].wwww
> >>    4: UADD TEMP[9].x, CONST[0].xxxx, IMM[1].xxxx
> >>    5: UARL ADDR[0].x, TEMP[9].xxxx
> >>    6: MOV TEMP[10].x, TEMP[ADDR[0].x+1].xxxx
> >>    7: UADD TEMP[11].x, CONST[0].xxxx, IMM[1].yyyy
> >>    8: UARL ADDR[0].x, TEMP[11].xxxx
> >>    9: MOV TEMP[10].y, TEMP[ADDR[0].x+1].xxxx
> >>   10: UADD TEMP[12].x, CONST[0].xxxx, IMM[1].zzzz
> >>   11: UARL ADDR[0].x, TEMP[12].xxxx
> >>   12: MOV TEMP[10].z, TEMP[ADDR[0].x+1].xxxx
> >>   13: UADD TEMP[13].x, CONST[0].xxxx, IMM[1].wwww
> >>   14: UARL ADDR[0].x, TEMP[13].xxxx
> >>   15: MOV TEMP[10].w, TEMP[ADDR[0].x+1].xxxx
> >>   16: MOV OUT[0], TEMP[10]
> >>   17: END
> >>
> >> instead of simply adjusting the offset and NOT emitting tons of ARLs.
> >> But this is NOT guaranteed behaviour and neither should it be (I did
> >> suggest that in the past, but some people disagreed and they convinced
> >> me).
> >>
> > 
> > I agree that it's terribly unoptimized, but it doesn't change the fact
> > that currently we can use immediate offset to match it to the array
> > info. Is anybody going to optimize this tomorrow to break this patch?
> > 
> 
> If you intend to mandate this behaviour so that drivers can rely on it,
> you forgot to place a big fat warning somewhere shader authors can see
> it. Having this kind of behaviour and in addition leaving it
> undocumented is unacceptable.
> 
> > We can discuss it forever though, and it probably doesn't make sense.
> > This discussion won't be any different from the previous discussions of
> > that problem - the result will be the same - we'll have no working solution.
> > 
> 
> I prefer no solution at all over a bad one that makes it easier for
> people to ignore the issue and thus increase the probability of this
> never being fixed.
> 
> Indirect TEMPs aren't that common, you'll survive until there is a proper
> solution.
> (And even when they're present, a lot of the direct TEMP accesses will
> likely be saved by a pass that promotes memory access to registers.)
> 
> Doesn't anyone work on gallium+compiler full time anymore who can do
> this properly?
> 

I think it would be helpful if we could agree on a solution and document
that decision so that someone who was interested in implementing this
correctly would know what to do and not have to worry about wasting time
on something that wouldn't be accepted.

So far we have 4 proposed solutions:

1. Pass additional array information in pipe_shader_state

   http://lists.freedesktop.org/archives/mesa-dev/2013-January/033111.html

2. Store all values that may be accessed by indirect addressing in
   the TGSI_FILE_TEMPORARY_ARRAY file.

   http://lists.freedesktop.org/archives/mesa-dev/2012-November/029575.html

3. Use temporary register range declaration to distinguish between array
   objects and then identify the objects using a constant indirect
   offset.

   http://lists.freedesktop.org/archives/mesa-dev/2012-November/029764.html

4. Clearly define arrays in the TGSI declarations with a numeric
   identifier.

   http://lists.freedesktop.org/archives/mesa-dev/2012-November/030476.html
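
To make the difference between #3 and #4 concrete, here is a rough Python
sketch. It is a hypothetical model, not Mesa code: the function names, the
(first, last) range tuples and the numeric id 1 are made up for illustration,
and matching the offset by range containment is only my reading of #3. It
reuses the TEMP[2..5] layout from the first TGSI dump quoted above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: an array is an inclusive (first, last) range of TEMP indices.
Range = Tuple[int, int]


def find_array_by_offset(arrays: List[Range], imm_offset: int) -> Optional[Range]:
    """Idea behind #3 (as I read it): match the operand's constant offset to a range."""
    for first, last in arrays:
        if first <= imm_offset <= last:
            return (first, last)
    return None  # the offset does not fall inside any declared array


def find_array_by_id(arrays_by_id: Dict[int, Range], array_id: int) -> Optional[Range]:
    """Idea behind #4: the operand names its array explicitly by numeric id."""
    return arrays_by_id.get(array_id)


# float array[4] stored in TEMP[2..5], as in the quoted dump.
temp_arrays = [(2, 5)]
temp_arrays_by_id = {1: (2, 5)}  # id 1 is an arbitrary, assumed identifier

# TEMP[ADDR[0].x+2]: the offset is the array start, so the lookup succeeds.
print(find_array_by_offset(temp_arrays, 2))
# Hypothetical case where the constant 6 is folded into the offset
# (TEMP[ADDR[0].x+8]): the offset no longer points into the array.
print(find_array_by_offset(temp_arrays, 8))
# With an explicit id the offset does not matter for identification.
print(find_array_by_id(temp_arrays_by_id, 1))
```

The offset lookup only works while the immediate offset stays inside the
array's declared range, which is the behaviour Christoph points out is not
guaranteed. An explicit id does not depend on it.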

It doesn't seem like we can come to a consensus on any one
implementation, so let's indicate our preferences like this:

+ Identify your preferred solution
+ Identify all solutions that would be acceptable to you, even if you
  think they have some flaws.

Maybe this will help us choose a solution (if it doesn't, then we
can just ask Brian to decide).

For me, I prefer solution #4 even though it is the most work, but
solution #3 would be acceptable to me.

What does everyone else think?

-Tom