[Mesa-dev] Allowing the reading of outputs for some drivers

Tue Nov 15 05:52:20 PST 2011

----- Original Message -----
> On 11/14/2011 07:16 AM, Marek Olšák wrote:
> > On Mon, Nov 14, 2011 at 2:49 PM, Vadim
> > Girlin<vadimgirlin at gmail.com>  wrote:
> >> On Mon, 2011-11-14 at 05:22 -0800, Jose Fonseca wrote:
> >>>
> >>> ----- Original Message -----
> >>>> Hi,
> >>>>
> >>>> I found some problem with glsl_to_tgsi: remove_output_reads
> >>>> function.
> >>>> It's replacing outputs with temps, producing incorrect results
> >>>> with
> >>>> relative addressing. You can see it e.g. with
> >>>> "vs-varying-array-mat2-col-rd.shader_test". Here is a dump:
> >>>>
> >>>>> VERT
> >>>>> DCL IN[0]
> >>>>> DCL OUT[0], POSITION
> >>>>> DCL OUT[1], GENERIC[12]
> >>>>> DCL OUT[2], GENERIC[13]
> >>>>> DCL OUT[3], GENERIC[14]
> >>>>> DCL OUT[4], GENERIC[15]
> >>>>> DCL OUT[5], GENERIC[16]
> >>>>> DCL OUT[6], GENERIC[17]
> >>>>> DCL OUT[7], GENERIC[18]
> >>>>> DCL CONST[0..5]
> >>>>> DCL TEMP[0..1]
> >>>>> DCL ADDR[0]
> >>>>> IMM FLT32 {    1.0000,     2.0000,     3.0000,     4.0000}
> >>>>> IMM FLT32 {    5.0000,     6.0000,     7.0000,     8.0000}
> >>>>> IMM FLT32 {    9.0000,    10.0000,    11.0000,    12.0000}
> >>>>> IMM FLT32 {    0.0000,     1.0000,     0.0000,     0.0000}
> >>>>>    0: MUL TEMP[0], CONST[2], IN[0].xxxx
> >>>>>    1: MAD TEMP[1], CONST[3], IN[0].yyyy, TEMP[0]
> >>>>>    2: MAD TEMP[1], CONST[4], IN[0].zzzz, TEMP[1]
> >>>>>    3: MAD OUT[0], CONST[5], IN[0].wwww, TEMP[1]
> >>>>>    4: MOV OUT[2], IMM[0].xyyy
> >>>>>    5: MOV OUT[3], IMM[0].zwww
> >>>>>    6: MOV TEMP[0], IMM[1].xyyy
> >>>>
> >>>> OUT[2-7] is a "varying mat2x2[3] m;", OUT[4] is replaced with
> >>>> the
> >>>> temp
> >>>> in the instruction 6.
> >>>>
> >>>>>    7: MOV OUT[5], IMM[1].zwww
> >>>>>    8: MOV OUT[6], IMM[2].xyyy
> >>>>>    9: MOV OUT[7], IMM[2].zwww
> >>>>>   10: ARL ADDR[0].x, CONST[1].xxxx
> >>>>>   11: SNE TEMP[1], TEMP[ADDR[0].x].xyyy, CONST[0].xyyy
> >>>>
> >>>> Instruction 11 contains the read with the relative addressing
> >>>> using
> >>>> this temp, which is incorrect.
> >>>>
> >>>> I'm not sure how to fix it,
> >>>
> >>> The way to fix this is to allocate a consecutive range of temps
> >>> at start, when there are indirect writes to output registers and
> >>> at least one read.
> >>>
> >>>> but AFAICS at least for r600g this step
> >>>> could be skipped completely - r600 can read outputs without any
> >>>> problem, they are located in the general-purpose registers.
> >>>> Removing
> >>>> calls to remove_output_reads and assert(src.File !=
> >>>> TGSI_FILE_OUTPUT)
> >>>> in the ureg_emit_src produces correct result and test passes on
> >>>> evergreen (total number of fixed tests is about 60).
> >>>>
> >>>> Probably it makes sense to make this step optional and ask the
> >>>> driver
> >>>> whether to use it, if I'm not missing something?
> >>>
> >>> The drawback of doing this is that TGSI will look even more
> >>> different between drivers. This means that when somebody makes a
> >>> change to the state tracker, and tested with one driver, it may
> >>> break other drivers. This also means that comparing a driver to
> >>> llvmpipe/softpipe will be less meaningfull, as different paths
> >>> will be taken in the state tracker.
> >>>
> >>
> >> I think if it's needed to compare the drivers, than it's possible
> >> to
> >> switch of the the cap for debugging. I see the drawbacks, but this
> >> is
> >> also about performance. Currently with r600g (and possibly with
> >> other
> >> drivers) we have to spend some time for unneccessary shader
> >> modification, to get less efficient shader code as a result.
> >
> > I am 100% sure nobody will turn on/off CAPs just to test something.
> >
> > I gotta agree with José on this one.
> 
> I guess I don't follow.  Different hardware can do different things,
> and
> the code for that hardware will look different.  What's the problem?
>  It
> seems silly to spend CPU time rearranging the code and then hoping
> the
> driver will spend more CPU time to put it back the way it was.
>  People
> using these drivers *do* care about CPU performance, after all. :)

Developer time is important too. And having more code paths shared with other drivers (even at the expense of a few extra CPU cycles every time a shader is created) means that developers has more time to focus on features that can yield substantial improvements on true hotspots (e.g., every time a pixel is rendered).

This particular case may not be the best example. But there is a trade off: more specialization means more maintenance burden.

And merely sharing the C code is not enough: if one is not sharing the same code paths, then's not really sharing the code. It's looks like but it's not: it may be using an unique path to its driver, which means that other developers may never test it so it can get easily broken, and vice versa.

The gallium interface often makes such compromises.  And every now and then I see commits from Intel yanking some obscure optimization code path on i965 that's not worth trouble of maintaining, so I suppose you do to.

Jose