[Mesa-dev] [PATCH 0/2 v2] Add support for clip distances in Gallium

Thu Dec 15 11:55:27 PST 2011

----- Original Message -----
> On 15.12.2011 20:09, Jose Fonseca wrote:
> > ----- Original Message -----
> >> On 12/14/2011 12:58 AM, Ian Romanick wrote:
> >>> On 12/13/2011 01:25 PM, Jose Fonseca wrote:
> >>>>
> >>>> ----- Original Message -----
> >>>>> On 12/13/2011 03:09 PM, Jose Fonseca wrote:
> >>>>>> ----- Original Message -----
> >>>>>>> On 12/13/2011 12:26 PM, Bryan Cain wrote:
> >>>>>>>> On 12/13/2011 02:11 PM, Jose Fonseca wrote:
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> This is an updated version of the patch set I sent to the
> >>>>>>>>>> list
> >>>>>>>>>> a
> >>>>>>>>>> few
> >>>>>>>>>> hours
> >>>>>>>>>> ago.
> >>>>>>>>>> There is now a TGSI property called
> >>>>>>>>>> TGSI_PROPERTY_NUM_CLIP_DISTANCES
> >>>>>>>>>> that drivers can use to determine how many of the 8
> >>>>>>>>>> available
> >>>>>>>>>> clip
> >>>>>>>>>> distances
> >>>>>>>>>> are actually used by a shader.
> >>>>>>>>> Can't the info in TGSI_PROPERTY_NUM_CLIP_DISTANCES be
> >>>>>>>>> easily
> >>>>>>>>> derived from the shader, and queried through
> >>>>>>>>> src/gallium/auxiliary/tgsi/tgsi_scan.h ?
> >>>>>>>> No.  The clip distances can be indirectly addressed (there
> >>>>>>>> are
> >>>>>>>> up
> >>>>>>>> to 2
> >>>>>>>> of them in vec4 form for a total of 8 floats), which makes
> >>>>>>>> it
> >>>>>>>> impossible
> >>>>>>>> to determine which ones are used by analyzing the shader.
> >>>>>>> The description is almost complete. :)  The issue is that the
> >>>>>>> shader
> >>>>>>> may
> >>>>>>> declare
> >>>>>>>
> >>>>>>> out float gl_ClipDistance[4];
> >>>>>>>
> >>>>>>> the use non-constant addressing of the array.  The compiler
> >>>>>>> knows
> >>>>>>> that
> >>>>>>> gl_ClipDistance has at most 4 elements, but post-hoc analysis
> >>>>>>> would
> >>>>>>> not
> >>>>>>> be able to determine that.  Often the fixed-function hardware
> >>>>>>> (see
> >>>>>>> below) needs to know which clip distance values are actually
> >>>>>>> written.
> >>>>>> But don't all the clip distances written by the shader need to
> >>>>>> be
> >>>>>> declared?
> >>>>>>
> >>>>>> E.g.:
> >>>>>>
> >>>>>> DCL OUT[0], CLIPDIST[0]
> >>>>>> DCL OUT[1], CLIPDIST[1]
> >>>>>> DCL OUT[2], CLIPDIST[2]
> >>>>>> DCL OUT[3], CLIPDIST[3]
> >>>>>>
> >>>>>> therefore a trivial analysis of the declarations convey that?
> >>>>> No.  Clip distance is an array of up to 8 floats in GLSL, but
> >>>>> it's
> >>>>> represented in the hardware as 2 vec4s.  You can tell by
> >>>>> analyzing
> >>>>> the
> >>>>> declarations whether there are more than 4 clip distances in
> >>>>> use,
> >>>>> but
> >>>>> not which components the shader writes to.
> >>>>> TGSI_PROPERTY_NUM_CLIP_DISTANCES is the number of components in
> >>>>> use,
> >>>>> not
> >>>>> the number of full vectors.
> >>>> Lets imagine
> >>>>
> >>>>    out float gl_ClipDistance[6];
> >>>>
> >>>> Each a clip distance is a scalar float.
> >>>>
> >>>> Either all hardware represents the 8 clip distances as two 4
> >>>> vectors,
> >>>> and we do:
> >>>>
> >>>>    DCL OUT[0].xywz, CLIPDIST[0]
> >>>>    DCL OUT[1].xy, CLIPDIST[1]
> >>>>
> >>>> using the full range of struct tgsi_declaration::UsageMask [1]
> >>>> or
> >>>> we
> >>>> represent them as as scalars:
> >>>>
> >>>>    DCL OUT[0].x, CLIPDIST[0]
> >>>>    DCL OUT[1].x, CLIPDIST[1]
> >>>>    DCL OUT[2].x, CLIPDIST[2]
> >>>>    DCL OUT[3].x, CLIPDIST[3]
> >>>>    DCL OUT[4].x, CLIPDIST[4]
> >>>>    DCL OUT[5].x, CLIPDIST[5]
> >>>>
> >>>> If indirect addressing is allowed as I read bore, then maybe the
> >>>> later
> >>>> is better.
> >>> As far as I'm aware, all hardware represents it as the former,
> >>> and
> >>> we
> >>> have a lowering pass to fix-up the float[] accesses to be vec4[]
> >>> accesses.
> >> GeForce8+ = scalar architecture, no vectors, addresses are byte
> >> based,
> >> can access individual components just fine.
> > Ok. So we should avoid baking this vec4 assumption in TGSI
> > semantics.
> >
> >> Something like:
> >>
> >> gl_ClipDistance[i - 12] = some_value;
> >>
> >> DCL OUT[0].xyzw, POSITION
> >> DCL OUT[1-8].x, CLIPDIST[0-7]
> >>
> >> MOV OUT<1>[ADDR[0].x - 12].x, TEMP[0].xxxx
> >>         *              **
> >>
> >> *   - tgsi_dimension.Index specifying the base address by
> >> referencing
> >> a
> >> declaration
> >> **  - tgsi_src_register.Index
> >>
> >> is the only way I see to make this work nicely on all hardware.
> >> (This is also needed if OUT[i] and OUT[i + 1] cannot be assigned
> >> to
> >> contiguous hardware resources because of semantic.)
> > I think that having indexable temps, like D3D10, would be a
> > cleaner:
> 
> The problem is that we need an indexable version of every file then
> (at
> least INPUT, OUTPUT), and then all the nice 32 bit structs break down
> when we get more than 16 files.
> 
> D3D doesn't have these because indirect IN/OUT isn't allowed there,
> but
> it is in GL and the hardware can do it.

Indirect IN/OUT is allowed on D3D9 , http://msdn.microsoft.com/en-us/library/windows/desktop/bb172963%28v=vs.85%29.aspx , but it looks like SM4 indeed doens't allow,  http://msdn.microsoft.com/en-us/library/windows/desktop/ff471378%28v=VS.85%29.aspx , which means that indirect input needs spilling the inputs into a indexable temporary.

> Also, having an indexable version of every file seems odd, especially
> since we need a way to distinguish individual arrays inside that file
> anyway (just SM4 uses 2 indices to access INDEXABLE_TEMP; for INPUT
> we'll need 3 indices).

Fair enough.

> >   DCL OUT[0].xyzw, POSITION
> >   DCL OUT[1][0-7].x, CLIPDIST[0-7]
> >
> >   MOV OUT[1][ADDR[0].x - 12].x, TEMP[0].xxxx
> >
> > I propose we first add this new kind of temp at a first stage, then
> > prohibit indirect addressing of all but this kind of temps.
> 
> There's already TEMPORARY_ARRAY, but no one wants to use it because
> it's
> not clear how to distinguish individual arrays ...

There's a reference implementation in src/gallium/auxiliary/tgsi/tgsi_exec.c but I can't find any use of it outside.

I'm growing fonder to your proposal. Besides avoiding a new set of files, botching this into register ranges also allows a smoother transition.

Instead of 

  MOV OUT<1>[ADDR[0].x - 12].x, TEMP[0].xxxx

I think this syntax would be cleare:

  MOV OUT[1 + (ADDR[0].x - 12)].x, TEMP[0].xxxx

(no semantic change what so ever.

Another approach (instead of adding a new indexing term to the indirect addressing token) would be to force the indirect register batch to be the start of a range.

   DCL OUT[0].xyzw, POSITION
   DCL OUT[1][0-7].x, CLIPDIST[0-7]
   DCL IMM[0], {12} 

   SUB TEMP[2], TEMP[2], IMM[0].x
   ARR ADDR[0], TEMP[2]

   MOV OUT[ADDR[0].x + 1].x, TEMP[0].xxxx

but it does seem a bit convoluted...

> >> For constrained hardware the driver can build the clunky
> >>
> >> c := ADDR[0].x % 4
> >> i := ADDR[0].x / 4
> >> IF [c == 0]
> >>   MOV OUT[i].x, TEMP[0].xxxx
> >> ELSE
> >> IF [c == 1]
> >>   MOV OUT[i].y, TEMP[0].xxxx
> >> ELSE
> >> IF [c == 2]
> >>   MOV OUT[i].z, TEMP[0].xxxx
> >> ELSE
> >>   MOV OUT[i].w, TEMP[0].xxxx
> >> ENDIF
> >>
> >> itself.
> > Sounds good plan to me.
> >
> > BTW, I took a look at inputs/outputs UsageMasks and although we
> > don't use them, I really think we really should, as having that
> > info readily accessible would allow to avoid wasting
> > time/bandwidth copying attributes which are not needed.
> 
> Yes I'm already computing these masks in my TGSI parser, getting them
> from higher up would be nicer and potentially more reliable.

Good.

Jose