[Mesa-dev] RFC: TGSI scalar arrays

Wed Mar 20 07:41:58 PDT 2013

Sorry, this has become longer than I anticipated ...

I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays.
I need them because I cannot allocate varyings in the same order that
the register index specifies:

===
EXAMPLE:
OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
output space
OUT[1], CLIPDIST[0], 0x2d0
OUT[2], GENERIC[0], between 0x80 and 0x280
OUT[3], GENERIC[1], between 0x80 and 0x280

And without an array specification,
MOV OUT[TEMP[0].x-1], IMM[0]
would leave me no clue as to whether to use 0x80 or 0x2c0 as the base
address.
===
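
To make the ambiguity concrete, here is a minimal Python sketch (not Mesa
code) of how an indirect output access could be resolved once every
array carries its own base address. The ARRAYS table and the resolve()
helper are hypothetical. The CLIPDIST addresses are the ones from the
example above; the GENERIC base of 0x80 is an assumption (simply the
lowest address in the allowed range), as is the 16 bytes per register.

```python
VEC4_BYTES = 16  # assumption: one OUT[] register occupies a full vec4

# Hypothetical per-array layout: first register index, register count,
# and hardware base address.
ARRAYS = {
    1: {"first": 0, "count": 2, "base": 0x2C0},  # OUT[0..1], CLIPDIST
    2: {"first": 2, "count": 2, "base": 0x080},  # OUT[2..3], GENERIC (assumed base)
}

def resolve(array_id, reg_index):
    """Return the hardware byte address of OUT[reg_index] inside the given array."""
    arr = ARRAYS[array_id]
    offset = reg_index - arr["first"]
    if not 0 <= offset < arr["count"]:
        raise IndexError("register index outside of the declared array")
    return arr["base"] + offset * VEC4_BYTES
```

Without the array id, resolve() would have nothing to pick the base
address from, which is exactly the problem with the plain MOV above.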

Now that I'm at it, I'm considering going a step further and adding
indirect scalar/component access.
This is motivated by float gl_ClipDistance[], which, if accessed
indirectly, currently leaves us no choice but to generate code like this:

if ((index & 3) == 0) access x component; else
if ((index & 3) == 1) access y component; ...

This is undesirable, and the hardware can do better: address registers
contain an address in bytes and we can do scalar reads/writes, so it
can access individual components directly.
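
As a rough illustration of why a byte-granular address register removes
the need for the if-chain, here is a small Python sketch. It assumes
4-byte components and a tightly packed array; the function name is made
up for this example.

```python
COMPONENT_BYTES = 4  # assumption: one 32-bit float per component

def clipdist_byte_offset(index):
    """Byte offset of gl_ClipDistance[index] from the start of the array.

    With byte addressing the element index translates directly into an
    offset, so no per-component branching is needed.
    """
    return index * COMPONENT_BYTES
```

The result is the same offset one would get from selecting the vec4
(index >> 2) and then the component (index & 3) by hand.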

A second motivation is varying packing, which is required by the GL
spec, and may lead to use of TEMP arrays, which, albeit improved now,
will impair performance when used (on nv50 they go to uncached memory
which is very slow).

That case occurs if, for instance, a varying float[8] is accessed
indirectly and has to be packed into
OUT[0..1].xyzw, GENERIC[0..1]
instead of
OUT[0..7].x, GENERIC[0..7]
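
The difference between the two layouts can be sketched in a few lines of
Python. This is only an illustration; both helper names are hypothetical.

```python
def registers_needed(length, packed):
    """Number of OUT[] registers a float[length] array occupies."""
    # Packed: four floats share one vec4. Unpacked: one float per register (.x only).
    return (length + 3) // 4 if packed else length

def packed_location(element):
    """Map element i of a packed float array to (register index, component)."""
    reg, comp = divmod(element, 4)
    return reg, "xyzw"[comp]
```

For float[8] this gives 2 registers when packed versus 8 when not, and
an indirect access to element i has to select a component as well as a
register.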

So far I've come up with 2 choices (both available only if the driver
supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):

1. SCALAR DECLARATIONS

Using float gl_ClipDistance[8] as example, it could be declared as:

OUT[0..7].x, CLIPDIST, ARRAY(1) where the .x now means that it's a
single component per OUT[index]

Now this obviously means that a single OUT[i] doesn't always consume 16
bytes / 4 components anymore, which may be somewhat disturbing, since
the address of an output can no longer be inferred solely from its
index.
However, that doesn't really constitute a problem if all access is
either direct or comes with an ARRAY() reference.

For varying packing, which happens only for user defined variables, and
hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:

(NOTE: GL requires us to be able to support exactly the number of
components we report; failing due to alignment is not allowed. Hence the
GLSL compiler may put some variables at unaligned locations, see
ir_variable.location_frac):

A GENERIC semantic index should always cover 4 components so that a
fixed location can be assigned for it (drivers usually do this since it
makes an extra dynamic linkage pass when shaders are changed
unnecessary, as intended by GL_ARB_separate_shader_objects).

So, this would be valid:
OUT[0..3].x, GENERIC[0]
OUT[4..5].xy, GENERIC[1]
OUT[6], GENERIC[2]
Note how 3 OUT[indices] only consume 1 GENERIC[index].

If we allocated a semantic index per register index instead of per 4
components, we would have:
OUT[0..3].x, GENERIC[0]
OUT[4..5].xy, GENERIC[4]
OUT[6], GENERIC[6]
This would >waste space<, since GENERIC[4,6] would have to go to
output_space[addresses 0x40, 0x60] so it could link with
IN[6], GENERIC[6]
where we have no information about the size of GENERIC[0 .. 5], and
wasting space like that means the advertised number of varying
components cannot be satisfied.
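
The waste can be quantified with a small Python sketch. It assumes that
GENERIC[n] is placed at a fixed offset of n * 16 bytes from the start of
the generic region, which matches the 0x40 and 0x60 addresses mentioned
above; the helper names are made up.

```python
VEC4_BYTES = 16

def generic_address(semantic_index):
    """Fixed location of GENERIC[semantic_index], relative to the generic region."""
    return semantic_index * VEC4_BYTES

def space_spanned(semantic_indices):
    """Bytes of output space covered up to the end of the highest index used."""
    return generic_address(max(semantic_indices)) + VEC4_BYTES

# Both layouts carry the same 12 components (4 + 4 + 4), i.e. 48 bytes of data.
per_vec4 = space_spanned([0, 1, 2])      # semantic index per 4 components
per_register = space_spanned([0, 4, 6])  # semantic index per register index
```

Under these assumptions the first layout spans 48 bytes and the second
112 bytes for the same data.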

And as a last step, if varyings are placed at non-vec4 boundaries, we
would have to be able to specify fractional semantic indices, like this:
OUT[0..2].x, GENERIC[0].x
OUT[3].x, GENERIC[0].w

2. SCALAR ADDRESS REGISTER VALUES

All this can be avoided by always declaring full vec4s, and adding the
possibility of doing indirect addressing on a per-component basis:

For example, with
varying float a[4];
uniform int i;
the assignment a[i+5] = 999 would be declared and emitted as:

OUT[0].xyzw, ARRAY(1)
UARL_SCALAR ADDR[0].x, CONST[0].xxxx
MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
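
One way to read the MOV above, sketched in Python: the address register
counts components rather than vec4 registers, and the register offset
plus the write mask contribute the constant part. This interpretation of
UARL_SCALAR is an assumption, and the helper names are hypothetical.

```python
COMPONENTS = "xyzw"

def effective_component(addr_value, reg_offset, component):
    """Flat component index addressed by OUT[ADDR + reg_offset].component.

    Assumes ADDR holds a component count (scalar addressing), so the
    constant part is reg_offset * 4 plus the component's position.
    """
    return addr_value + reg_offset * 4 + COMPONENTS.index(component)

def to_register(flat_component):
    """Convert a flat component index back to (register index, component)."""
    reg, comp = divmod(flat_component, 4)
    return reg, COMPONENTS[comp]
```

With a register offset of 1 and the .y component, the constant part is
5, so an address register value of i addresses component i + 5.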

The only difficulty with this is that we have to split up TGSI
instructions that access unaligned vectors:
(NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
may have to be packed).

With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
single vec2 would look like this:
OUT[0..3].xyz, GENERIC[0].x
OUT[4..5].xyz, GENERIC[3].x
OUT[6].xy, GENERIC[4].zw
and we could still do:
ADD OUT[5].xyz, TEMP[0], TEMP[1]

Now, these would have to be merged and declared as:
OUT[0..4].xyzw

and the 2nd vec3 would be { OUT[0].w, OUT[1].xy }

instead of simply OUT[1].xyz
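
The splitting itself is mechanical. Below is a minimal Python sketch of
how an access to a tightly packed vector could be cut at vec4
boundaries. It is not taken from glsl-to-tgsi; split_access() is a
hypothetical helper and it assumes components are numbered consecutively
across the merged OUT[] range.

```python
def split_access(start_component, width):
    """Split `width` consecutive components into per-register pieces.

    Returns a list of (register index, write mask) pairs, one for each
    vec4 register that the access touches.
    """
    pieces = []
    comp = start_component
    end = start_component + width
    while comp < end:
        reg, first = divmod(comp, 4)
        last = min(end - reg * 4, 4)  # stop at the vec4 boundary or at the end
        pieces.append((reg, "xyzw"[first:last]))
        comp = reg * 4 + last
    return pieces
```

For instance, a 3-component vector starting at flat component 3
straddles OUT[0] and OUT[1] and is split into two pieces, while one
starting at component 0 stays within a single register.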

A problem with this is that the GLSL compiler, while it can do the
packing into vec4s and splitting up access, cannot, iirc, access
individual components of a vec4 indirectly like TGSI would be able to.
To avoid TEMP arrays we'd have to disable the last phase of varying
packing (that actually converts the code to using vec4s).
It would still be able to assign fractional locations to guarantee that
linkage works, but glsl-to-tgsi would likely have to split access at
vec4 boundaries itself (more work), and declare the whole packed range
as a single TGSI array.
However, assuming that varyings with the *same* semantic can always be
assigned to contiguous slots (output memory space locations) by the
driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
varyings), the problem in the example at the top shouldn't arise, and
we're able to group all those into a single array.

Now, I hope someone was able to get through this and would like to
comment :)
Thanks in advance,
Christoph