[Mesa-dev] RFC: TGSI scalar arrays

Wed Mar 20 09:05:21 PDT 2013

On 20.03.2013 15:41, Christoph Bumiller wrote:
> Sorry, this has become longer than I anticipated ...
> 
> I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays.
> I need them because I cannot allocate varyings in the same order that
> the register index specifies:
> 
> ===
> EXAMPLE:
> OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
> output space
> OUT[1], CLIPDIST[0], 0x2d0
> OUT[2], GENERIC[0], between 0x80 and 0x280
> OUT[3], GENERIC[1], between 0x80 and 0x280
> 
> And without an array specification,
> MOV OUT[TEMP[0].x-1], IMM[0]
> would leave me no clue as to whether to use 0x80 or 0x2c0 as base address.
> ===
> 
> Now that I'm at it, I'm considering going a step further and adding
> indirect scalar/component access.
> This is motivated by float gl_ClipDistance[], which, if accessed
> indirectly, currently leaves us no choice but to generate code like this:
> 
> if ((index & 3) == 0) access x component; else
> if ((index & 3) == 1) access y component; ...
> 
> This is undesirable and the hardware can do better (as it actually
> supports accessing individual components since address registers contain
> an address in bytes and we can do scalar read/write).
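
To make the difference concrete, here is a minimal C sketch (not driver code; the helper names and the 0x2c0 base used in the note below are only illustrative assumptions). It contrasts the vec4 register/component view, which forces the if-chain, with the byte address the hardware could consume directly:

```c
#include <assert.h>

/* Hypothetical helper: split a scalar array index into the vec4 register
 * and component it lands in. Selecting the component at run time is what
 * forces the "if ((index & 3) == n)" chain quoted above. */
static void vec4_slot(unsigned index, unsigned *reg, unsigned *comp)
{
    assert(reg && comp);
    *reg = index >> 2;   /* four components per vec4 register */
    *comp = index & 3;   /* 0 = x, 1 = y, 2 = z, 3 = w */
}

/* Hypothetical helper: byte address of the same element when the address
 * register holds bytes and scalar access is possible. `base` is assumed
 * to be the byte address of element 0. */
static unsigned scalar_byte_address(unsigned base, unsigned index)
{
    return base + index * 4u;   /* 4 bytes per 32-bit component */
}
```

With a base of 0x2c0, index 5 maps to register 1, component y, or simply to byte address 0x2d4, with no branching on the component.
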
> 
> A second motivation is varying packing, which is required by the GL
> spec, and may lead to use of TEMP arrays, which, albeit improved now,
> will impair performance when used (on nv50 they go to uncached memory
> which is very slow).
> 
> That case occurs if, for instance, a varying float[8] is accessed
> indirectly and has to be packed into
> OUT[0..1].xyzw, GENERIC[0..1]
> instead of
> OUT[0..7].x, GENERIC[0..7]
> 
> So far I've come up with 2 choices (all available only if the driver
> supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):
> 
> 
> 1. SCALAR DECLARATIONS
> 
> Using float gl_ClipDistance[8] as example, it could be declared as:
> 
> OUT[0..7].x, CLIPDIST, ARRAY(1) where the .x now means that it's a
> single component per OUT[index]
> 
> Now this obviously means that a single OUT[i] doesn't always consume 16
> bytes / 4 components anymore, which may be somewhat disturbing, since
> the address of an output can no longer be inferred solely from its
> index.
> However, that doesn't really constitute a problem if all access is
> either direct or comes with an ARRAY() reference.
> 
> For varying packing, which happens only for user defined variables, and
> hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:
> 
> (NOTE: GL requires us to be able to support exactly the number of
> components we report; failing due to alignment is not allowed. Hence the
> GLSL compiler may put some variables at unaligned locations, see
> ir_variable.location_frac):
> 
> A GENERIC semantic index should always cover 4 components so that a
> fixed location can be assigned for it (drivers usually do this since it
> makes an extra dynamic linkage pass when shaders are changed
> unnecessary, as intended by GL_ARB_separate_shader_objects).
> 
> So, this would be valid:
> OUT[0..3].x, GENERIC[0]
> OUT[4..5].xy, GENERIC[1]
> OUT[6], GENERIC[2]
> Note how 3 OUT[indices] only consume 1 GENERIC[index].
> 
> If we instead allocated one semantic index per register index rather
> than per 4 components, we would have:
> OUT[0..3].x, GENERIC[0]
> OUT[4..5].xy, GENERIC[4]
> OUT[6], GENERIC[6]
> This would >waste space<, since GENERIC[4,6] would have to go to
> output_space[addresses 0x40, 0x60] so it could link with
> IN[6], GENERIC[6]
> where we have no information about the size of GENERIC[0 .. 5], and
> wasting space like that means the advertised number of varying
> components cannot be satisfied.
> 
> 
> And as a last step, if varyings are placed at non-vec4 boundaries, we
> would have to be able to specify fractional semantic indices, like this:
> OUT[0..2].x, GENERIC[0].x
> OUT[3].x, GENERIC[0].w
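
The semantic assignments in the examples above can be reproduced with a running component count. A rough C sketch, assuming declarations are packed back to back in declaration order (struct out_decl and all three helpers are made up for illustration; none of this is existing Mesa/TGSI API):

```c
#include <assert.h>

/* Hypothetical description of one declaration such as OUT[4..5].xy */
struct out_decl {
    unsigned first, last;   /* register index range, inclusive */
    unsigned ncomp;         /* components per register, 1..4 */
};

/* Packed component offset at which declaration `which` starts, assuming
 * all earlier declarations are laid out back to back. */
static unsigned packed_start(const struct out_decl *d, unsigned which)
{
    unsigned i, comps = 0;
    for (i = 0; i < which; i++)
        comps += (d[i].last - d[i].first + 1) * d[i].ncomp;
    return comps;
}

/* Semantic index and fractional part (0 = .x ... 3 = .w) of that offset. */
static void packed_semantic(const struct out_decl *d, unsigned which,
                            unsigned *index, unsigned *frac)
{
    unsigned start = packed_start(d, which);
    assert(index && frac);
    *index = start / 4;
    *frac = start % 4;
}

/* Alternative scheme: one semantic index per register index, so the byte
 * address follows from the first register index, 16 bytes per vec4 slot. */
static unsigned per_register_address(const struct out_decl *d, unsigned which)
{
    return d[which].first * 16u;
}
```

For OUT[0..3].x, OUT[4..5].xy, OUT[6] the packed scheme yields GENERIC[0], GENERIC[1] and GENERIC[2], while the per-register scheme puts the last two at 0x40 and 0x60. For OUT[0..2].x followed by OUT[3].x, the second declaration starts at GENERIC[0].w.
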
> 
> 
> 
> 2. SCALAR ADDRESS REGISTER VALUES
> 
> All this can be avoided by always declaring full vec4s, and adding the
> possibility of doing indirect addressing on a per-component basis:
> 
> varying float a[4] becomes:
> uniform int i;
> a[i+5] = 999 becomes:
> 
> OUT[0].xyzw, ARRAY(1)
> UARL_SCALAR ADDR[0].x, CONST[0].xxxx
> MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
> 
> The only difficulty with this is that we have to split TGSI
> instructions that access unaligned vectors:
> (NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
> may have to be packed).
> 
> With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
> single vec2 would look like this:
> OUT[0..3].xyz, GENERIC[0].x
> OUT[4..5].xyz, GENERIC[3].x
> OUT[6].xy, GENERIC[4].zw
> and we could still do:
> ADD OUT[5].xyz, TEMP[0], TEMP[1]
> 
> Now, these would have to be merged and declared as:
> OUT[0..4].xyzw
> 
> and the 2nd vec3 would be { OUT[0].w, OUT[1].xy }
> 
> instead of simply OUT[1].xyz
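
The splitting this requires is mechanical. A hedged C sketch, assuming writemask bits x = 1, y = 2, z = 4, w = 8 and a packed component offset as the starting position (struct vec4_piece and split_at_vec4 are illustrative names, not existing glsl-to-tgsi code):

```c
#include <assert.h>

/* One piece of a split access: a vec4 register plus a writemask. */
struct vec4_piece {
    unsigned reg;
    unsigned writemask;   /* bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = w */
};

/* Split an access to `width` consecutive components, starting at packed
 * component offset `start`, into pieces that each stay within one vec4.
 * Returns the number of pieces written to `out` (1 or 2 for width <= 4). */
static unsigned split_at_vec4(unsigned start, unsigned width,
                              struct vec4_piece out[2])
{
    unsigned n = 0;
    assert(width >= 1 && width <= 4);
    while (width > 0) {
        unsigned comp = start & 3;      /* component within the vec4 */
        unsigned take = 4 - comp;       /* room left in this vec4 */
        if (take > width)
            take = width;
        out[n].reg = start >> 2;
        out[n].writemask = ((1u << take) - 1u) << comp;
        n++;
        start += take;
        width -= take;
    }
    return n;
}
```

For the ADD OUT[5].xyz example, the second element of the second vec3 array starts at packed component 12 + 3 = 15, so the write splits into OUT[3].w and OUT[4].xy in the merged OUT[0..4] declaration.
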
> 
> A problem with this is that the GLSL compiler, while it can do the
> packing into vec4s and splitting up access, cannot, iirc, access
> individual components of a vec4 indirectly like TGSI would be able to.
> To avoid TEMP arrays we'd have to disable the last phase of varying
> packing (that actually converts the code to using vec4s).
> It would still be able to assign fractional locations to guarantee that
> linkage works, but glsl-to-tgsi would likely have to split access at
> vec4 boundaries itself (more work), and declare the whole packed range
> as a single TGSI array.
> However, assuming that varyings with the *same* semantic can always be
> assigned to contiguous slots (output memory space locations) by the
> driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
> varyings), the problem in the example at the top shouldn't arise, and
> we're able to group all those into a single array.
> 
> 
> Now, I hope someone was able to get through this and would like to
> comment :)

Not sure I fully understand this, but I'm thinking "whenever in doubt,
use something close to what dx10 does" since that's likely going to work
reasonably well with different hw. Maybe declaring those special values
differently (not just as output reg) would help?

Roland