[Mesa-dev] RFC: TGSI scalar arrays

Wed Mar 20 10:30:43 PDT 2013

On 20.03.2013 17:46, Christoph Bumiller wrote:
> On 20.03.2013 17:05, Roland Scheidegger wrote:
>> On 20.03.2013 15:41, Christoph Bumiller wrote:
>>> Sorry, this has become longer than I anticipated ...
>>>
>>> I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays
>>> because, since I cannot allocate varyings in the same order that the
>>> register index specifies, I need it:
>>>
>>> ===
>>> EXAMPLE:
>>> OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
>>> output space
>>> OUT[1], CLIPDIST[0], 0x2d0
>>> OUT[2], GENERIC[0], between 0x80 and 0x280
>>> OUT[3], GENERIC[1], between 0x80 and 0x280
>>>
>>> And without array specification
>>> MOV OUT[TEMP[0].x-1], IMM[0]
>>> would leave me no clue as to whether to use 0x80 or 0x2c0 as base
>>> address.
>>> ===
>>>
>>> Now that I'm at it, I'm considering going a step further, which is
>>> adding indirect scalar/component access.
>>> This is motivated by float gl_ClipDistance[], which, if accessed
>>> indirectly, currently leaves us no choice but to generate code like this:
>>>
>>> if ((index & 3) == 0) access x component; else
>>> if ((index & 3) == 1) access y component; ...
>>>
>>> This is undesirable and the hardware can do better (as it actually
>>> supports accessing individual components since address registers contain
>>> an address in bytes and we can do scalar read/write).
>>>
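To make the contrast above concrete, here is a minimal C sketch of the two access styles. It is only a model under stated assumptions: the helper names are made up for illustration, outputs are modelled as an array of vec4 slots, and a scalar is assumed to be 4 bytes so that a byte address is simply base + index * 4. It is not taken from any driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: outputs are vec4 slots, 4 floats each. */

/* vec4-granular indirect store: the component has to be selected
 * with a compare chain on (index & 3), as in the quoted pseudocode. */
static void store_component_chain(float out[][4], unsigned index, float v)
{
    float *slot = out[index >> 2];      /* which vec4 register */

    if ((index & 3) == 0)
        slot[0] = v;                    /* .x */
    else if ((index & 3) == 1)
        slot[1] = v;                    /* .y */
    else if ((index & 3) == 2)
        slot[2] = v;                    /* .z */
    else
        slot[3] = v;                    /* .w */
}

/* Byte-addressed scalar access: the address register holds bytes,
 * so one multiply-add replaces the whole chain. The 4-byte scalar
 * size is an assumption of this sketch. */
static uint32_t scalar_byte_address(uint32_t base, uint32_t index)
{
    return base + index * 4u;
}
```

With a base of 0x2c0 (used here only as an illustrative value), scalar index 5 would land at 0x2d4, which is the .y component of the second vec4.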
>>> A second motivation is varying packing, which is required by the GL
>>> spec, and may lead to use of TEMP arrays, which, albeit improved now,
>>> will impair performance when used (on nv50 they go to uncached memory
>>> which is very slow).
>>>
>>> That case occurs if, for instance, a varying float[8] is accessed
>>> indirectly and has to be packed into
>>> OUT[0..1].xyzw, GENERIC[0..1]
>>> instead of
>>> OUT[0..7].x, GENERIC[0..7]
>>>
>>> So far I've come up with 2 choices (all available only if the driver
>>> supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):
>>>
>>>
>>> 1. SCALAR DECLARATIONS
>>>
>>> Using float gl_ClipDistance[8] as example, it could be declared as:
>>>
>>> OUT[0..7].x, CLIPDIST, ARRAY(1) where the .x now means that it's a
>>> single component per OUT[index]
>>>
>>> Now this obviously means that a single OUT[i] doesn't always consume 16
>>> bytes / 4 components anymore, which may be somewhat disturbing, since
>>> the address of an output can no longer be inferred directly from its
>>> index alone.
>>> However, that doesn't really constitute a problem if all access is
>>> either direct or comes with an ARRAY() reference.
>>>
>>> For varying packing, which happens only for user defined variables, and
>>> hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:
>>>
>>> (NOTE: GL requires us to be able to support exactly the number of
>>> components we report; failing due to alignment is not allowed. Hence the
>>> GLSL compiler may put some variables at unaligned locations, see
>>> ir_variable.location_frac):
>>>
>>> A GENERIC semantic index should always cover 4 components so that a
>>> fixed location can be assigned for it (drivers usually do this since it
>>> makes an extra dynamic linkage pass when shaders are changed
>>> unnecessary, as intended by GL_ARB_separate_shader_objects).
>>>
>>> So, this would be valid:
>>> OUT[0..3].x, GENERIC[0]
>>> OUT[4..5].xy, GENERIC[1]
>>> OUT[6], GENERIC[2]
>>> Note how 3 OUT[indices] only consume 1 GENERIC[index].
>>>
>>> If we instead allocated a semantic index per register index rather
>>> than per 4 components, we would have:
>>> OUT[0..3].x, GENERIC[0]
>>> OUT[4..5].xy, GENERIC[4]
>>> OUT[6], GENERIC[6]
>>> This would >waste space<, since GENERIC[4,6] would have to go to
>>> output_space[addresses 0x40, 0x60] so it could link with
>>> IN[6], GENERIC[6]
>>> where we have no information about the size of GENERIC[0 .. 5], and
>>> wasting space like that means the advertised number of varying
>>> components cannot be satisfied.
>>>
>>>
>>> And as a last step, if varyings are placed at non-vec4 boundaries, we
>>> would have to be able to specify fractional semantic indices, like this:
>>> OUT[0..2].x, GENERIC[0].x
>>> OUT[3].x, GENERIC[0].w
>>>
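The per-4-components numbering and the fractional semantic indices described above boil down to dividing a running scalar offset by 4. The C sketch below shows that arithmetic only; the struct and function names are hypothetical, and it assumes the declarations are packed tightly in declaration order, which the proposal does not mandate.

```c
#include <assert.h>

/* Hypothetical record for a declaration OUT[first..last] that uses
 * ncomp components per register index. */
struct out_decl {
    unsigned first, last;
    unsigned ncomp;
};

/* Where a declaration starts in semantic space. */
struct sem_loc {
    unsigned index;   /* GENERIC[index] */
    unsigned frac;    /* 0 = .x, 1 = .y, 2 = .z, 3 = .w */
};

/* Assign (possibly fractional) semantic locations, assuming tight
 * packing in declaration order. */
static void assign_packed_semantics(const struct out_decl *decls,
                                    unsigned n, struct sem_loc *locs)
{
    unsigned offset = 0;   /* running offset in scalar components */

    for (unsigned i = 0; i < n; i++) {
        assert(decls[i].ncomp >= 1 && decls[i].ncomp <= 4);
        locs[i].index = offset / 4;
        locs[i].frac = offset % 4;
        offset += (decls[i].last - decls[i].first + 1) * decls[i].ncomp;
    }
}
```

Fed with OUT[0..2].x followed by OUT[3].x, this gives GENERIC[0].x and GENERIC[0].w, matching the fractional example. Fed with OUT[0..3].x, OUT[4..5].xy and a full OUT[6], it gives GENERIC[0], GENERIC[1] and GENERIC[2].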
>>>
>>>
>>> 2. SCALAR ADDRESS REGISTER VALUES
>>>
>>> All this can be avoided by always declaring full vec4s, and adding the
>>> possibility of doing indirect addressing on a per-component basis:
>>>
>>> varying float a[4] becomes:
>>> uniform int i;
>>> a[i+5] = 999 becomes:
>>>
>>> OUT[0].xyzw, ARRAY(1)
>>> UARL_SCALAR ADDR[0].x, CONST[0].xxxx
>>> MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
>>>
>>> The only difficulty with this is that we have to split up TGSI
>>> instructions that access unaligned vectors:
>>> (NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
>>> may have to be packed).
>>>
>>> With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
>>> single vec2 would look like this:
>>> OUT[0..3].xyz, GENERIC[0].x
>>> OUT[4..5].xyz, GENERIC[3].x
>>> OUT[6].xy, GENERIC[4].zw
>>> and we could still do:
>>> ADD OUT[5].xyz, TEMP[0], TEMP[1]
>>>
>>> Now, these would have to be merged and declared as:
>>> OUT[0..4].xyzw
>>>
>>> and the 2nd vec3 would be { OUT[0].w, OUT[1].xy }
>>>
>>> instead of simply OUT[1].xyz
>>>
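The splitting at vec4 boundaries can be sketched in C as follows. This is a rough model rather than glsl-to-tgsi code: the struct and function names are invented for illustration, the input is a scalar offset into the merged OUT[0..4].xyzw range, and the mask bits use the usual TGSI writemask encoding (x = 1, y = 2, z = 4, w = 8).

```c
#include <assert.h>

/* Result of splitting one access of up to 4 components. */
struct split_access {
    unsigned reg[2];    /* vec4 register indices touched */
    unsigned mask[2];   /* writemask within each register */
    unsigned num;       /* 1 if the access fits in one vec4, else 2 */
};

/* Split `count` consecutive components starting at scalar offset
 * `start` into at most two vec4-aligned pieces. */
static struct split_access split_at_vec4(unsigned start, unsigned count)
{
    struct split_access s = { {0, 0}, {0, 0}, 0 };
    unsigned first = start & 3;   /* first component inside reg[0] */
    unsigned in_first = (4 - first < count) ? 4 - first : count;

    assert(count >= 1 && count <= 4);

    s.reg[0] = start >> 2;
    s.mask[0] = ((1u << in_first) - 1) << first;
    s.num = 1;

    if (in_first < count) {
        /* The remainder spills into the low components of the
         * next register. */
        s.reg[1] = s.reg[0] + 1;
        s.mask[1] = (1u << (count - in_first)) - 1;
        s.num = 2;
    }
    return s;
}
```

For the layout above, the second element of the second vec3 array starts at scalar offset 15 and splits into OUT[3].w plus OUT[4].xy, while the vec2 at offset 18 stays within OUT[4].zw.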
>>> A problem with this is that the GLSL compiler, while it can do the
>>> packing into vec4s and splitting up access, cannot, iirc, access
>>> individual components of a vec4 indirectly like TGSI would be able to.
>>> To avoid TEMP arrays we'd have to disable the last phase of varying
>>> packing (that actually converts the code to using vec4s).
>>> It would still be able to assign fractional locations to guarantee that
>>> linkage works, but glsl-to-tgsi would likely have to split access at
>>> vec4 boundaries itself (more work), and declare the whole packed range
>>> as a single TGSI array.
>>> However, assuming that varyings with the *same* semantic can always be
>>> assigned to contiguous slots (output memory space locations) by the
>>> driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
>>> varyings), the problem in the example at the top shouldn't arise, and
>>> we're able to group all those into a single array.
>>>
>>>
>>> Now, I hope someone was able to get through this and would like to
>>> comment :)
>> Not sure I fully understand this, but I'm thinking "whenever in doubt,
>> use something close to what dx10 does" since that's likely going to work
>> reasonably well with different hw. Maybe declaring those special values
>> differently (not just as output reg) would help?
> What DX10 does is make indirect access of varyings illegal. That's not
> possible with OpenGL ...

Hmm, I thought dcl_indexRange would be used for indirect access of varyings?

Roland