[Mesa-dev] RFC: TGSI scalar arrays

Wed Mar 20 09:46:03 PDT 2013

On 20.03.2013 17:05, Roland Scheidegger wrote:
> On 20.03.2013 15:41, Christoph Bumiller wrote:
>> Sorry, this has become longer than I anticipated ...
>>
>> I've been toying with adding support for TGSI_FILE_INPUT/OUTPUT arrays
>> because, since I cannot allocate varyings in the same order that the
>> register index specifies, I need it:
>>
>> ===
>> EXAMPLE:
>> OUT[0], CLIPDIST[1], must be allocated at address 0x2c0 in hardware
>> output space
>> OUT[1], CLIPDIST[0], 0x2d0
>> OUT[2], GENERIC[0], between 0x80 and 0x280
>> OUT[3], GENERIC[1], between 0x80 and 0x280
>>
>> And without array specification
>> MOV OUT[TEMP[0].x-1], IMM[0]
>> would leave me no clue as to whether to use 0x80 or 0x2c0 as base address.
>> ===
>>
>> Now that I'm on it, I'm considering going a step further, which is
>> adding indirect scalar/component access.
>> This is motivated by float gl_ClipDistance[], which, if accessed
>> indirectly, currently leaves us no choice but to generate code like this:
>>
>> if ((index & 3) == 0) access x component; else
>> if ((index & 3) == 1) access y component; ...
>>
>> This is undesirable and the hardware can do better (as it actually
>> supports accessing individual components since address registers contain
>> an address in bytes and we can do scalar read/write).
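
To make the contrast concrete, here is a minimal Python sketch (a model only, not driver code). It assumes a hypothetical output space of vec4 registers made of 32-bit components; the function names are made up for illustration. The first function mirrors the branch chain above, the second turns the scalar index into a byte address and does a single scalar read, which is roughly what the hardware's address registers allow.

```python
# Illustrative model only: an output space of 32-bit components,
# four components (16 bytes) per vec4 register.
BYTES_PER_COMPONENT = 4
COMPONENTS_PER_VEC4 = 4


def read_with_branch_chain(regs, index):
    """Select the vec4 first, then branch on (index & 3) per component."""
    vec = regs[index >> 2]
    if (index & 3) == 0:
        return vec[0]  # x
    if (index & 3) == 1:
        return vec[1]  # y
    if (index & 3) == 2:
        return vec[2]  # z
    return vec[3]      # w


def read_with_byte_address(regs, index):
    """Turn the scalar index into a byte address and do one scalar read."""
    flat = [c for vec in regs for c in vec]  # same storage viewed as scalars
    byte_address = index * BYTES_PER_COMPONENT
    return flat[byte_address // BYTES_PER_COMPONENT]


# gl_ClipDistance[8] occupies two vec4 registers in this model.
clip_distance = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
```

Both functions return the same value for every index; the difference is that the second needs no per-component control flow.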
>>
>> A second motivation is varying packing, which is required by the GL
>> spec, and may lead to use of TEMP arrays, which, albeit improved now,
>> will impair performance when used (on nv50 they go to uncached memory
>> which is very slow).
>>
>> That case occurs if, for instance, a varying float[8] is accessed
>> indirectly and has to be packed into
>> OUT[0..1].xyzw, GENERIC[0..1]
>> instead of
>> OUT[0..7].x, GENERIC[0..7]
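
As a rough illustration of the two layouts for a varying float[8], the following Python sketch maps an array element to a (register, component) pair. The helper names are hypothetical and the model assumes the array starts at OUT[0].x.

```python
COMPONENT_NAMES = "xyzw"


def packed_location(element):
    """Packed layout: four scalars share one vec4 register."""
    return element // 4, COMPONENT_NAMES[element % 4]


def unpacked_location(element):
    """Unpacked layout: one register per scalar, only .x is used."""
    return element, "x"


def registers_needed(length, packed):
    """Number of vec4 registers a float[length] occupies in each layout."""
    return (length + 3) // 4 if packed else length
```

In this model element 5 lands in OUT[1].y when packed and in OUT[5].x when unpacked, and the packed array needs 2 registers instead of 8.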
>>
>> So far I've come up with 2 choices (all available only if the driver
>> supports e.g. PIPE_CAP_TGSI_SCALAR_REGISTERS):
>>
>>
>> 1. SCALAR DECLARATIONS
>>
>> Using float gl_ClipDistance[8] as example, it could be declared as:
>>
>> OUT[0..7].x, CLIPDIST, ARRAY(1) where the .x now means that it's a
>> single component per OUT[index]
>>
>> Now this obviously means that a single OUT[i] doesn't always consume 16
>> bytes / 4 components anymore, which may be somewhat disturbing, since
>> the address of an output can't be directly inferred solely from its
>> index anymore.
>> However, that doesn't really constitute a problem if all access is
>> either direct or comes with an ARRAY() reference.
>>
>> For varying packing, which happens only for user defined variables, and
>> hence TGSI_SEMANTIC_GENERIC, it gets a bit uglier:
>>
>> (NOTE: GL requires us to be able to support exactly the amount of
>> components we report, failing due to alignment is not allowed. Hence the
>> GLSL compiler may put some variables at unaligned locations, see
>> ir_variable.location_frac):
>>
>> A GENERIC semantic index should always cover 4 components so that a
>> fixed location can be assigned for it (drivers usually do this since it
>> makes an extra dynamic linkage pass when shaders are changed
>> unnecessary, as intended by GL_ARB_separate_shader_objects).
>>
>> So, this would be valid:
>> OUT[0..3].x, GENERIC[0]
>> OUT[4..5].xy, GENERIC[1]
>> OUT[6], GENERIC[2]
>> Note how 3 OUT[indices] only consume 1 GENERIC[index].
>>
>> If we, instead, allocated semantic index per register index instead of
>> per 4 components, we would have:
>> OUT[0..3].x, GENERIC[0]
>> OUT[4..5].xy, GENERIC[4]
>> OUT[6], GENERIC[6]
>> This would >waste space<, since GENERIC[4,6] would have to go to
>> output_space[addresses 0x40, 0x60] so it could link with
>> IN[6], GENERIC[6]
>> where we have no information about the size of GENERIC[0 .. 5], and
>> wasting space like that means the advertised number of varying
>> components cannot be satisfied.
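
The arithmetic behind the wasted space can be sketched in Python. This assumes, as the 0x40 and 0x60 addresses above imply, that each GENERIC index maps to a fixed 16-byte slot relative to the start of the GENERIC range, and that every declaration starts on a vec4 boundary; the function names are hypothetical.

```python
SLOT_BYTES = 0x10  # assumed: one GENERIC index = one vec4 = 16 bytes

# (first OUT index, last OUT index, components per register) from the example
DECLARATIONS = [(0, 3, 1), (4, 5, 2), (6, 6, 4)]


def semantic_per_vec4(decls):
    """Assign one GENERIC index per 4 components actually used."""
    result, used = [], 0
    for first, last, width in decls:
        result.append(used // 4)
        used += (last - first + 1) * width
    return result


def semantic_per_register(decls):
    """Assign the GENERIC index equal to the first register index."""
    return [first for first, _last, _width in decls]


def reserved_bytes(semantics, decls):
    """Output space spanned from GENERIC[0] to the end of the last declaration."""
    first, last, width = decls[-1]
    last_size = (last - first + 1) * width * 4
    return semantics[-1] * SLOT_BYTES + last_size
```

With these assumptions the per-vec4 scheme yields GENERIC[0, 1, 2] and spans 0x30 bytes, while the per-register scheme yields GENERIC[0, 4, 6] and spans 0x70 bytes for the same 12 components.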
>>
>>
>> And as a last step, if varyings are placed at non-vec4 boundaries, we
>> would have to be able to specify fractional semantic indices, like this:
>> OUT[0..2].x, GENERIC[0].x
>> OUT[3].x, GENERIC[0].w
>>
>>
>>
>> 2. SCALAR ADDRESS REGISTER VALUES
>>
>> All this can be avoided by always declaring full vec4s, and adding the
>> possibility of doing indirect addressing on a per-component basis:
>>
>> varying float a[4] becomes:
>> uniform int i;
>> a[i+5] = 999 becomes:
>>
>> OUT[0].xyzw, ARRAY(1)
>> UARL_SCALAR ADDR[0].x, CONST[0].xxxx
>> MOV OUT(array 1)[ADDR[0].x+1].y, IMM[0].xxxx
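
One possible reading of the proposed UARL_SCALAR semantics, sketched in Python: the address register counts scalar components rather than vec4s, and the constant part of the index (5 here) is split into a register offset and a component. This interpretation and the helper names are assumptions for illustration, not an existing TGSI definition.

```python
COMPONENT_NAMES = "xyzw"


def split_constant_offset(offset):
    """Split a scalar offset into (vec4 register offset, component)."""
    return offset // 4, COMPONENT_NAMES[offset % 4]


def effective_component(addr, reg_offset, component):
    """Scalar slot hit by OUT[ADDR.x + reg_offset].component
    when ADDR.x counts components instead of vec4 registers."""
    return addr + 4 * reg_offset + COMPONENT_NAMES.index(component)
```

Under this reading the constant 5 becomes register offset 1 and component y, matching the `+1].y` in the MOV above, and the effective scalar slot is always i + 5.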
>>
>> The only difficulty with this is that we have to split TGSI
>> instructions accessing unaligned vectors:
>> (NOTE: this can always be avoided with TGSI_FILE_TEMPORARY, but varyings
>> may have to be packed).
>>
>> With suggestion (1), 2 packed (and hence unaligned) vec3 arrays and a
>> single vec2 would look like this:
>> OUT[0..3].xyz, GENERIC[0].x
>> OUT[4..5].xyz, GENERIC[3].x
>> OUT[6].xy, GENERIC[4].zw
>> and we could still do:
>> ADD OUT[5].xyz, TEMP[0], TEMP[1]
>>
>> Now, these would have to be merged and declared as:
>> OUT[0..4].xyzw
>>
>> and the 2nd vec3 would be { OUT[0].w, OUT[1].xy }
>>
>> instead of simply OUT[1].xyz
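
The splitting at vec4 boundaries can be sketched as follows. This is a minimal Python model assuming the packed range is tightly packed and addressed by a scalar offset from OUT[0].x; `split_access` is a hypothetical helper, not part of glsl-to-tgsi.

```python
COMPONENT_NAMES = "xyzw"


def split_access(first_component, width):
    """Split `width` consecutive scalars into per-vec4 (register, writemask) pieces."""
    pieces = []
    component = first_component
    end = first_component + width
    while component < end:
        register = component // 4
        # Stop at the end of this vec4 or at the end of the access.
        stop = min(end, (register + 1) * 4)
        mask = COMPONENT_NAMES[component % 4:(stop - 1) % 4 + 1]
        pieces.append((register, mask))
        component = stop
    return pieces
```

For example, OUT[5].xyz from the layout in suggestion (1) starts at scalar 15 (12 for the first array plus one vec3), so the ADD above would turn into two writes, to OUT[3].w and OUT[4].xy. The trailing vec2 at scalar 18 stays in one piece, OUT[4].zw.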
>>
>> A problem with this is that the GLSL compiler, while it can do the
>> packing into vec4s and splitting up access, cannot, iirc, access
>> individual components of a vec4 indirectly like TGSI would be able to.
>> To avoid TEMP arrays we'd have to disable the last phase of varying
>> packing (that actually converts the code to using vec4s).
>> It would still be able to assign fractional locations to guarantee that
>> linkage works, but glsl-to-tgsi would likely have to split access at
>> vec4 boundaries itself (more work), and declare the whole packed range
>> as a single TGSI array.
>> However, assuming that varyings with the *same* semantic can always be
>> assigned to contiguous slots (output memory space locations) by the
>> driver, and this really only happens for TGSI_SEMANTIC_GENERIC (user
>> varyings), the problem in the example at the top shouldn't arise, and
>> we're able to group all those into a single array.
>>
>>
>> Now, I hope someone was able to get through this and would like to
>> comment :)
> Not sure I fully understand this, but I'm thinking "whenever in doubt,
> use something close to what dx10 does" since that's likely going to work
> reasonably with different hw. Maybe declaring those special values
> differently (not just as output reg) would help?
What DX10 does is make indirect access of varyings illegal. That's not
possible with OpenGL ...


> Roland