[Mesa-dev] mediump support: future work

Tue May 5 01:59:58 UTC 2020

On Mon, May 4, 2020 at 5:09 PM Marek Olšák <maraeo at gmail.com> wrote:
>
> 16-bit varyings only make sense if they are packed, i.e. we need to fit 2 16-bit 4D varyings into 1 vec4 slot to save memory for IO. Without that, AMD (and most others?) won't benefit from 16-bit IO much.
>

I guess for !flat varyings that mostly makes sense if you are manually
interpolating in the fs?  We can, but don't have to and it doesn't
seem like a benefit to do so.  Maybe it would be a win for flat
varyings, but that is unclear; we might win more from switching to the
instruction we use for interpolated varyings instead of the one that
bypasses interpolation.  At least that seems to be what blob does on
new gens.
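
To make the packing arithmetic concrete, here is a minimal sketch of
fitting two 16-bit 4D varyings into one vec4 slot.  It assumes a slot is
four 32-bit components (16 bytes) and that the two vectors are stored
back to back; the real layout is hw-specific, and the function names are
hypothetical, not Mesa API.

```python
import struct

# Assumption: one IO slot holds four 32-bit components.
VEC4_SLOT_BYTES = 4 * 4


def pack_two_half_vec4(a, b):
    """Pack two 4-component vectors as fp16 into the bytes of one vec4 slot."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected two 4-component vectors")
    # 'e' is the IEEE 754 binary16 (half float) format code in struct.
    return struct.pack("<8e", *a, *b)


def unpack_two_half_vec4(slot):
    """Inverse of pack_two_half_vec4: recover both vectors from one slot."""
    values = struct.unpack("<8e", slot)
    return list(values[:4]), list(values[4:])


# Values chosen to be exactly representable in fp16.
color = [1.0, 0.5, 0.25, 1.0]
normal = [0.0, -1.0, 0.0, 2.0]
slot = pack_two_half_vec4(color, normal)
```

Without packing, the same two varyings would take two slots, which is
the IO memory saving being discussed.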

> 16-bit uniforms would help everybody, because there is potential for uniform packing, saving memory (and cache lines).
>

it does mean futzing w/ uniforms before uploading.. I'm not sure (for
us) that is a win vs just using the hw builtin automagic fp32->fp16
push-constant conversion.. the push constant upload is pipelined with
draws afaict for newer gens, and from shader standpoint, other than
the restrictions about which instructions can use const src and when,
they are basically free to load.. ie. loading cN.m as hcN.m is free.

so this might also want to be a driver option?

> The other items are just for eliminating conversion instructions. We must have more vectorized 16-bit vec2 instructions than "conversion instructions + vec2 packing instructions" for mediump to pay off. We also don't get decreased register usage if we are not vectorized, so mediump is a tough sell at the moment.

we don't really have "vectorized fp16".. we have a sort of "vectorish"
mode where a scalar instruction can repeat, incrementing dst register
and optionally incrementing individual src registers (ie. we can do
.yyy or .yzw swizzles but not others).  That is orthogonal to fp16
(but there may be lower latency for fp16) and mostly seems to help
reducing the latency to load src registers (since hw can load a
non-incremented src register once for each of the scalar instructions
packed together).  Scalar 16b instructions might be a win, but it is a
bit more complicated to tease out the instruction cycles vs the
register load cost.
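
A rough model of the "vectorish" repeat described above, as a sketch
only: one scalar op is expanded into several, the dst register always
increments, and each src increments only if flagged.  Registers are
modeled as a flat scalar index (r0.x = 0, r0.y = 1, ...), and
expand_repeat() and its arguments are hypothetical, not the actual ir3
encoding.

```python
COMPONENTS = "xyzw"


def reg_name(index):
    """Name a flat scalar register index, e.g. 5 -> 'r1.y'."""
    return f"r{index // 4}.{COMPONENTS[index % 4]}"


def expand_repeat(op, dst, srcs, count):
    """Expand one repeated scalar instruction into `count` scalar ops.

    dst:  flat index of the first destination register
    srcs: list of (flat index, increments) pairs; a src with
          increments=False is read from the same register every time
    """
    ops = []
    for i in range(count):
        names = [reg_name(base + (i if inc else 0)) for base, inc in srcs]
        ops.append(f"{op} {reg_name(dst + i)}, {', '.join(names)}")
    return ops


# First src behaves like a .yyy swizzle, second like .yzw.
ops = expand_repeat("add", dst=0, srcs=[(5, False), (9, True)], count=3)
for line in ops:
    print(line)
```

The non-incremented src (r1.y here) is the one the hw only needs to load
once for the whole group.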

balancing register pressure vs "vectorish" instructions is a thing I'm
still working on.  But ignoring that, fp16 is a win for us because of
register pressure.. ie. a full-reg conflicts with two half-regs.
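
The register pressure point can be sketched with a toy model.  The
assumption here (for illustration only) is that full register N
overlaps half registers 2N and 2N+1; the helper names are hypothetical.

```python
def half_regs_for_full(full_index):
    """Half registers overlapped by one full register (assumed mapping)."""
    return {2 * full_index, 2 * full_index + 1}


def conflicts(full_index, half_index):
    """True if the full register and the half register share storage."""
    return half_index in half_regs_for_full(full_index)


def footprint(n_full, n_half):
    """Register file space, in half-register units, for the live values."""
    return 2 * n_full + n_half


# Eight live values, all 32-bit, vs six of them demoted to 16-bit.
all_full = footprint(8, 0)
mostly_half = footprint(2, 6)
```

In this model, eight live 32-bit values take 16 half-register units,
while two 32-bit plus six 16-bit values take 10.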

For sure, a lot of the gain involves avoiding excessive conversions,
but in a lot of common cases we can fold conversion into alu
instruction in the backend..

BR,
-R

>
> Marek
>
> On Mon, May 4, 2020 at 7:03 PM Rob Clark <robdclark at gmail.com> wrote:
>>
>> On Mon, May 4, 2020 at 11:44 AM Marek Olšák <maraeo at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > This is the status of mediump support in Mesa. What I listed is what AMD GPUs can do. "Yes" means what Mesa supports.
>> >
>> > Feature                                              FP16 support  Int16 support
>> > ALU                                                  Yes           No
>> > Uniforms                                             No            No
>> > VS in                                                No            No
>> > VS out / FS in                                       No            No
>> > FS out                                               No            No
>> > TCS, TES, GS out / in                                No            No
>> > Sampler coordinates (only coord, derivs, lod, bias;  No            ---
>> >   not offset and compare)
>> > Image coordinates                                    ---           No
>> > Return value from samplers (incl. sampler buffers)   Yes           No
>> > Return value from image loads (incl. image buffers)  No            No
>> > Data source for image stores (incl. image buffers)   No            No
>> > If 16-bit sampler/image instructions are surrounded  No            No
>> >   by conversions, promote them to 32 bits
>> >
>> > Please let me know if you don't see the table correctly.
>> >
>> > I'd like to know if I can enable some of them using the existing FP16 CAP. The only drivers supporting FP16 are currently Freedreno and Panfrost.
>> >
>>
>> I think in general it should be ok.
>>
>> I think for ir3 we want 32b inputs/outputs for geom stages
>> (vs/hs/ds/gs).  For frag outs we use nir_lower_mediump_outputs.. maybe
>> this is a good approach to continue, to use a simple nir lowering pass
>> for cases where a shader stage can directly take 16b input/output.
>> For frag inputs we fold the narrowing conversion into the varying
>> fetch instruction in backend.
>>
>> int16 would be pretty useful, for loop counters especially.. these can
>> have a long live-range and currently wastefully occupy a full 32b reg.
>>
>> Uniforms we haven't cared too much about, since we can (usually) read
>> a 32b uniform as a 16b and fold that directly into alu instructions..
>> we handle that in the backend.
>>
>> Pushing mediump support further would be great, and we can definitely
>> help if it ends up needing changes in freedreno backend.  The deqp
>> coverage in CI should give us pretty good confidence about whether or
>> not we are breaking things in the ir3 backend.
>>
>> BR,
>> -R