[Bug 92760] Add FP64 support to the i965 shader backends

Sat May 21 21:10:48 UTC 2016

https://bugs.freedesktop.org/show_bug.cgi?id=92760

--- Comment #84 from Francisco Jerez <currojerez at riseup.net> ---
(In reply to Iago Toral from comment #83)
>[...]
>> due to an instruction decompression bug (or feature?) that causes it
>> to increment the register offset of the second half by a fixed amount
>> of one GRF regardless of the vertical stride.  The end result is the
>> instruction behaves as if the source had the logical region
>> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
>> units of whole 64bit components) and we have managed to get the ZW
>> components to cross the vec2 boundary and read the XY components from
>> the first oword of the register.
>
> For the sake of completeness, one idea that we discussed was to try and exploit
> this to access components Z/W like:
>
> r0.2<0,2,1>.xyzw:df
>
> That is, we start the region at offset 16B (i.e. right where Z for the first
> vertex starts) and then use the vstride=0 trick to read past the first GRF and
> access the Z/W data for the second vertex in the SIMD4x2 execution.
>
> I have to say, however, that I tried to do some quick tests to verify that this
> actually works and it didn't for me. I think in this case we might be violating
> a few hardware restrictions because:
>

I'm pretty sure I tried that on HSW at some point and it worked fine for
me.  I'll see if I can dig up the exact assembly code I used.

> a) The region (for an execsize of 8) would span over 3 GRFs (which I think is
> not allowed?)

Not really, the region would span four rows (because the width has a
fixed value of two for DF instructions in Align16 mode and as you say
the execution size is eight), like:

 r0.2 r0.3
 r0.2 r0.3
 r1.2 r1.3
 r1.2 r1.3

So the whole region is contained in two GRF registers which is allowed
by the hardware.
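
To make the row layout concrete, here is a rough Python model of how I
would expect the region r0.2<0,2,1>:df to be expanded for an execution
size of eight.  It is only a sketch: region_rows() and its parameters
are made-up names, and the rule "second half starts one GRF later
regardless of vstride" is the decompression behaviour described
earlier in this thread, not something taken from the docs.

```python
GRF_BYTES = 32  # one GRF register holds 32 bytes
DF_BYTES = 8    # one double-precision element


def region_rows(base_reg, base_sub, vstride, width=2, exec_size=8):
    """List the (register, DF subregister) pairs read by each row.

    Assumed model: each half of the compressed instruction applies the
    region normally, and the second half starts exactly one GRF after
    the first, whatever the vertical stride is.
    """
    dfs_per_grf = GRF_BYTES // DF_BYTES
    rows_per_half = (exec_size // 2) // width
    rows = []
    for half in range(2):
        for row in range(rows_per_half):
            start = base_sub + row * vstride
            rows.append([
                (base_reg + half + (start + e) // dfs_per_grf,
                 (start + e) % dfs_per_grf)
                for e in range(width)  # horizontal stride of one
            ])
    return rows


rows = region_rows(base_reg=0, base_sub=2, vstride=0)
for row in rows:
    print(" ".join(f"r{reg}.{sub}" for reg, sub in row))
```

With vstride=0 the four rows only ever touch r0 and r1, four channels
each, which is the layout listed in the little table.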

> b) The first and last GRFs covered would select less DF elements than the
> second. I think I read something in the docs that required that regions
> selected "homogeneous" parts of the registers selected.
>

The first GRF (r0) would be read by four channels and the second GRF
(r1) by another four, so the hardware should be happy AFAICT. ;)

>>  If vector splitting is applied
>> wisely in addition to this, we effectively gain one degree of freedom
>> and can now represent a large portion of the dvec4 swizzle space by
>> splitting into two instructions only (close to 70% of all swizzle
>> combinations).  A particularly fruitful way to split (works for
>> roughly 55% of the swizzle space) involves dividing the XW and YZ
>> components into separate instructions, because on the one hand it
>> ensures full orthogonality between the swizzle of each component and
>> on the other hand it allows shifting the second component in each pair
>> with respect to the first by 16B freely using the vstride trick, and
>
> To clarify this last sentence a bit further, the idea here is that with an
> execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each

VStride doesn't necessarily have to be 2, and the fact that you can
control the vstride independently from the other regioning parameters
triples the number of different swizzles you can represent using
this trick.  E.g. if you want both components of the XW (or YZ)
instruction to read from the same oword of the input (either the first
or the second one) you set vstride to zero and then pick the appropriate
swizzle for each component, which is always possible because the
swizzles for the X and W components of the instruction (or respectively
Y and Z) are fully independent.

> one 16B large). With XW and YZ swizzles each component in the swizzle falls in
> a separate 16B region and we select each of the components using a different
> 32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2
> execution, component X falls in the first 16B region and W in the second (for
> the second vertex they fall in the 3rd and 4th respectively). We select X in the
> first region using a xy 32-bit swizzle and we select W in the second region by
> using a zw 32-bit swizzle.
>
> This means that we could implement something like:
>
> mov(8) r0.xyzw:df r2.xyzw:df
>

That's the identity swizzle, which you could implement as-is ;).  Or is
it just an example for the sake of explanation?

> By splitting it in two like this:
>
> mov(8) r0.xw:df r2.xw:df
> mov(8) r0.yz:df r2.yz:df
>
> And then generating:
>
> mov(8) r0<1>.xw r2<2,2,1>.xyzw
> mov(8) r0<1>.yz r2<2,2,1>.zwxy
>

The swizzles of the second instruction seem backwards, they would have
to be:

| mov(8) r0<1>.yz r2<2,2,1>.xyzw

for the Y component of the instruction to read r2.1 (the Y component of
the original dvec4) and the Z component of the instruction to read r2.2
(the Z component of the vector).  The assembly you have written would be
equivalent to the logical instruction (with whole-DF swizzle notation):

| mov(8) r0.xyzw:df r2.xxww:df

>> at the same time it sidesteps the writemask limitation (for
>> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
>> properties at the same time).  In cases where the vstride trick
>> doesn't work (e.g. BDW) SIMD splitting would have to be applied
>> afterwards to unroll the instruction into a pair of SIMD4 instructions
>> with vstride=0 and so that the base register of the second one is
>> shifted by one GRF from the first one -- The end result is still
>> strictly better than scalarizing the original instruction completely
>> in most cases since the latter requires four compressed instructions
>> while the proposed approach generates the same number of uncompressed
>> instructions.
>
> Are we still interested in enabling the vec4 backend on BDW+ now that these
> gens are fully scalar?
>

I guess it doesn't matter much.  I wouldn't worry too much about BDW+
right now, but I think we can support it almost for free in terms of
implementation complexity (because you'll need a SIMD lowering pass for
other reasons anyway), so it should still be possible to make them go
back to vector in the future if it happens to be useful (e.g. where
register pressure kills the scalar back-end).

>> The following table summarizes how the splitting could be carried out
>> for each swizzle combination.  Each row corresponds to some set of
>> combinations of the first two swizzle components and each column
>> corresponds to some set of combinations of the last two swizzle
>> components.  'v' is used as a shorthand for the set of DF swizzles
>> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
>> used for the second oword (i.e. Z or W).  The letters indicate the
>> kind of splitting that needs to be carried out, from A which has the
>> highest likelihood to be supported natively as a single instruction,
>> to E which is the worst and needs to be fully scalarized in all cases.
>> The meaning of the class subscripts is explained below together with
>> the description of each splitting class.
>> 
>>       vv  v^   ^v   ^^
>>   vv  A   B    B    A
>>   v^  Cy  Dyz  B    B
>>   ^v  Cx  B    Dxw  B
>>   ^^  E   Cz   Cw   A
>> 
>> E.g. swizzle XZYX reads the first oword for all components except the
>> second which reads the second oword, so it would belong to the set
>> v^vv, which corresponds to splitting class Cy.  Because lower
>> (i.e. worse) splitting classes are typically more general than higher
>> ones (with some exceptions discussed below), it would be allowable for
>> an implementation to demote swizzle combinations to a splitting class
>> worse than the one shown on the table in order to avoid implementing
>> certain splitting classes -- E.g. it would be valid for an
>> implementation to ignore the table above altogether and implement
>> splitting class E exclusively.
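
For reference, the table lookup described in the quoted text can be
written down in a few lines of Python.  This is just an illustrative
sketch: the names (TABLE, oword_pattern, splitting_class) are made up
here and do not correspond to anything in the Mesa tree.

```python
# Rows are indexed by the first two swizzle components, columns by the
# last two.  'v' = first oword (X or Y), '^' = second oword (Z or W).
TABLE = {
    "vv": {"vv": "A",  "v^": "B",   "^v": "B",   "^^": "A"},
    "v^": {"vv": "Cy", "v^": "Dyz", "^v": "B",   "^^": "B"},
    "^v": {"vv": "Cx", "v^": "B",   "^v": "Dxw", "^^": "B"},
    "^^": {"vv": "E",  "v^": "Cz",  "^v": "Cw",  "^^": "A"},
}


def oword_pattern(swizzle):
    """Map a four-component DF swizzle such as 'XZYX' to 'v^vv'."""
    s = swizzle.lower()
    if len(s) != 4 or any(c not in "xyzw" for c in s):
        raise ValueError(f"not a four-component swizzle: {swizzle!r}")
    return "".join("v" if c in "xy" else "^" for c in s)


def splitting_class(swizzle):
    """Look up the splitting class of a dvec4 swizzle in the table."""
    pattern = oword_pattern(swizzle)
    return TABLE[pattern[:2]][pattern[2:]]


print(oword_pattern("XZYX"), splitting_class("XZYX"))
```

The XZYX example from the text comes out as pattern v^vv and class Cy.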
>> 
>>  - Class A (48/256 swizzles) can be subdivided into:
>>    - Class A+ (12/256 swizzles) which can always be supported
>>      natively.  It applies for swizzles such that the first and third
>>      components pair up correctly and the second and fourth components
>>      pair up as well -- To be more precise, two swizzle components i
>>      and j are said to "pair up" correctly if they are equal modulo
>>      two, i.e. if they refer to the same subcomponent within the oword
>>      they refer to respectively -- E.g. x pairs up with z and y pairs
>>      up with w.
>
> As far as I understand, some of these don't need the vstride trick (like XYZW,
> which we can represent with a region like <2,2,1>.xyzw:df) while others do
> (like XYXY, which would need something like <0,2,1>.xyzw:df)
>

Yeah, it depends on whether the second pair of vector components of the
instruction map to the same oword of the input (in which case you need
the trick) or a different one (in which case you don't).
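
The "pair up" rule and the counts quoted for class A are easy to
check by brute force.  The sketch below enumerates all 256 swizzles;
the helper names are made up for illustration, and the class A test
simply encodes the three 'A' cells of the table (vvvv, vv^^, ^^^^).

```python
from itertools import product


def oword(c):
    """0 for the first oword (x, y), 1 for the second (z, w)."""
    return "xyzw".index(c) // 2


def pairs_up(a, b):
    """True if both components are the same subcomponent of their oword."""
    return "xyzw".index(a) % 2 == "xyzw".index(b) % 2


def is_class_a(s):
    pattern = tuple(oword(c) for c in s)
    return pattern in {(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)}


def is_class_a_plus(s):
    return is_class_a(s) and pairs_up(s[0], s[2]) and pairs_up(s[1], s[3])


def needs_vstride_trick(s):
    """Both halves of the vector read the same oword of the input."""
    return oword(s[0]) == oword(s[2])


swizzles = ["".join(p) for p in product("xyzw", repeat=4)]
class_a = [s for s in swizzles if is_class_a(s)]
a_plus = [s for s in class_a if is_class_a_plus(s)]
print(len(class_a), len(a_plus))
for s in a_plus:
    print(s, "vstride trick" if needs_vstride_trick(s) else "plain region")
```

This gives 48 swizzles in class A, 12 of them in A+, with XYZW needing
no trick and XYXY needing it, consistent with the two examples above.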

>>    - Class A- (36/256 swizzles) which applies in every other case and
>>      requires splitting into XW+YZ components.
>
> For example, XXYY. In this case we would split it into XX and YY. The former
> would be represented using a <2,2,1>.xyxy region (which actually selects XXZZ)
> and the latter using <2,2,1>.zwzw (which actually selects YYWW).
>

Hm, the example XXYY does belong to class A-, but (like everything else
in this class) you would have to split the instruction into XW+YZ
components and then emit code like:

| mov(8) r0.xw:df r2<0,2,1>.xyzw:df
| mov(8) r0.yz:df r2<0,2,1>.zwxy:df

>>[...]
>> This approach could be extended to instructions with multiple sources
>> without much difficulty by dividing the instruction iteratively for
>> each source -- The caveat is that for multiple-source instructions the
>> probability of having to make use of one of the worse classes
>> increases, but the situation doesn't change qualitatively and the best-
>> (one instruction) and worst-case scenarios (full scalarization) remain
>> unchanged.
>
> Yeah, 2-operand instructions are going to lead to more splitting in the end but
> the good thing about this is that we can start by fully scalarizing and then
> improve things by supporting more swizzle classes iteratively.
>

Agreed. :)
