[Bug 92760] Add FP64 support to the i965 shader backends

Wed May 18 10:32:54 UTC 2016

https://bugs.freedesktop.org/show_bug.cgi?id=92760

--- Comment #83 from Iago Toral <itoral at igalia.com> ---
(In reply to Francisco Jerez from comment #82)
> I've been running some experiments on vec4 hardware and playing some
> combinatorics trying to come up with a plan to address the limitations
> of Align16 double-precision instructions we've been discussing.  Part
> of the following is optional and could be left out for a first
> approximation without sacrificing functional correctness -- The point
> is to have a reasonably workable initial implementation that still
> allows some room for improvement in the future.

Thanks for the very detailed post, Curro! It is going to be super useful to have
it documented here going forward :)

Now that we have landed the scalar backend I will be back to this.

> The high-level idea is to have the front-end feed the back-end with
> (mostly) unlowered dvec4 instructions that would be processed by the
> back-end optimizer as usual, without considering the extremely
> annoying limitations of the hardware.  After the back-end optimization
> loop some additional (fully self-contained) machinery would lower
> double-precision instructions into a form the EUs would be able to
> deal with.

Right, that sounds like a good idea. As you pointed out, it will also make for a
much cleaner solution, since it lets us isolate the parts that deal with all the
peculiar aspects of fp64/vec4.

(...)

> The following trick allows representing many swizzles that cross vec2
> boundaries by exploiting some of the hardware's stupidity.  Assume you
> have a compressed SIMD4x2 DF instruction and a given source has
> vstride set to zero and identity swizzle.  For each channel the
> hardware would read the following locations from the register file, at
> least ideally:
> 
>  x0 <- r0.0:df
>  y0 <- r0.1:df
>  z0 <- r0.0:df
>  w0 <- r0.1:df
>  x1 <- r0.0:df
>  y1 <- r0.1:df
>  z1 <- r0.0:df
>  w1 <- r0.1:df
> 
> assuming that r0.0 is the base register.  This is actually what
> happens on BDW+ and doesn't help us much except for representing
> uniforms.  Pre-Gen8 hardware however behaves like:
> 
>  x0 <- r0.0:df
>  y0 <- r0.1:df
>  z0 <- r0.0:df
>  w0 <- r0.1:df
>  x1 <- r1.0:df
>  y1 <- r1.1:df
>  z1 <- r1.0:df
>  w1 <- r1.1:df
> 
> due to an instruction decompression bug (or feature?) that causes it
> to increment the register offset of the second half by a fixed amount
> of one GRF regardless of the vertical stride.  The end result is the
> instruction behaves as if the source had the logical region
> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
> units of whole 64bit components) and we have managed to get the ZW
> components to cross the vec2 boundary and read the XY components from
> the first oword of the register.
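
To make the two read patterns above easier to compare, here is a minimal
sketch that just reproduces the two tables. It is not a model of the real EU:
the function name and the bump_second_half flag are made up for illustration,
and it assumes a <0,2,1> region (width 2) with identity swizzle as described.

```python
def df_source_reads(base_grf=0, bump_second_half=True):
    """List (channel, location) pairs for a compressed SIMD4x2 DF instruction
    whose source has vstride=0 and an identity swizzle.

    bump_second_half=True mimics the pre-Gen8 behaviour described above, where
    the second half reads from the next GRF regardless of the vertical stride.
    bump_second_half=False mimics the BDW+ behaviour (hypothetical flag name).
    """
    reads = []
    for half in (0, 1):
        grf = base_grf + (half if bump_second_half else 0)
        for chan_index, chan in enumerate("xyzw"):
            # With vstride=0 and width 2, every row rereads DF elements 0 and 1.
            reads.append((f"{chan}{half}", f"r{grf}.{chan_index % 2}:df"))
    return reads


for label, bump in (("BDW+", False), ("pre-Gen8", True)):
    print(label)
    for chan, loc in df_source_reads(bump_second_half=bump):
        print(f"  {chan} <- {loc}")
```

With bump_second_half=True the second vertex reads r1.0/r1.1, which is what
gives the logical region r0.0<4>.xyxy mentioned above.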

For the sake of completeness, one idea that we discussed was to try and exploit
this to access components Z/W like:

r0.2<0,2,1>.xyzw:df

That is, we start the region at offset 16B (i.e. right where Z for the first
vertex starts) and then use the vstride=0 trick to read past the first GRF and
access the Z/W data for the second vertex in the SIMD4x2 execution.

I have to say, however, that I ran some quick tests to verify that this
actually works and it didn't for me. I think in this case we might be violating
a few hardware restrictions because:

a) The region (for an execsize of 8) would span 3 GRFs (which I think is
not allowed?)
b) The first and last GRFs covered would select fewer DF elements than the
second. I think I read something in the docs requiring that a region select
"homogeneous" parts of the registers it covers.

>  If vector splitting is applied
> wisely in addition to this, we effectively gain one degree of freedom
> and can now represent a large portion of the dvec4 swizzle space by
> splitting into two instructions only (close to 70% of all swizzle
> combinations).  A particularly fruitful way to split (works for
> roughly 55% of the swizzle space) involves dividing the XW and YZ
> components into separate instructions, because on the one hand it
> ensures full orthogonality between the swizzle of each component and
> on the other hand it allows shifting the second component in each pair
> with respect to the first by 16B freely using the vstride trick, and

To clarify this last sentence a bit further, the idea here is that with an
execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each
one 16B large). With XW and YZ swizzles, each component in the swizzle falls in
a separate 16B region and we select each of the components using a different
32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2
execution, component X falls in the first 16B region and W in the second (for
the second vertex they fall in the 3rd and 4th respectively). We select X in the
first region using an xy 32-bit swizzle and we select W in the second region by
using a zw 32-bit swizzle.

This means that we could implement something like:

mov(8) r0.xyzw:df r2.xyzw:df

By splitting it in two like this:

mov(8) r0.xw:df r2.xw:df
mov(8) r0.yz:df r2.yz:df

And then generating:

mov(8) r0<1>.xw r2<2,2,1>.xyzw
mov(8) r0<1>.yz r2<2,2,1>.zwxy

> at the same time it sidesteps the writemask limitation (for
> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
> properties at the same time).  In cases where the vstride trick
> doesn't work (e.g. BDW) SIMD splitting would have to be applied
> afterwards to unroll the instruction into a pair of SIMD4 instructions
> with vstride=0 and so that the base register of the second one is
> shifted by one GRF from the first one -- The end result is still
> strictly better than scalarizing the original instruction completely
> in most cases since the latter requires four compressed instructions
> while the proposed approach generates the same number of uncompressed
> instructions.

Are we still interested in enabling the vec4 backend on BDW+ now that these
gens are fully scalar?

> The following table summarizes how the splitting could be carried out
> for each swizzle combination.  Each row corresponds to some set of
> combinations of the first two swizzle components and each column
> corresponds to some set of combinations of the last two swizzle
> components.  'v' is used as a shorthand for the set of DF swizzles
> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
> used for the second oword (i.e. Z or W).  The letters indicate the
> kind of splitting that needs to be carried out, from A which has the
> highest likelihood to be supported natively as a single instruction,
> to E which is the worst and needs to be fully scalarized in all cases.
> The meaning of the class subscripts is explained below together with
> the description of each splitting class.
> 
>       vv  v^   ^v   ^^
>   vv  A   B    B    A
>   v^  Cy  Dyz  B    B
>   ^v  Cx  B    Dxw  B
>   ^^  E   Cz   Cw   A
> 
> E.g. swizzle XZYX reads the first oword for all components except the
> second which reads the second oword, so it would belong to the set
> v^vv, which corresponds to splitting class Cy.  Because lower
> (i.e. worse) splitting classes are typically more general than higher
> ones (with some exceptions discussed below), it would be allowable for
> an implementation to demote swizzle combinations to a splitting class
> worse than the one shown on the table in order to avoid implementing
> certain splitting classes -- E.g. it would be valid for an
> implementation to ignore the table above altogether and implement
> splitting class E exclusively.
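
The table lookup above can be written down directly. The following is only a
sketch of the classification step (the names SPLIT_TABLE, oword_pattern and
splitting_class are made up here, nothing like this exists in the driver yet).
It also counts how many of the 256 swizzles land in each class, which can be
checked against the class sizes quoted further down.

```python
from collections import Counter
from itertools import product

# Rows: first two swizzle components; columns: last two.
# 'v' = first oword (X or Y), '^' = second oword (Z or W).
SPLIT_TABLE = {
    "vv": {"vv": "A",  "v^": "B",   "^v": "B",   "^^": "A"},
    "v^": {"vv": "Cy", "v^": "Dyz", "^v": "B",   "^^": "B"},
    "^v": {"vv": "Cx", "v^": "B",   "^v": "Dxw", "^^": "B"},
    "^^": {"vv": "E",  "v^": "Cz",  "^v": "Cw",  "^^": "A"},
}


def oword_pattern(swizzle):
    """Map a logical dvec4 swizzle such as 'xzyx' to its v/^ pattern."""
    s = swizzle.lower()
    if len(s) != 4 or any(c not in "xyzw" for c in s):
        raise ValueError(f"not a dvec4 swizzle: {swizzle!r}")
    return "".join("v" if c in "xy" else "^" for c in s)


def splitting_class(swizzle):
    """Look up the splitting class (with subscripts) from the table."""
    pattern = oword_pattern(swizzle)
    return SPLIT_TABLE[pattern[:2]][pattern[2:]]


# Tally classes by their first letter, ignoring the subscripts.
counts = Counter(
    splitting_class("".join(s))[0] for s in product("xyzw", repeat=4)
)
print(splitting_class("XZYX"), dict(counts))
```

Each cell of the table covers 16 swizzles, so the per-class totals follow
from the number of cells each class occupies.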
> 
>  - Class A (48/256 swizzles) can be subdivided into:
>    - Class A+ (12/256 swizzles) which can always be supported
>      natively.  It applies for swizzles such that the first and third
>      components pair up correctly and the second and fourth components
>      pair up as well -- To be more precise, two swizzle components i
>      and j are said to "pair up" correctly if they are equal modulo
>      two, i.e. if they refer to the same subcomponent within the oword
>      they refer to respectively -- E.g. x pairs up with z and y pairs
>      up with w.

As far as I understand, some of these don't need the vstride trick (like XYZW,
which we can represent with a region like <2,2,1>.xyzw:df) while others do
(like XYXY, which would need something like <0,2,1>.xyzw:df).
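
The "pair up" rule for class A+ is easy to check mechanically. This is just a
sketch of the rule as quoted above (helper names are made up for
illustration); it says nothing about which A+ swizzles need the vstride trick.

```python
from itertools import product

COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}


def pairs_up(a, b):
    """Two components pair up if they are equal modulo two, i.e. they name
    the same subcomponent within their respective owords."""
    return COMPONENT_INDEX[a] % 2 == COMPONENT_INDEX[b] % 2


def in_class_a(swizzle):
    """Class A covers the oword patterns vvvv, vv^^ and ^^^^ in the table."""
    owords = tuple(COMPONENT_INDEX[c] // 2 for c in swizzle.lower())
    return owords in {(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)}


def is_class_a_plus(swizzle):
    """Class A+: first/third and second/fourth components pair up."""
    s = swizzle.lower()
    return in_class_a(s) and pairs_up(s[0], s[2]) and pairs_up(s[1], s[3])


all_swizzles = ["".join(s) for s in product("xyzw", repeat=4)]
a_plus = [s for s in all_swizzles if is_class_a_plus(s)]
a_minus = [s for s in all_swizzles if in_class_a(s) and not is_class_a_plus(s)]
print(len(a_plus), len(a_minus))
```

The two counts can be compared with the 12/256 and 36/256 figures quoted for
A+ and A-.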

>    - Class A- (36/256 swizzles) which applies in every other case and
>      requires splitting into XW+YZ components.

For example, XXYY. In this case we would split it into XX and YY. The former
would be represented using a <2,2,1>.xyxy region (which actually selects XXZZ)
and the latter using <2,2,1>.zwzw (which actually selects YYWW).
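
To spell out how I am reading those regions: with a vstride of 2 the 32-bit
swizzle is applied to each 16B row separately, so the same swizzle picks from
XY in the first row and from ZW in the second. A small sketch of that reading
follows. It is a simplified model, not verified against hardware, and it
assumes the 32-bit swizzle keeps each DF intact (only "xy" and "zw" halves);
the function name is made up.

```python
def logical_selection(swizzle32, vertex=("X", "Y", "Z", "W")):
    """Return the DF components read for one vertex by a <2,2,1> region with
    the given 32-bit swizzle, under the per-row model described above."""
    half_to_df = {"xy": 0, "zw": 1}  # 32-bit pair -> DF index within a row
    s = swizzle32.lower()
    halves = (s[:2], s[2:])
    if len(s) != 4 or any(h not in half_to_df for h in halves):
        raise ValueError(f"swizzle does not keep DFs intact: {swizzle32!r}")
    selected = []
    for row in (0, 1):        # row 0 holds X,Y; row 1 holds Z,W
        for half in halves:   # two DF channels per 16B row
            selected.append(vertex[2 * row + half_to_df[half]])
    return "".join(selected)


for swz in ("xyzw", "xyxy", "zwzw"):
    print(f"<2,2,1>.{swz} selects {logical_selection(swz)}")
```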

>  - Class B (96/256 swizzles) can be handled by splitting into XW+YZ
>    regardless of the swizzles.
> 
>  - Class Ci (64/256 swizzles) can be subdivided into:
>    - Class Ci+ (32/256 swizzles) requires splitting into a single
>      scalar component 'i' plus a dvec3 with all remaining components.
>      It applies when the swizzle of the "lone" component in the dvec3
>      (i.e. i XOR 1) is paired up correctly.
> 
>    - Class Ci- (32/256 swizzles) applies in every other case and requires
>      splitting into the scalar component 'i', the scalar component 'i
>      XOR 3' and a dvec2 with all remaining components.
> 
>  - Class Dij (32/256 swizzles) can be subdivided into:
>    - Class Dij+ (8/256 swizzles) requires splitting into XZ+YW and
>      applies when the swizzles for the first and third components and
>      second and fourth components pair up respectively.
>   
>    - Class Dij- (24/256 swizzles) applies otherwise and requires
>      splitting into scalar component 'i', scalar component 'j' and a
>      dvec2 with the two remaining components.
> 
>  - Class E (16/256 swizzles) requires full unrolling into scalar
>    components.
> 
> The following diagram summarizes how classes can be demoted into each
> other (which implies that one is able to represent a superset of the
> swizzles from the other).  When the original class had subscripts the
> union for all possible subscripts is intended in the diagram.  The
> symbol '->' denotes 'can be demoted to' and symbol '~~' denotes 'is
> equivalent to'.
> 
>   A+ -> A- ~~ B -> C- -> D- -> E
>                C+ -^ D+ -^
> 
> This approach could be extended to instructions with multiple sources
> without much difficulty by dividing the instruction iteratively for
> each source -- The caveat is that for multiple-source instructions the
> probability of having to make use of one of the worse classes
> increases, but the situation doesn't change qualitatively and the best-
> (one instruction) and worst-case scenarios (full scalarization) remain
> unchanged.

Yeah, 2-operand instructions are going to lead to more splitting in the end, but
the good thing about this is that we can start by fully scalarizing and then
improve things by supporting more swizzle classes iteratively.
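
One way to picture the incremental approach: encode the demotion diagram as a
chain and fall back to the first class the implementation actually supports.
This is only a sketch with made-up names (DEMOTES_TO, pick_supported). It
ignores the class subscripts and treats the "A- ~~ B" equivalence as a
one-way step, which is a simplification.

```python
# 'can be demoted to' edges taken from the diagram; E is the last resort.
DEMOTES_TO = {
    "A+": "A-",
    "A-": "B",
    "B": "C-",
    "C+": "C-",
    "C-": "D-",
    "D+": "D-",
    "D-": "E",
    "E": None,
}


def pick_supported(split_class, supported):
    """Walk the demotion chain until reaching a class in 'supported'."""
    current = split_class
    while current is not None and current not in supported:
        current = DEMOTES_TO[current]
    if current is None:
        raise ValueError(f"no supported fallback for class {split_class}")
    return current


# A first implementation that only knows how to scalarize fully:
print(pick_supported("A+", {"E"}))
# A later one that also handles the XW+YZ split and the D- split:
print(pick_supported("C+", {"B", "D-", "E"}))
```

Supporting a new class then just means adding it to the supported set, with
everything else still falling through to full scalarization.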

> P.S.: Would be great to be able to use LaTeX markup in bugzilla. :P
