[Bug 92760] Add FP64 support to the i965 shader backends

Mon May 23 06:30:55 UTC 2016

https://bugs.freedesktop.org/show_bug.cgi?id=92760

--- Comment #86 from Iago Toral <itoral at igalia.com> ---
(In reply to Francisco Jerez from comment #84)
> (In reply to Iago Toral from comment #83)
> >[...]
> >> due to an instruction decompression bug (or feature?) that causes it
> >> to increment the register offset of the second half by a fixed amount
> >> of one GRF regardless of the vertical stride.  The end result is the
> >> instruction behaves as if the source had the logical region
> >> r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
> >> units of whole 64bit components) and we have managed to get the ZW
> >> components to cross the vec2 boundary and read the XY components from
> >> the first oword of the register.
> >
> > For the sake of completeness, one idea that we discussed was to try and exploit
> > this to access components Z/W like:
> >
> > r0.2<0,2,1>.xyzw:df
> >
> > That is, we start the region at offset 16B (i.e. right where Z for the first
> > vertex starts) and then use the vstride=0 trick to read past the first GRF and
> > access the Z/W data for the second vertex in the SIMD4x2 execution.
> >
> > I have to say, however, that I tried to do some quick tests to verify that this
> > actually works and it didn't for me. I think in this case we might be violating
> > a few hardware restrictions because:
> >
> 
> I'm pretty sure I tried that on HSW at some point and it worked fine for
> me, I'll see if I can dig up the exact assembly code I used.

Don't worry, I am not 100% certain that this is the example I tried. Maybe I
was trying uniforms or something slightly different. Let me try to hit the
problem again and if I do then we can check what is going on with it.

> > a) The region (for an execsize of 8) would span over 3 GRFs (which I think is
> > not allowed?)
> 
> Not really, the region would span four rows (because the width has a
> fixed value of two for DF instructions in Align16 mode and as you say
> the execution size is eight), like:
> 
>  r0.2 r0.3
>  r0.2 r0.3
>  r1.2 r1.3
>  r1.2 r1.3
> 
> So the whole region is contained in two GRF registers which is allowed
> by the hardware.

Yeah, you're right, for some reason I was thinking of the same case but with a
vstride of 2.
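
For reference, the row layout above can be reproduced with a small model. This is only a sketch under the assumptions stated in this thread: the width is fixed at two for DF in Align16, sub-register numbers are in DF units, and the second half of the compressed instruction starts exactly one GRF later regardless of the vstride. The function names are made up for illustration.

```python
GRF_BYTES = 32                      # size of one GRF register
DF_BYTES = 8                        # size of one 64-bit (DF) element
DF_PER_GRF = GRF_BYTES // DF_BYTES  # 4 DF elements per GRF


def df_region_rows(reg, subreg, vstride, width=2, hstride=1, exec_size=8):
    """Return, for each region row, the (grf, df_subreg) pairs it reads.

    Assumed model: rows within one half advance by vstride, and the
    second half restarts one GRF later no matter what vstride is.
    """
    n_rows = exec_size // width
    rows_per_half = n_rows // 2
    rows = []
    for row in range(n_rows):
        half, row_in_half = divmod(row, rows_per_half)
        first = subreg + row_in_half * vstride
        elems = [first + i * hstride for i in range(width)]
        # Elements past the end of a GRF spill into the next one.
        rows.append([(reg + half + e // DF_PER_GRF, e % DF_PER_GRF)
                     for e in elems])
    return rows


def grfs_touched(rows):
    """Sorted list of GRF numbers read by any row."""
    return sorted({grf for row in rows for grf, _ in row})


# r0.2<0,2,1>:df with an execution size of 8
for row in df_region_rows(0, 2, vstride=0):
    print(" ".join(f"r{grf}.{sub}" for grf, sub in row))
```

With vstride=0 the model stays within r0 and r1. Passing vstride=2 instead makes the same model touch three GRFs.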

> > b) The first and last GRFs covered would select less DF elements than the
> > second. I think I read something in the docs that required that regions
> > selected "homogeneous" parts of the registers selected.
> >
> 
> The first GRF (r0) would be read by four channels and the second GRF
> (r1) by another four, so the hardware should be happy AFAICT. ;)

Yep.

> >>  If vector splitting is applied
> >> wisely in addition to this, we effectively gain one degree of freedom
> >> and can now represent a large portion of the dvec4 swizzle space by
> >> splitting into two instructions only (close to 70% of all swizzle
> >> combinations).  A particularly fruitful way to split (works for
> >> roughly 55% of the swizzle space) involves dividing the XW and YZ
> >> components into separate instructions, because on the one hand it
> >> ensures full orthogonality between the swizzle of each component and
> >> on the other hand it allows shifting the second component in each pair
> >> with respect to the first by 16B freely using the vstride trick, and
> >
> > To clarify this last sentence a bit further, the idea here is that with an
> > execsize of 8 we use regions with a vstride of 2 (so 4 regions in total, each
> 
> VStride doesn't necessarily have to be 2, and the fact that you can
> control the vstride independently from the other regioning parameters
> triples the number of different swizzles you can represent using
> this trick.  E.g. if you want both components of the XW (or YZ)
> instruction to read from the same oword of the input (either the first
> or the second one) you set vstride to zero and then pick the appropriate
> swizzle for each component, which is always possible because the
> swizzles for the X and W components of the instruction (or respectively
> Y and Z) are fully independent.
> 
> > one 16B large). With XW and YZ swizzles each component in the swizzle falls in
> > a separate 16B region and we select each of the components using a different
> > 32-bit swizzle. For example, with XW, for the first vertex in the SIMD4x2
> > execution, component X falls in the first 16B region and W in the second (for
> > the second vertex they fall in the 3rd and 4th respectively). We select X in the
> > first region using a xy 32-bit swizzle and we select W in the second region by
> > using a zw 32-bit swizzle.
> > This means that we could implement something like:
> >
> > mov(8) r0.xyzw:df r2.xyzw:df
> >
> 
> That's the identity swizzle you could implement as-is ;).  Or is it just
> an example for the sake of explanation?

Yeah, I just wanted to make an example to showcase how we would split and use
vstride.

> > By splitting it in two like this:
> >
> > mov(8) r0.xw:df r2.xw:df
> > mov(8) r0.yz:df r2.yz:df
> >
> > And then generating:
> >
> > mov(8) r0<1>.xw r2<2,2,1>.xyzw
> > mov(8) r0<1>.yz r2<2,2,1>.zwxy
> >
> 
> The swizzles of the second instruction seem backwards, they would have
> to be:
> 
> | mov(8) r0<1>.yz r2<2,2,1>.xyzw
> 
> for the Y component of the instruction to read r2.1 (the Y component of
> the original dvec4) and the Z component of the instruction to read r2.2
> (the Z component of the vector).

Oops, yes. Thanks for pointing this out! :)

>  The assembly you have written would be
> equivalent to the logical instruction (with whole-DF swizzle notation):
> 
> | mov(8) r0.xyzw:df r2.xxww:df
> 
> >> at the same time it sidesteps the writemask limitation (for
> >> comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
> >> properties at the same time).  In cases where the vstride trick
> >> doesn't work (e.g. BDW) SIMD splitting would have to be applied
> >> afterwards to unroll the instruction into a pair of SIMD4 instructions
> >> with vstride=0 and so that the base register of the second one is
> >> shifted by one GRF from the first one -- The end result is still
> >> strictly better than scalarizing the original instruction completely
> >> in most cases since the latter requires four compressed instructions
> >> while the proposed approach generates the same number of uncompressed
> >> instructions.
> >
> > Are we still interested in enabling the vec4 backend on BDW+ now that these
> > gens are fully scalar?
> >
> 
> I guess it doesn't matter, I wouldn't worry too much about BDW+ right
> now, but I think we can support it almost for free in terms of
> implementation complexity (because you'll need a SIMD lowering pass for
> other reasons anyway), so it should still be possible to make them go
> back to vector in the future if it happens to be useful (e.g. where
> register pressure kills the scalar back-end).

Right.

> >> The following table summarizes how the splitting could be carried out
> >> for each swizzle combination.  Each row corresponds to some set of
> >> combinations of the first two swizzle components and each column
> >> corresponds to some set of combinations of the last two swizzle
> >> components.  'v' is used as a shorthand for the set of DF swizzles
> >> that read from the first oword of a vec4 (i.e. X or Y) while '^' is
> >> used for the second oword (i.e. Z or W).  The letters indicate the
> >> kind of splitting that needs to be carried out, from A which has the
> >> highest likelihood to be supported natively as a single instruction,
> >> to E which is the worst and needs to be fully scalarized in all cases.
> >> The meaning of the class subscripts is explained below together with
> >> the description of each splitting class.
> >> 
> >>       vv  v^   ^v   ^^
> >>   vv  A   B    B    A
> >>   v^  Cy  Dyz  B    B
> >>   ^v  Cx  B    Dxw  B
> >>   ^^  E   Cz   Cw   A
> >> 
> >> E.g. swizzle XZYX reads the first oword for all components except the
> >> second which reads the second oword, so it would belong to the set
> >> v^vv, which corresponds to splitting class Cy.  Because lower
> >> (i.e. worse) splitting classes are typically more general than higher
> >> ones (with some exceptions discussed below), it would be allowable for
> >> an implementation to demote swizzle combinations to a splitting class
> >> worse than the one shown on the table in order to avoid implementing
> >> certain splitting classes -- E.g. it would be valid for an
> >> implementation to ignore the table above altogether and implement
> >> splitting class E exclusively.
> >> 
> >>  - Class A (48/256 swizzles) can be subdivided into:
> >>    - Class A+ (12/256 swizzles) which can always be supported
> >>      natively.  It applies for swizzles such that the first and third
> >>      components pair up correctly and the second and fourth components
> >>      pair up as well -- To be more precise, two swizzle components i
> >>      and j are said to "pair up" correctly if they are equal modulo
> >>      two, i.e. if they refer to the same subcomponent within the oword
> >>      they refer to respectively -- E.g. x pairs up with z and y pairs
> >>      up with w.
> >
> > As far as I understand, some of these don't need the vstride trick (like XYZW,
> > which we can represent with a region like <2,2,1>.xyzw:df) while others do
> > (like XYXY, which would need something like <0,2,1>.xyzw:df)
> >
> 
> Yeah, it depends on whether the second pair of vector components of the
> instruction map to the same oword of the input (in which case you need
> the trick) or a different one (in which case you don't).
> 
> >>    - Class A- (36/256 swizzles) which applies in every other case and
> >>      requires splitting into XW+YZ components.
> >
> > For example, XXYY. In this case we would split it in XX and YY. The former
> > would be represented using a <2,2,1>.xyxy region  (which actually selects XXZZ)
> > and the latter using <2,2,1>.zwzw. (which actually selects YYWW).
> >
> 
> Hm, the example XXYY does belong to class A-, but (like everything else
> in this class) you would have to split the instruction into XW+YZ
> components and then emit code like:
> 
> | mov(8) r0.xw:df r2<0,2,1>.xyzw:df
> | mov(8) r0.yz:df r2<0,2,1>.zwxy:df

Oh yes, absolutely.
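
To double-check swizzle corrections like the ones in this exchange, a tiny decoder helps. The sketch below is not hardware documentation, only a model inferred from the examples in this thread: for one vertex, the X/Y channels read region row 0 and the Z/W channels read row 1, each row is one 16B oword, and each channel consumes one pair of 32-bit swizzle letters ("xy" selects the low DF of the row, "zw" the high one). All names are hypothetical.

```python
COMPONENTS = "xyzw"


def decode_df_source(vstride, swizzle32, writemask, base_df=0):
    """Map each enabled DF channel to the dvec4 component it would read.

    Assumed model (single vertex, DF width fixed at 2 in Align16):
      - channels x,y use region row 0, channels z,w use row 1;
      - row r starts at DF element base_df + r * vstride;
      - the first channel of a row uses 32-bit swizzle letters 0-1 and
        the second uses letters 2-3.
    """
    result = {}
    for chan in writemask:
        row, slot = divmod(COMPONENTS.index(chan), 2)
        pair = swizzle32[2 * slot:2 * slot + 2]
        if pair not in ("xy", "zw"):
            raise ValueError(f"32-bit pair {pair!r} does not select a whole DF")
        src = base_df + row * vstride + (0 if pair == "xy" else 1)
        if src >= len(COMPONENTS):
            raise ValueError("region reads past the end of the dvec4")
        result[chan] = COMPONENTS[src]
    return result


def logical_swizzle(*parts):
    """Merge per-instruction results into one whole-DF swizzle string."""
    merged = {}
    for part in parts:
        merged.update(part)
    return "".join(merged[c] for c in COMPONENTS)


# XW+YZ split with the corrected YZ swizzle: identity
print(logical_swizzle(decode_df_source(2, "xyzw", "xw"),
                      decode_df_source(2, "xyzw", "yz")))  # xyzw

# Same split with .zwxy on the YZ half
print(logical_swizzle(decode_df_source(2, "xyzw", "xw"),
                      decode_df_source(2, "zwxy", "yz")))  # xxww

# Class A- example XXYY with vstride=0
print(logical_swizzle(decode_df_source(0, "xyzw", "xw"),
                      decode_df_source(0, "zwxy", "yz")))  # xxyy
```

The base_df parameter is there to model a region that starts at the second oword (e.g. r0.2), which together with vstride=0 at the first oword and vstride=2 gives the three choices mentioned earlier.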

> >>[...]
> >> This approach could be extended to instructions with multiple sources
> >> without much difficulty by dividing the instruction iteratively for
> >> each source -- The caveat is that for multiple-source instructions the
> >> probability of having to make use of one of the worse classes
> >> increases, but the situation doesn't change qualitatively and the best-
> >> (one instruction) and worst-case scenarios (full scalarization) remain
> >> unchanged.
> >
> > Yeah, 2-operand instructions are going to lead to more splitting in the end but
> > the good thing about this is that we can start by fully scalarizing and then
> > improve things by supporting more swizzle classes iteratively.
> >
> 
> Agreed. :)
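
As a starting point for that incremental approach, the table lookup and the demotion rule could be sketched as below. This only encodes the table and the "pair up" definition quoted in this comment; the function names and the `implemented` parameter are hypothetical, and any class that is not implemented yet simply falls back to E (full scalarization), which the proposal says is always valid.

```python
from itertools import product

# Rows: first two swizzle components; columns: last two.
# 'v' = first oword (X or Y), '^' = second oword (Z or W).
TABLE = {
    "vv": {"vv": "A",  "v^": "B",   "^v": "B",   "^^": "A"},
    "v^": {"vv": "Cy", "v^": "Dyz", "^v": "B",   "^^": "B"},
    "^v": {"vv": "Cx", "v^": "B",   "^v": "Dxw", "^^": "B"},
    "^^": {"vv": "E",  "v^": "Cz",  "^v": "Cw",  "^^": "A"},
}


def oword_key(pair):
    """Translate two swizzle letters into the table's v/^ notation."""
    return "".join("v" if c in "xy" else "^" for c in pair)


def pairs_up(a, b):
    """Two components pair up if they are equal modulo two."""
    return "xyzw".index(a) % 2 == "xyzw".index(b) % 2


def splitting_class(swizzle):
    """Return the splitting class of a 4-letter DF swizzle."""
    swz = swizzle.lower()
    cls = TABLE[oword_key(swz[:2])][oword_key(swz[2:])]
    if cls == "A":
        plus = pairs_up(swz[0], swz[2]) and pairs_up(swz[1], swz[3])
        return "A+" if plus else "A-"
    return cls


def effective_class(swizzle, implemented=frozenset({"E"})):
    """Demote to E whenever the natural class is not implemented yet."""
    cls = splitting_class(swizzle)
    return cls if cls in implemented else "E"


print(splitting_class("XZYX"))  # Cy
print(splitting_class("XXYY"))  # A-
print(effective_class("XZYX"))  # E
counts = {}
for letters in product("xyzw", repeat=4):
    cls = splitting_class("".join(letters))
    counts[cls] = counts.get(cls, 0) + 1
print(counts["A+"], counts["A-"])  # 12 36
```

Supporting another class then only means adding it to the `implemented` set once the corresponding splitting code exists.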
