<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - Add FP64 support to the i965 shader backends"
href="https://bugs.freedesktop.org/show_bug.cgi?id=92760#c82">Comment # 82</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - Add FP64 support to the i965 shader backends"
href="https://bugs.freedesktop.org/show_bug.cgi?id=92760">bug 92760</a>
from <span class="vcard"><a class="email" href="mailto:currojerez@riseup.net" title="Francisco Jerez <currojerez@riseup.net>"> <span class="fn">Francisco Jerez</span></a>
</span></b>
<pre>I've been running some experiments on vec4 hardware and working
through some combinatorics to come up with a plan to address the
limitations of Align16 double-precision instructions we've been
discussing. Part
of the following is optional and could be left out for a first
approximation without sacrificing functional correctness -- The point
is to have a reasonably workable initial implementation that still
allows some room for improvement in the future.
The high-level idea is to have the front-end feed the back-end with
(mostly) unlowered dvec4 instructions that would be processed by the
back-end optimizer as usual, without considering the extremely
annoying limitations of the hardware. After the back-end optimization
loop some additional (fully self-contained) machinery would lower
double-precision instructions into a form the EUs would be able to
deal with. The bulk of the required infrastructure would be:
- A swizzle lowering pass that would eliminate unsupported source
regions by breaking SIMD vector instructions into chunks so that
each individual instruction would write a subset of the destination
components and only use legal swizzle combinations (in cases where
there is source/destination overlap some minor copying would be
necessary) -- The execution of each generated vector chunk could
potentially be pipelined if the destination dependency controls are
set up correctly. More details on how the splitting could be done
below.
- A writemask lowering pass that would run after swizzle lowering and
split the resulting instructions into smaller chunks to clean up
illegal writemask combinations (XY and ZW). The swizzle lowering
approach proposed below attempts to minimize the amount of
additional division that this pass would need to do by emitting
legal writemask combinations where possible -- In fact the most
naive swizzle lowering algorithm conceivable that simply expands
the instruction into scalar components would render this pass
unnecessary so this could be left out completely in the initial
implementation.
- A SIMD width lowering pass that would divide SIMD4x2 instructions
that are too wide for the hardware to execute or hit some sort of
restriction that can be eliminated by dividing the instruction into
independent SIMD4 channels. The SIMD lowering pass from the scalar
back-end could be taken as a starting point.
The following trick allows representing many swizzles that cross vec2
boundaries by exploiting some of the hardware's stupidity. Assume you
have a compressed SIMD4x2 DF instruction and a given source has
vstride set to zero and identity swizzle. For each channel the
hardware would read the following locations from the register file, at
least ideally:
    x0 <- r0.0:df
    y0 <- r0.1:df
    z0 <- r0.0:df
    w0 <- r0.1:df
    x1 <- r0.0:df
    y1 <- r0.1:df
    z1 <- r0.0:df
    w1 <- r0.1:df
assuming that r0.0 is the base register. This is actually what
happens on BDW+ and doesn't help us much except for representing
uniforms. Pre-Gen8 hardware however behaves like:
    x0 <- r0.0:df
    y0 <- r0.1:df
    z0 <- r0.0:df
    w0 <- r0.1:df
    x1 <- r1.0:df
    y1 <- r1.1:df
    z1 <- r1.0:df
    w1 <- r1.1:df
due to an instruction decompression bug (or feature?) that causes it
to increment the register offset of the second half by a fixed amount
of one GRF regardless of the vertical stride. The end result is the
instruction behaves as if the source had the logical region
r0.0<4>.xyxy (where the swizzles are actual logical swizzles, i.e. in
units of whole 64bit components) and we have managed to get the ZW
components to cross the vec2 boundary and read the XY components from
the first oword of the register. If vector splitting is applied
wisely in addition to this, we effectively gain one degree of freedom
and can now represent a large portion of the dvec4 swizzle space by
splitting into two instructions only (close to 70% of all swizzle
combinations). A particularly fruitful way to split (works for
roughly 55% of the swizzle space) involves dividing the XW and YZ
components into separate instructions, because on the one hand it
ensures full orthogonality between the swizzle of each component and
on the other hand it allows shifting the second component in each pair
with respect to the first by 16B freely using the vstride trick, and
at the same time it sidesteps the writemask limitation (for
comparison, splitting into XY+ZW or XZ+YW wouldn't have all of these
properties at the same time). In cases where the vstride trick
doesn't work (e.g. BDW) SIMD splitting would have to be applied
afterwards to unroll the instruction into a pair of SIMD4 instructions
with vstride=0 and so that the base register of the second one is
shifted by one GRF from the first one -- The end result is still
strictly better than scalarizing the original instruction completely
in most cases since the latter requires four compressed instructions
while the proposed approach generates the same number of uncompressed
instructions.
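
The two read patterns listed above can be modelled in a few lines of
Python. This is only a sketch of the behaviour described in this
comment, not of any real driver code: the function name and its
parameters are made up for illustration.

```python
def df_source_reads(base_reg, pre_gen8):
    """Model the DF locations read by a compressed SIMD4x2 source that
    has vstride=0 and an identity swizzle (hypothetical helper)."""
    reads = []
    for half in range(2):  # the two SIMD4 halves of the instruction
        # Pre-Gen8 decompression bumps the second half by one GRF
        # regardless of vstride; Gen8+ stays on the base register.
        reg = base_reg + (half if pre_gen8 else 0)
        for comp, name in enumerate("xyzw"):
            # With vstride=0 the z/w channels wrap back onto the first
            # oword, so only DF subregisters 0 and 1 are ever read.
            reads.append((f"{name}{half}", f"r{reg}.{comp % 2}:df"))
    return reads
```

Calling df_source_reads(0, pre_gen8=True) yields the second listing,
where the x1..w1 channels come from r1 rather than r0.
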
The following table summarizes how the splitting could be carried out
for each swizzle combination. Each row corresponds to some set of
combinations of the first two swizzle components and each column
corresponds to some set of combinations of the last two swizzle
components. 'v' is used as a shorthand for the set of DF swizzles
that read from the first oword of a vec4 (i.e. X or Y) while '^' is
used for the second oword (i.e. Z or W). The letters indicate the
kind of splitting that needs to be carried out, from A which has the
highest likelihood to be supported natively as a single instruction,
to E which is the worst and needs to be fully scalarized in all cases.
The meaning of the class subscripts is explained below together with
the description of each splitting class.

          vv    v^    ^v    ^^
    vv    A     B     B     A
    v^    Cy    Dyz   B     B
    ^v    Cx    B     Dxw   B
    ^^    E     Cz    Cw    A

E.g. swizzle XZYX reads the first oword for all components except the
second which reads the second oword, so it would belong to the set
v^vv, which corresponds to splitting class Cy. Because lower
(i.e. worse) splitting classes are typically more general than higher
ones (with some exceptions discussed below), it would be allowable for
an implementation to demote swizzle combinations to a splitting class
worse than the one shown on the table in order to avoid implementing
certain splitting classes -- E.g. it would be valid for an
implementation to ignore the table above altogether and implement
splitting class E exclusively.
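
For illustration, the table lookup could be written as below. This is
a sketch only: SPLIT_CLASS, oword_pattern, splitting_class and
class_counts are hypothetical names, and the dictionary is just a
transcription of the table above.

```python
from itertools import product

# Rows: first two swizzle components; columns: last two.
# 'v' = reads the first oword (X or Y), '^' = second oword (Z or W).
SPLIT_CLASS = {
    "vv": {"vv": "A",  "v^": "B",   "^v": "B",   "^^": "A"},
    "v^": {"vv": "Cy", "v^": "Dyz", "^v": "B",   "^^": "B"},
    "^v": {"vv": "Cx", "v^": "B",   "^v": "Dxw", "^^": "B"},
    "^^": {"vv": "E",  "v^": "Cz",  "^v": "Cw",  "^^": "A"},
}

def oword_pattern(swizzle):
    """Map e.g. 'XZYX' to 'v^vv'."""
    return "".join("v" if c in "XY" else "^" for c in swizzle.upper())

def splitting_class(swizzle):
    """Look up the splitting class of a four-component DF swizzle."""
    pattern = oword_pattern(swizzle)
    return SPLIT_CLASS[pattern[:2]][pattern[2:]]

def class_counts():
    """Count how many of the 256 swizzles land in each class letter."""
    counts = {}
    for swz in product("XYZW", repeat=4):
        letter = splitting_class("".join(swz))[0]
        counts[letter] = counts.get(letter, 0) + 1
    return counts
```

splitting_class("XZYX") gives "Cy" as in the example above, and since
every table cell covers 16 swizzles, class_counts() gives the number
of swizzles per class letter.
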
- Class A (48/256 swizzles) can be subdivided into:
  - Class A+ (12/256 swizzles) which can always be supported
    natively. It applies for swizzles such that the first and third
    components pair up correctly and the second and fourth components
    pair up as well -- To be more precise, two swizzle components i
    and j are said to "pair up" correctly if they are equal modulo
    two, i.e. if they refer to the same subcomponent within the oword
    they refer to respectively -- E.g. x pairs up with z and y pairs
    up with w.
  - Class A- (36/256 swizzles) which applies in every other case and
    requires splitting into XW+YZ components.
- Class B (96/256 swizzles) can be handled by splitting into XW+YZ
  regardless of the swizzles.
- Class Ci (64/256 swizzles) can be subdivided into:
  - Class Ci+ (32/256 swizzles) requires splitting into a single
    scalar component 'i' plus a dvec3 with all remaining components.
    It applies when the swizzle of the "lone" component in the dvec3
    (i.e. i XOR 1) is paired up correctly.
  - Class Ci- (32/256 swizzles) applies in every other case and requires
    splitting into the scalar component 'i', the scalar component 'i
    XOR 3' and a dvec2 with all remaining components.
- Class Dij (32/256 swizzles) can be subdivided into:
  - Class Dij+ (8/256 swizzles) requires splitting into XZ+YW and
    applies when the swizzles for the first and third components and
    second and fourth components pair up respectively.
  - Class Dij- (24/256 swizzles) applies otherwise and requires
    splitting into scalar component 'i', scalar component 'j' and a
    dvec2 with the two remaining components.
- Class E (16/256 swizzles) requires full unrolling into scalar
  components.
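
The A+/A- and D+/D- counts can be cross-checked by brute force using
the "pair up" rule as stated (equal modulo two). The sketch below
covers only those two classes, since their condition is spelled out
unambiguously; the names are hypothetical and the pattern sets are
copied from the A and D cells of the table.

```python
from itertools import product

A_PATTERNS = {"vvvv", "vv^^", "^^^^"}  # cells marked A in the table
D_PATTERNS = {"v^v^", "^v^v"}          # cells marked Dyz and Dxw

def pairs_up(i, j):
    # Components are numbered X=0, Y=1, Z=2, W=3; two of them pair up
    # when they select the same 64-bit slot within their oword.
    return i % 2 == j % 2

def subclass_counts():
    """Count the +/- subclasses of A and D over all 256 swizzles."""
    counts = {"A+": 0, "A-": 0, "D+": 0, "D-": 0}
    for swz in product(range(4), repeat=4):
        pattern = "".join("v" if c in (0, 1) else "^" for c in swz)
        # First/third and second/fourth swizzle components must pair up.
        paired = pairs_up(swz[0], swz[2]) and pairs_up(swz[1], swz[3])
        if pattern in A_PATTERNS:
            counts["A+" if paired else "A-"] += 1
        elif pattern in D_PATTERNS:
            counts["D+" if paired else "D-"] += 1
    return counts
```

subclass_counts() gives 12 and 36 for A+ and A-, and 8 and 24 for D+
and D-, matching the figures quoted in the list.
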
The following diagram summarizes how classes can be demoted into each
other (which implies that one is able to represent a superset of the
swizzles from the other). When the original class had subscripts the
union for all possible subscripts is intended in the diagram. The
symbol '->' denotes 'can be demoted to' and symbol '~~' denotes 'is
equivalent to'.

    A+ -> A- ~~ B -> C- -> D- -> E
                 C+ -^ D+ -^

This approach could be extended to instructions with multiple sources
without much difficulty by dividing the instruction iteratively for
each source -- The caveat is that for multiple-source instructions the
probability of having to make use of one of the worse classes
increases, but the situation doesn't change qualitatively and the best-
(one instruction) and worst-case scenarios (full scalarization) remain
unchanged.
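
One way to picture the iterative division is as a repeated refinement
of the set of destination-component groups. The sketch below rests on
two assumptions that are not established above: that the split needed
by each source can be written as a partition of XYZW, and that
narrowing a group never makes it illegal for a source that has already
been handled. Both function names are hypothetical.

```python
def refine_split(groups_a, groups_b):
    """Common refinement of two partitions of the destination
    components, e.g. ['XW', 'YZ'] with ['XZ', 'YW']."""
    refined = []
    for a in groups_a:
        for b in groups_b:
            common = "".join(c for c in a if c in b)
            if common:
                refined.append(common)
    return refined

def split_for_sources(per_source_groups):
    """Start from one full dvec4 instruction and divide it once for
    each source's required grouping."""
    groups = ["XYZW"]
    for source_groups in per_source_groups:
        groups = refine_split(groups, source_groups)
    return groups
```

Two sources that both want XW+YZ still give two instructions, whereas
one source wanting XW+YZ combined with another wanting XZ+YW ends up
fully scalarized, which illustrates why the worse classes become more
likely as sources are added.
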
P.S.: Would be great to be able to use LaTeX markup in bugzilla. :P</pre>
</div>
</p>
</body>
</html>