[Mesa-dev] [Bug 100613] Regression in Mesa 17 on s390x (zSystems)

Wed May 24 16:38:11 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=100613

--- Comment #25 from Ben Crocker <bcrocker at redhat.com> ---
(In reply to Ray Strode from comment #24)
...
> Yea, I guess thinking about it more, even if we can get scalar fetch to work
> with sufficient twiddling, that twiddling probably introduces extra
> operations per element, so maybe not a good idea.  I guess we should take
> another crack at attachment 131000 first.

I have been looking at the assembly code for the 3x16 case generated
on both big- and little-endian machines.  This case stems from the
piglit/tests/general/draw-vertices:test_short_vertices function.

First, the LLVM IR I'm focusing on is the IR generated by
lp_build_gather_elem_vec (called by lp_build_gather, called by
lp_build_fetch_rgba_soa, called by fetch_vector...); the IR looks like
this:

  %"lp_build_gather_elem_ptr:72.21" = extractelement <2 x i32> %"lp_build_fetch_rgba_soa:557.", i32 0
  %"lp_build_gather_elem_ptr:75.22" = getelementptr i8, i8* %map_ptr, i32 %"lp_build_gather_elem_ptr:72.21"
  %"lp_build_gather_elem_vec:189.23" = bitcast i8* %"lp_build_gather_elem_ptr:75.22" to i48*
  %"lp_build_gather_elem_vec:190.24" = load i48, i48* %"lp_build_gather_elem_vec:189.23", align 1

where I've used the last parameter to the LLVMBuild* calls (the name
parameter, which is an empty string in the production code) to carry
the function name and the line number, which end up in LLVM's result
names.

If you prefer the IR without the debug info, here it is:

  %21 = extractelement <2 x i32> %"0, i32 0
  %22 = getelementptr i8, i8* %map_ptr, i32 %21
  %23 = bitcast i8* %22 to i48*
  %24 = load i48, i48* %23, align 1

I've modified the data in the draw-vertices program so the values
stick out in the registers; the data I'm using for this example looks
like this:

(gdb) x/6h $r4
0x1029ca00:     0x0015  0x0011  0x1111  0x0015  0x0016  0x2222
                X1      Y1      Z1      X2      Y2      Z2

In general, PPC assembly code is a three-operand code where the syntax
is (usually)
    OPCODE  target, src1, src2

Load/store syntax is
    LOAD    RT, immediate-displacement(RA)
    LOADx   RT, RA, RB    ;; effective addr is RA + RB (LOADx = load indexed)
    LOADux  RT, RA, RB    ;; effective addr is RA + RB (LOADux = load indexed
                          ;; w/ update; RT <- MEM(RA + RB) AND RA <- RA + RB)
    STORE   RS, immediate-displacement(RA)
    STOREx  RS, RA, RB    ;; effective addr is RA + RB (STOREx = store indexed)
    STOREux RS, RA, RB    ;; effective addr is RA + RB (STOREux = store indexed
                          ;; w/ update)

Both the LOAD and the LOADux variants appear below.
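
As a rough illustration of the "load indexed w/ update" semantics
described above, here is a small Python model. It is a sketch only: the
helper names lwzux/lhz are borrowed from the mnemonics, while the
register dictionary, the toy memory image and the explicit byteorder
argument are assumptions made for the example.

```python
import struct

def lwzux(regs, mem, rt, ra, rb, byteorder):
    """Simplified model: RT <- zero-extended word at RA + RB, RA <- RA + RB."""
    ea = regs[ra] + regs[rb]                      # effective address
    word = int.from_bytes(mem[ea:ea + 4], byteorder)
    regs[ra] = ea                                 # the "update" part
    regs[rt] = word

def lhz(regs, mem, rt, disp, ra, byteorder):
    """Simplified model: RT <- zero-extended halfword at RA + disp."""
    ea = regs[ra] + disp
    regs[rt] = int.from_bytes(mem[ea:ea + 2], byteorder)

# Toy memory: 8 bytes of padding, then X, Y, Z as little-endian halfwords.
mem = bytes(8) + struct.pack("<3H", 0x0015, 0x0011, 0x1111)
regs = {"r3": 8, "r4": 0}          # r3 = offset, r4 = base (hypothetical values)

lwzux(regs, mem, "r3", "r4", "r3", "little")
updated_base = regs["r4"]          # r4 now points at the vertex data
lhz(regs, mem, "r4", 4, "r4", "little")
```

Because lwzux leaves the effective address in r4, the following lhz can
reach Z with a plain displacement of 4.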

The Rotate instructions have more complex syntax with four operands.
The target is on the left, as usual; the source operands are:
. source register;
. the (immediate) number of bits to shift;
. an (immediate) mask specification M, with the semantics
  "AND the penultimate result with a mask consisting of bits 0:M = 1,
  M+1:63 = 0"

Please note that these descriptions are over-simplified.

It is important to note that, whether the machine is little-endian or
big-endian, BITS IN A REGISTER ARE NUMBERED FROM THE LEFT.  I.e.,
the most significant bit is bit 0, ... the least significant bit is
bit 63.
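
To make the mask and bit-numbering conventions concrete, here is a
minimal Python sketch of the "rotate left, then clear right" operation
as described above. It is a simplified model, not a full rendering of
the ISA definition; the helper names ibm_bit and rldicr_model are made
up for this example.

```python
MASK64 = (1 << 64) - 1

def ibm_bit(i):
    """Value of bit i when bit 0 is the MOST significant of 64 bits."""
    return 1 << (63 - i)

def rldicr_model(rs, sh, me):
    """Rotate rs left by sh bits, then keep bits 0:me and clear me+1:63."""
    rotated = ((rs << sh) | (rs >> (64 - sh))) & MASK64
    mask = (MASK64 << (63 - me)) & MASK64   # ones in bits 0..me (from the left)
    return rotated & mask

# The two rotates that appear in the listings in this message:
le_result = rldicr_model(0x1111, 32, 31)       # 0x1111.0000.0000
be_result = rldicr_model(0x00150011, 16, 47)   # 0x15.0011.0000
```

With a mask of 0:31 the rotate by 32 behaves like a plain left shift by
32; likewise 16 with 0:47 is a left shift by 16.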

On a (little-endian) PPC64LE machine, the assembly code looks like this:

=> 0x3fffab370404:      lwzux   r3,r4,r3        ;; Load Word w/ zero-extend & update;
                                                ;; r3 <- 0x11.0015 = Y.X
   0x3fffab370408:      lhz     r4,4(r4)        ;; Load halfword w/ zero-extend;
                                                ;; r4 <- 0x1111 = Z
   0x3fffab37040c:      std     r2,24(r1)
   0x3fffab370410:      rldicr  r4,r4,32,31     ;; Rotate Left doubleword immediate & clear right;
                                                ;; imm = 32, mask = 0:31 =>
                                                ;; r4 <- 0x1111.0000.0000 = 0.Z.0.0
   0x3fffab370414:      or      r25,r3,r4       ;; r25 <- 0x1111.0011.0015 = Z.Y.X

So, the operation of loading a 48-bit int corresponds well with
loading the 3-vector of int16's into the 64-bit target register.
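
The little-endian sequence can be replayed in a few lines of Python.
This is only a sketch of the arithmetic (variable names are
illustrative, and the shift stands in for the rldicr), using the test
data shown earlier:

```python
import struct

# X1, Y1, Z1 as they sit in memory on a little-endian machine.
data_le = struct.pack("<3H", 0x0015, 0x0011, 0x1111)

low_word = int.from_bytes(data_le[0:4], "little")   # lwzux: Y.X
z_half = int.from_bytes(data_le[4:6], "little")     # lhz:   Z
r25 = low_word | (z_half << 32)                     # rldicr by 32, then or

# The same value a single 48-bit little-endian load would produce.
i48_le = int.from_bytes(data_le, "little")

# Element i sits at bit offset 16 * i, counting from the least significant end.
fields_le = [(r25 >> (16 * i)) & 0xFFFF for i in range(3)]
```

Here fields_le comes out in X, Y, Z order, which is why the i48 load
lines up with the 3-vector of int16's.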

On a big-endian PPC64 machine, the assembly code looks like this:

0x3fffaace0538:      lwzux   r3,r4,r3        ;; r3 = 0, r4 = 0x10273d80;
                                             ;; r3 <- 0x15.0011, i.e. X.Y
0x3fffaace053c:      lhz     r4,4(r4)        ;; r4 <- 0x1111 = Z
0x3fffaace0540:      rldicr  r3,r3,16,47     ;; Rotate Left doubleword immediate & clear right;
                                             ;; imm = 16, mask = 0:47 => r3 <- 0x15.0011.0000,
                                             ;; i.e. r3 <<= 16 = 0.X.Y.0
...
0x3fffaace0548:      or      r24,r4,r3       ;; r24 <- 0x15.0011.1111, i.e. 0.X.Y.Z

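Note that the fields now sit in the register as X.Y.Z, the reverse of
the Z.Y.X order produced on little-endian.  No single operation--shift,
justification, "zero-extend" or anything else--can get the 16-bit
fields into the proper order for subsequent code.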
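
The same replay for the big-endian sequence, again as a Python sketch
with illustrative names, shows the reversal directly:

```python
import struct

# X1, Y1, Z1 as they sit in memory on a big-endian machine.
data_be = struct.pack(">3H", 0x0015, 0x0011, 0x1111)

high_word = int.from_bytes(data_be[0:4], "big")   # lwzux: X.Y
z_half = int.from_bytes(data_be[4:6], "big")      # lhz:   Z
r24 = (high_word << 16) | z_half                  # rldicr by 16, then or

# The same value a single 48-bit big-endian load would produce.
i48_be = int.from_bytes(data_be, "big")

# Applying the little-endian rule (element i at bit offset 16 * i)
# now walks the fields back to front.
fields_be = [(r24 >> (16 * i)) & 0xFFFF for i in range(3)]
```

Here fields_be comes out as Z, Y, X. A uniform shift of the whole
register moves all three fields together, so it cannot swap X and Z.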

Regarding Ray's specific comment about getting scalar fetch to work
with "sufficient twiddling," I think it's perfectly acceptable to
introduce extra operations, as long as we restrict the extra
operations to the big-endian path.  PPC64 (LE or BE) is fast enough so
that any performance impact will be negligible; S390 is less fast, but
I imagine production machines with more memory than the one we
experimented on here are fast enough.
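
For illustration only, here is one shape the extra big-endian work
could take, written as a Python sketch.  This is NOT the code from any
attachment on this bug; extract_elem and its parameters are
hypothetical names, and the sketch only shows that an endian-dependent
element position is enough to recover X, Y, Z from the packed value.

```python
import struct

def extract_elem(packed, i, n, elem_bits=16, big_endian=False):
    """Return element i of n from a packed scalar fetch (hypothetical helper)."""
    # On big-endian, element 0 lands in the most significant field,
    # so the position is counted from the other end.
    pos = (n - 1 - i) if big_endian else i
    return (packed >> (pos * elem_bits)) & ((1 << elem_bits) - 1)

values = (0x0015, 0x0011, 0x1111)
packed_le = int.from_bytes(struct.pack("<3H", *values), "little")
packed_be = int.from_bytes(struct.pack(">3H", *values), "big")

xyz_le = [extract_elem(packed_le, i, 3) for i in range(3)]
xyz_be = [extract_elem(packed_be, i, 3, big_endian=True) for i in range(3)]
```

Both lists come out in X, Y, Z order; only the big_endian=True path
does the extra position arithmetic.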

And remember that this code is going into the vertex shader program,
where there are already performance hurdles like branches.
