<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW - Regression in Mesa 17 on s390x (zSystems)" href="https://bugs.freedesktop.org/show_bug.cgi?id=100613#c25">Comment # 25</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - Regression in Mesa 17 on s390x (zSystems)" href="https://bugs.freedesktop.org/show_bug.cgi?id=100613">bug 100613</a> from <a class="email" href="mailto:bcrocker@redhat.com" title="Ben Crocker <bcrocker@redhat.com>"> Ben Crocker</a> <pre>(In reply to Ray Strode from <a href="show_bug.cgi?id=100613#c24">comment #24</a>)
...
> Yea, I guess thinking about it more, even if we can get scalar fetch to work
> with sufficient twiddling, that twiddling probably introduces extra
> operations per element, so maybe not a good idea.  I guess we should take
> another crack at <a href="attachment.cgi?id=131000" name="attach_131000" title="patch that didn't help at all">attachment 131000</a> <a href="attachment.cgi?id=131000&action=edit" title="patch that didn't help at all">[details]</a> <a href='page.cgi?id=splinter.html&bug=100613&attachment=131000'>[review]</a> first.

I have been looking at the assembly code for the 3x16 case generated on
both big- and little-endian machines.  This case stems from the
piglit/tests/general/draw-vertices:test_short_vertices function.

First, the LLVM IR I'm focusing on is the IR generated by
lp_build_gather_elem_vec (called by lp_build_gather, called by
lp_build_fetch_rgba_soa, called by fetch_vector...); the IR looks like this:

  %"lp_build_gather_elem_ptr:72.21" = extractelement <2 x i32> %"lp_build_fetch_rgba_soa:557.", i32 0
  %"lp_build_gather_elem_ptr:75.22" = getelementptr i8, i8* %map_ptr, i32 %"lp_build_gather_elem_ptr:72.21"
  %"lp_build_gather_elem_vec:189.23" = bitcast i8* %"lp_build_gather_elem_ptr:75.22" to i48*
  %"lp_build_gather_elem_vec:190.24" = load i48, i48* %"lp_build_gather_elem_vec:189.23", align 1

where I've used that last parameter to the LLVMBuild* calls, the parameter
that appears as a null string in the production code, to contain the function
name and the line number, which end up getting inserted in LLVM's result name.

If you prefer the IR without the debug info, here it is:

  %21 = extractelement <2 x i32> %"0, i32 0
  %22 = getelementptr i8, i8* %map_ptr, i32 %21
  %23 = bitcast i8* %22 to i48*
  %24 = load i48, i48* %23, align 1

I've modified the data in the draw-vertices program so the values stick out
in the registers; the data I'm using for this example looks like this:

(gdb) x/6h $r4
0x1029ca00:     0x0015  0x0011  0x1111  0x0015  0x0016  0x2222
                X1      Y1      Z1      X2      Y2      Z2

In general, PPC assembly code is a three-operand code where the syntax is
(usually)

  OPCODE target, src1, src2

Load/store syntax is

  LOAD    RT, immediate-displacement(RA)
  LOADx   RT, RA, RB  ;; where effective addr is RA + RB (LOADx = load indexed)
  LOADux  RT, RA, RB  ;; where effective addr is RA + RB (LOADux = load indexed
                      ;; w/ update; RT <- MEM(RA + RB) AND RA <- RA + RB)

  STORE   RS, immediate-displacement(RA)
  STOREx  RS, RA, RB  ;; where effective addr is RA + RB (STOREx = store indexed)
  STOREux RS, RA, RB  ;; where effective addr is RA + RB (STOREux = store indexed
                      ;; w/ update)

Both the LOAD and the LOADux variants appear below.

The Rotate instructions have more complex syntax with four operands.
The target is on the left, as usual; the source operands are:
. source register;
. the (immediate) number of bits to shift;
. an (immediate) mask specification M, with the semantics
  "AND the penultimate result with a mask consisting of bits 0:M = 1,
  M+1:63 = 0"

Please note that these descriptions are over-simplified.

It is important to note that, whether the machine is little-endian or
big-endian, BITS IN A REGISTER ARE NUMBERED FROM THE LEFT.  I.e., the most
significant bit is bit 0, ... the least significant bit is bit 63.

On a (little-endian) PPC64LE machine, the assembly code looks like this:

=> 0x3fffab370404:  lwzux   r3,r4,r3      ;; Load Word w/ zero-extend & update; r3 <- 0x11.0015 = Y.X
   0x3fffab370408:  lhz     r4,4(r4)      ;; Load halfword w/ zero-extend; r4 <- 0x1111 = Z
   0x3fffab37040c:  std     r2,24(r1)
   0x3fffab370410:  rldicr  r4,r4,32,31   ;; Rotate Left doubleword immediate & clear right;
                                          ;; imm = 32, mask = 0:31 => r4 <- 0x1111.0000.0000 = 0.Z.0.0
   0x3fffab370414:  or      r25,r3,r4     ;; r25 <- 0x1111.0011.0015 = Z.Y.X

So, the operation of loading a 48-bit int corresponds well with loading the
3-vector of int16's into the 64-bit target register.

On a big-endian PPC64 machine, the assembly code looks like this:

   0x3fffaace0538:  lwzux   r3,r4,r3      ;; r3 = 0, r4 = 0x10273d80; r3 <- 0x15.0011, i.e. X.Y
   0x3fffaace053c:  lhz     r4,4(r4)      ;; r4 <- 0x1111 = Z
   0x3fffaace0540:  rldicr  r3,r3,16,47   ;; Rotate Left doubleword immediate & clear right;
                                          ;; imm = 16, mask = 0:47 => r3 <- 0x15.0011.0000,
                                          ;; i.e. r3 <<= 16 = 0.X.Y.0
   ...
   0x3fffaace0548:  or      r24,r4,r3     ;; r24 <- 0x15.0011.1111, i.e. 0.X.Y.Z

Note that no single operation--shift, justification, "zero-extend" or
anything else--can get the 16-bit fields into the proper order for
subsequent code.

Regarding Ray's specific comment about getting scalar fetch to work with
"sufficient twiddling," I think it's perfectly acceptable to introduce extra
operations, as long as we restrict the extra operations to the big-endian
path.
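
As a cross-check on the two traces above, here is a minimal Python sketch
(not Mesa code) that replays the lwzux / lhz / rldicr / or sequence on the
first vertex of the test data.  The helper names rldicr() and load_i48() are
hypothetical, and the rldicr model is the same over-simplified one described
above: rotate left, then keep bits 0:M counting from the most significant bit.

```python
import struct

MASK64 = (1 << 64) - 1


def rldicr(rs, sh, me):
    """Simplified rldicr: rotate a 64-bit value left by sh bits, then
    keep bits 0..me, where bit 0 is the MOST significant bit."""
    rotated = ((rs << sh) | (rs >> (64 - sh))) & MASK64
    keep = (MASK64 << (63 - me)) & MASK64
    return rotated & keep


def load_i48(halfwords, byteorder):
    """Replay the traced instruction sequence for one 3x16 vertex.
    byteorder is 'little' (PPC64LE trace) or 'big' (PPC64 trace)."""
    fmt = ("<" if byteorder == "little" else ">") + "3H"
    mem = struct.pack(fmt, *halfwords)            # vertex as laid out in memory
    word = int.from_bytes(mem[0:4], byteorder)    # lwzux: first 4 bytes
    half = int.from_bytes(mem[4:6], byteorder)    # lhz:   last 2 bytes
    if byteorder == "little":
        return word | rldicr(half, 32, 31)        # rldicr r4,r4,32,31 ; or
    return rldicr(word, 16, 47) | half            # rldicr r3,r3,16,47 ; or


X, Y, Z = 0x0015, 0x0011, 0x1111
le = load_i48((X, Y, Z), "little")
be = load_i48((X, Y, Z), "big")
print(hex(le))  # 0x111100110015  (Z.Y.X,   matches r25 in the LE trace)
print(hex(be))  # 0x1500111111    (0.X.Y.Z, matches r24 in the BE trace)
```

Both results also equal what a plain 48-bit integer load of the same six
bytes would give in the respective byte order, which is all the i48 load in
the IR asks for; the field order then differs between the two machines.
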
PPC64 (LE or BE) is fast enough so that any performance impact will be
negligible; S390 is less fast, but I imagine production machines with more
memory than the one we experimented on here are fast enough.  And remember
that this code is going into the vertex shader program, where there are
already performance hurdles like branches.</pre> </div> </body> </html>