[Mesa-dev] llvmpipe broken on Skylake Pentium (LP_NATIVE_VECTOR_WIDTH=128)

Mon Oct 12 12:57:02 PDT 2015

On 12.10.2015 at 21:27, Adam Jackson wrote:
> I'm having some difficulty getting llvmpipe working on a Skylake
> Pentium, which has the charming property of not having AVX support at
> all (Skylake Cores have AVX2, and Xeons have AVX512, but Pentium seems
> to be the new way of spelling Celeron).  Currently I'm trying this with
> llvm 3.6.2 and Mesa 10.6.5, but llvm 3.7 doesn't seem to be any better.
> 
> The error I'm getting is:
> 
> $ DISPLAY=:2 LIBGL_DRIVERS_PATH=`pwd`/lib64/gallium LP_NATIVE_VECTOR_WIDTH=128 /usr/lib64/mesa/gloss
> LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse41.pblendvb
> 
> (Setting LP_NATIVE_VECTOR_WIDTH like that seems to be effective at
> triggering this on Skylake Core, but in the name of paranoia I'm
> emulating a Pentium by patching kvm to mask off the AVX bits of
> cpuflags: https://ajax.fedorapeople.org/qemu-pseudoskl.patch )
> 
> That does indeed seem to be the pblendvb intrinsic from
> lp_build_select(), from lp_build_depth_stencil_test(), and at that
> point I get (against Xvfb, giving me a z32 depth format):
> 
> (gdb) p bld->type
> $1 = {floating = 0, fixed = 0, sign = 0, norm = 0, width = 32, length = 4}
> 
> There are several other paths through lp_build_select() that look like
> they could work, but don't.  If I turn on the if (0)'d vector select
> path, I get something like:
> 
> LLVM ERROR: Cannot select: 0xc23df0: v4i32 = X86ISD::SMAX 0xc1caf0, 0xc25020 [ORD=103] [ID=189]
>   0xc1caf0: v4i32 = X86ISD::VSRL 0xc3a810, 0xc269e0 [ORD=102] [ID=179]
>     0xc3a810: v4i32 = bitcast 0xc096b0 [ORD=95] [ID=150]
>       0xc096b0: v2i64,ch = X86ISD::VZEXT_LOAD 0xc28160, 0xc1f750<LD8[%sunkaddr145](align=4)> [ORD=95] [ID=140]
>         0xc1f750: i64 = add 0xc22560, 0xc21220 [ORD=83] [ID=102]
>           0xc22560: i64,ch = CopyFromReg 0xbca1c0, 0xc21880 [ORD=82] [ID=76]
>             0xc21880: i64 = Register %vreg26 [ID=28]
>           0xc21220: i64 = Constant<232> [ID=29]
>     0xc269e0: v4i32 = X86ISD::VZEXT_MOVL 0xc09c00 [ORD=102] [ID=169]
>       0xc09c00: v4i32 = scalar_to_vector 0xc24be0 [ORD=102] [ID=160]
>         0xc24be0: i32 = truncate 0xc22de0 [ORD=99] [ID=151]
>           0xc22de0: i64,ch = load 0xc28160, 0xc22450, 0xc0a9f0<LD4[%sunkaddr154], sext from i32> [ORD=104] [ID=141]
>             0xc22450: i64 = add 0xc22560, 0xc09f30 [ORD=97] [ID=100]
>               0xc22560: i64,ch = CopyFromReg 0xbca1c0, 0xc21880 [ORD=82] [ID=76]
>                 0xc21880: i64 = Register %vreg26 [ID=28]
>               0xc09f30: i64 = Constant<244> [ID=31]
>             0xc0a9f0: i64 = undef [ID=4]
>   0xc25020: v4i32 = bitcast 0xc06ec0 [ORD=9] [ID=119]
>     0xc06ec0: v2i64,ch = load 0xbca1c0, 0xc05fc0, 0xc0a9f0<LD16[ConstantPool]> [ORD=9] [ID=104]
>       0xc05fc0: i64 = X86ISD::Wrapper 0xc21550 [ID=78]
>         0xc21550: i64 = TargetConstantPool<<4 x i32> <i32 1, i32 1, i32 1, i32 1>> 0 [ID=47]
>       0xc0a9f0: i64 = undef [ID=4]
> In function: fs57_variant0_partial
> 
> I get the same result for either the BuildTrunc or BuildICmp paths
> through the if (0) at the top, and I also get the same result if I just
> fall through to lp_build_select_bitwise().
> 
> This doesn't seem to be the only breakage.  lp_test_format dies with:
> 
> LLVM ERROR: Cannot select: 0x231e090: v4i32 = X86ISD::UMIN 0x2346b70, 0x231bf80 [ORD=5] [ID=30]
>   0x2346b70: v4i32 = bitcast 0x2346840 [ORD=3] [ID=29]
>     0x2346840: v2i64 = scalar_to_vector 0x2346c80 [ORD=3] [ID=27]
>       0x2346c80: i64,ch = load 0x236abd0, 0x231b3d0, 0x231ba30<LD8[%4](align=4)> [ORD=3] [ID=24]
>         0x231b3d0: i64,ch = CopyFromReg 0x236abd0, 0x231b2c0 [ORD=1] [ID=20]
>           0x231b2c0: i64 = Register %vreg1 [ID=2]
>         0x231ba30: i64 = undef [ID=4]
>   0x231bf80: v4i32 = bitcast 0x231be70 [ORD=5] [ID=28]
>     0x231be70: v2i64,ch = load 0x236abd0, 0x231d920, 0x231ba30<LD16[ConstantPool]> [ORD=5] [ID=25]
>       0x231d920: i64 = X86ISD::Wrapper 0x231e3c0 [ID=22]
>         0x231e3c0: i64 = TargetConstantPool<<4 x i32> <i32 1, i32 1, i32 1, i32 1>> 0 [ID=14]
>       0x231ba30: i64 = undef [ID=4]
> In function: fetch_r32g32_uscaled_unorm8
> 
> lp_test_arit dies with:
> 
> LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse41.round.ps
> 
> lp_test_conv dies with:
> 
> LLVM ERROR: Cannot select: 0xd79a20: v4i32 = X86ISD::SMIN 0xd796f0, 0xd794d0 [ORD=8] [ID=25]
>   0xd796f0: v4i32 = X86ISD::SMAX 0xd79f70, 0xd696b0 [ORD=6] [ID=23]
>     0xd79f70: v4i32 = bitcast 0xd698d0 [ORD=4] [ID=21]
>       0xd698d0: v2i64,ch = load 0xd43920, 0xd69380, 0xd69050<LD16[%3]> [ORD=4] [ID=17]
>         0xd69380: i64 = add 0xd68c10, 0xd69270 [ORD=3] [ID=13]
>           0xd68c10: i64,ch = CopyFromReg 0xd43920, 0xd68b00 [ORD=1] [ID=8]
>             0xd68b00: i64 = Register %vreg0 [ID=1]
>           0xd69270: i64 = Constant<16> [ID=4]
>         0xd69050: i64 = undef [ID=3]
>     0xd696b0: v4i32 = bitcast 0xd695a0 [ORD=5] [ID=18]
>       0xd695a0: v2i64,ch = load 0xd43920, 0xd793c0, 0xd69050<LD16[ConstantPool]> [ORD=5] [ID=14]
>         0xd793c0: i64 = X86ISD::Wrapper 0xd7a2a0 [ID=10]
>           0xd7a2a0: i64 = TargetConstantPool<<4 x i32> <i32 -32768, i32 -32768, i32 -32768, i32 -32768>> 0 [ID=6]
>         0xd69050: i64 = undef [ID=3]
>   0xd794d0: v4i32 = bitcast 0xd795e0 [ORD=7] [ID=19]
>     0xd795e0: v2i64,ch = load 0xd43920, 0xd79910, 0xd69050<LD16[ConstantPool]> [ORD=7] [ID=15]
>       0xd79910: i64 = X86ISD::Wrapper 0xd7a190 [ID=11]
>         0xd7a190: i64 = TargetConstantPool<<4 x i32> <i32 32767, i32 32767, i32 32767, i32 32767>> 0 [ID=7]
>       0xd69050: i64 = undef [ID=3]
> In function: test
> 
> All of the above lp_test_* failures can be triggered by setting
> LP_NATIVE_VECTOR_WIDTH when running make check so I don't think my kvm
> patch is to blame.
> 
> I'm a little out of my depth trying to track this down, any ideas?
> 

Note that the vector width doesn't really control whether AVX is used,
since that is a decision LLVM makes on its own. We do set it manually
when we detect AVX ourselves, but with newer LLVM versions LLVM will use
AVX anyway, even if we don't, as long as it thinks the CPU supports it.
It would be possible to override this, but how to do so has changed
significantly between LLVM versions.

You could take a look at lp_build_create_jit_compiler_for_module(), in
particular builder.setMCPU(MCPU) and the related code. I believe that if
llvm::sys::getHostCPUName() reports a Core i5 Haswell or something
similar there, LLVM will then try to use AVX regardless of whether the
AVX flag is actually present. This means that, in theory, we should
probably mask off the unsupported bits manually somehow, so that they
aren't automatically derived from the CPU type set there (or set a
different CPU name, but we'd need to bend over backwards to derive the
correct type). The mechanism for doing so seems kind of "meh" for JIT
code...
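
To make the "mask off the unsupported bits" idea a bit more concrete,
here is a rough, untested-against-Mesa sketch. It only builds an
explicit "+feature"/"-feature" list from capability flags. CpuCaps and
build_mattrs() are hypothetical names made up for illustration, not
existing Mesa code, and where the flags come from (CPUID or similar) is
left out. The assumption is that such a list would be handed to the
engine builder (for example via setMAttrs()) next to
builder.setMCPU(MCPU), so the features are no longer implied by the CPU
name alone.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for CPU capability bits detected at runtime
// (e.g. via CPUID). This is NOT an actual Mesa structure.
struct CpuCaps {
    bool has_sse4_1;
    bool has_avx;
    bool has_avx2;
};

// Build an explicit feature list ("+name" enables, "-name" disables)
// so that features are stated outright instead of being derived from
// the CPU name. Hypothetical helper, for illustration only.
std::vector<std::string> build_mattrs(const CpuCaps &caps)
{
    std::vector<std::string> attrs;
    attrs.push_back(caps.has_sse4_1 ? "+sse4.1" : "-sse4.1");
    attrs.push_back(caps.has_avx ? "+avx" : "-avx");
    // AVX2 is only usable when AVX itself is available.
    attrs.push_back(caps.has_avx && caps.has_avx2 ? "+avx2" : "-avx2");
    return attrs;
}
```

For a CPU that reports SSE4.1 but no AVX, this yields "+sse4.1", "-avx"
and "-avx2", which is the kind of explicit override I mean. Whether
LLVM then honors it on every version is something that would need
checking.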

Roland