[Mesa-dev] [PATCH] gallivm: use getHostCPUFeatures on x86/llvm-4.0+.

Tue Dec 6 17:47:30 UTC 2016

Actually I've verified this quickly with llc.
With -mattr=xop, it produces

fetch_r32_float_float:                  # @fetch_r32_float_float
        .cfi_startproc
# BB#0:                                 # %entry
        vpermilps       $65, .LCPI0_0(%rip), %xmm0 # xmm0 = mem[1,0,0,1]
        vmovaps %xmm0, (%rdi)
        retq

which is very obviously garbage (it even managed to optimize out the
actual load; only the constants are left...). So this is indeed an llvm
bug with xop. I'm going to file a bug, but in the interim I don't know
what mesa should do - this is one reason why we didn't want to enable
features which we hadn't actually tested (that said, if we
don't enable them, the llvm bugs we hit will probably never get
fixed...). We could of course force-disable xop (although in theory it's
nice - with it we really can make use of that damn missing vector shift,
which otherwise requires avx2).
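
For reference, here is a minimal Python sketch of why the xop output above reproduces the failing lp_test_format results quoted below. It models the shufflevector from the IR and the immediate form of vpermilps on plain lists. The function names are hypothetical, undef is modeled as None, and it assumes that .LCPI0_0 holds the shuffle's constant operand (0.0, 1.0, undef, undef), which the llc output does not show directly.

```python
import struct


def decode_r32_float(packed: bytes) -> float:
    """Interpret 4 packed bytes as a little-endian IEEE-754 float (x86 byte order)."""
    return struct.unpack("<f", packed)[0]


def expected_fetch(packed: bytes) -> list:
    """Model the IR: load i32, zext to i128, bitcast to <4 x float>, then shuffle."""
    # zext + bitcast on little-endian: lane 0 is the loaded value, the rest are 0.0.
    src = [decode_r32_float(packed), 0.0, 0.0, 0.0]
    const = [0.0, 1.0, None, None]  # None stands in for undef
    both = src + const              # shufflevector indexes the concatenation
    return [both[i] for i in (0, 4, 4, 5)]


def vpermilps_imm8(imm8: int, src: list) -> list:
    """Model 128-bit vpermilps with an immediate: lane i takes src[(imm8 >> 2*i) & 3]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]


def xop_fetch(_packed: bytes) -> list:
    """Model the llc -mattr=xop output: the packed input is never read."""
    const_pool = [0.0, 1.0, None, None]  # ASSUMED contents of .LCPI0_0
    return vpermilps_imm8(65, const_pool)


if __name__ == "__main__":
    for packed in (bytes.fromhex("00000000"), bytes.fromhex("000080bf")):
        print(packed.hex(" "), "expected", expected_fetch(packed),
              "obtained", xop_fetch(packed))
```

Under these assumptions the modeled xop path returns the same vector for every input, since only the constant pool is permuted and the load from the source pointer is gone.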

Roland

Am 06.12.2016 um 17:34 schrieb Roland Scheidegger:
> Interesting. Can you show the IR / assembly? I don't get any failures here.
> I'm wondering if it's trying to use XOP and there's some bug there (or
> we're relying on undefined behavior which happens not to work with
> it). That said, since there's no actual conversion involved in this
> case (float 1 channel -> float 4 channel), the assembly here looks
> trivial and I can't see how it could go wrong.
> 
> I get (with a couple days old llvm):
> define void @fetch_r32_float_float(<4 x float>*, i8*, i32, i32, { [2048 x i32], [128 x i64] }*) {
> entry:
>   %5 = getelementptr i8, i8* %1, i32 0
>   %6 = bitcast i8* %5 to i32*
>   %7 = load i32, i32* %6
>   %8 = zext i32 %7 to i128
>   %9 = bitcast i128 %8 to <4 x float>
>   %10 = shufflevector <4 x float> %9, <4 x float> <float 0.000000e+00, float 1.000000e+00, float undef, float undef>, <4 x i32> <i32 0, i32 4, i32 4, i32 5>
>   store <4 x float> %10, <4 x float>* %0
>   ret void
> }
> 
> fetch_r32_float_float:
>      0:         pushq   %rbp
>      1:         movq    %rsp, %rbp
>      4:         movl    (%rsi), %eax
>      6:         vmovq   %rax, %xmm0
>     11:         movabsq $140375561531392, %rax
>     21:         vmovaps (%rax), %xmm1
>     25:         vshufps $0, %xmm1, %xmm0, %xmm0
>     30:         vshufps $72, %xmm1, %xmm0, %xmm0
>     35:         vmovaps %xmm0, (%rdi)
>     39:         popq    %rbp
>     40:         retq
> 
> The only thing I can think of is maybe the load/zext in combination with
> the shuffle going wrong - the shuffle combiner in llvm has a couple xop
> cases.
> 
> fwiw the printing of the values is a bit suboptimal: the "packed" 00 00 80
> bf value really is the float 0xbf800000 (i.e. -1.0), and you don't see the
> other channels at all, although in this case there aren't any...
> 
> Roland
> 
> Am 06.12.2016 um 07:27 schrieb Michel Dänzer:
>> On 06/12/16 02:39 AM, Tim Rowley wrote:
>>> Use the llvm-provided API based on cpuid rather than our own
>>> manually maintained list of mattr enabling/disabling.
>>
>> This change broke the llvmpipe unit test lp_test_format for me:
>>
>> Testing PIPE_FORMAT_R32_FLOAT (float) ...
>> FAILED
>>   Packed: 00 00 00 00
>>   Unpacked (0,0): 1 0 0 1 obtained
>>                   0 0 0 1 expected
>> FAILED
>>   Packed: 00 00 80 bf
>>   Unpacked (0,0): 1 0 0 1 obtained
>>                   -1 0 0 1 expected
>>
>>
>> This is on:
>>
>> processor	: 0
>> vendor_id	: AuthenticAMD
>> cpu family	: 21
>> model		: 48
>> model name	: AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
>> stepping	: 1
>> microcode	: 0x6003106
>> cpu MHz		: 4100.000
>> cache size	: 2048 KB
>> physical id	: 0
>> siblings	: 4
>> core id		: 0
>> cpu cores	: 2
>> apicid		: 16
>> initial apicid	: 0
>> fpu		: yes
>> fpu_exception	: yes
>> cpuid level	: 13
>> wp		: yes
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb bpext ptsc cpb hw_pstate vmmcall fsgsbase bmi1 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov
>> bugs		: fxsave_leak sysret_ss_attrs null_seg
>> bogomips	: 8200.42
>> TLB size	: 1536 4K pages
>> clflush size	: 64
>> cache_alignment	: 64
>> address sizes	: 48 bits physical, 48 bits virtual
>> power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro [13]
>>
>