[Mesa-dev] [PATCH] gallivm: use getHostCPUFeatures on x86/llvm-4.0+.

Tue Dec 6 18:00:32 UTC 2016

Ok, here is the bug:
https://llvm.org/bugs/show_bug.cgi?id=31296

Roland

On 06.12.2016 at 18:47, Roland Scheidegger wrote:
> Actually I've verified this quickly with llc.
> With -mattr=xop, it produces
> 
> fetch_r32_float_float:                  # @fetch_r32_float_float
>         .cfi_startproc
> # BB#0:                                 # %entry
>         vpermilps       $65, .LCPI0_0(%rip), %xmm0 # xmm0 = mem[1,0,0,1]
>         vmovaps %xmm0, (%rdi)
>         retq
> 
> which is very obviously garbage (it even managed to optimize out the
> actual load, just the constants are left...). So this is indeed an llvm
> bug with xop. I'm going to file a bug, but in the interim I don't know
> what mesa should do - this is one reason why we didn't want to enable
> features which we didn't actually test previously (that said, if we
> don't enable them, the llvm bugs we hit will probably never get
> fixed...). We could of course force-disable xop (albeit in theory it's
> nice - we really can make use of that damn missing vector shift which
> otherwise requires avx2).
> 
> Roland
> 
> 
> On 06.12.2016 at 17:34, Roland Scheidegger wrote:
>> Interesting. Can you show the IR / assembly? I don't get any failures here.
>> I'm wondering if it's trying to use XOP and there's some bug there (or
>> we're relying on undefined behavior which doesn't happen to work with
>> it). Albeit since there's not actually any conversion involved in this
>> case (float 1 channel -> float 4 channel) the assembly here looks
>> trivial and I can't see how it could go wrong.
>>
>> I get (with a couple days old llvm):
>> define void @fetch_r32_float_float(<4 x float>*, i8*, i32, i32, { [2048
>> x i32], [128 x i64] }*) {
>> entry:
>>   %5 = getelementptr i8, i8* %1, i32 0
>>   %6 = bitcast i8* %5 to i32*
>>   %7 = load i32, i32* %6
>>   %8 = zext i32 %7 to i128
>>   %9 = bitcast i128 %8 to <4 x float>
>>   %10 = shufflevector <4 x float> %9, <4 x float> <float 0.000000e+00,
>> float 1.000000e+00, float undef, float undef>, <4 x i32> <i32 0, i32 4,
>> i32 4, i32 5>
>>   store <4 x float> %10, <4 x float>* %0
>>   ret void
>> }
>>
>> fetch_r32_float_float:
>>      0:         pushq   %rbp
>>      1:         movq    %rsp, %rbp
>>      4:         movl    (%rsi), %eax
>>      6:         vmovq   %rax, %xmm0
>>     11:         movabsq $140375561531392, %rax
>>     21:         vmovaps (%rax), %xmm1
>>     25:         vshufps $0, %xmm1, %xmm0, %xmm0
>>     30:         vshufps $72, %xmm1, %xmm0, %xmm0
>>     35:         vmovaps %xmm0, (%rdi)
>>     39:         popq    %rbp
>>     40:         retq
>>
>> The only thing I can think of is maybe the load/zext in combination with
>> the shuffle going wrong - the shuffle combiner in llvm has a couple xop
>> cases.
>>
>> fwiw the printing of the values is a bit suboptimal: the "packed" 00 00
>> 80 bf value really is the float 0xbf800000 (-1.0), and you don't see the
>> other channels at all, albeit in this case there aren't any...
>>
>> Roland
>>
>> On 06.12.2016 at 07:27, Michel Dänzer wrote:
>>> On 06/12/16 02:39 AM, Tim Rowley wrote:
>>>> Use llvm provided API based on cpuid rather than our own
>>>> manually maintained list of mattr enabling/disabling.
>>>
>>> This change broke the llvmpipe unit test lp_test_format for me:
>>>
>>> Testing PIPE_FORMAT_R32_FLOAT (float) ...
>>> FAILED
>>>   Packed: 00 00 00 00
>>>   Unpacked (0,0): 1 0 0 1 obtained
>>>                   0 0 0 1 expected
>>> FAILED
>>>   Packed: 00 00 80 bf
>>>   Unpacked (0,0): 1 0 0 1 obtained
>>>                   -1 0 0 1 expected
>>>
>>>
>>> This is on:
>>>
>>> processor	: 0
>>> vendor_id	: AuthenticAMD
>>> cpu family	: 21
>>> model		: 48
>>> model name	: AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
>>> stepping	: 1
>>> microcode	: 0x6003106
>>> cpu MHz		: 4100.000
>>> cache size	: 2048 KB
>>> physical id	: 0
>>> siblings	: 4
>>> core id		: 0
>>> cpu cores	: 2
>>> apicid		: 16
>>> initial apicid	: 0
>>> fpu		: yes
>>> fpu_exception	: yes
>>> cpuid level	: 13
>>> wp		: yes
>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb bpext ptsc cpb hw_pstate vmmcall fsgsbase bmi1 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov
>>> bugs		: fxsave_leak sysret_ss_attrs null_seg
>>> bogomips	: 8200.42
>>> TLB size	: 1536 4K pages
>>> clflush size	: 64
>>> cache_alignment	: 64
>>> address sizes	: 48 bits physical, 48 bits virtual
>>> power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro [13]
>>>
>>>
>>>
>>
>
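
For readers following the thread, here is a minimal Python sketch of what the quoted IR for fetch_r32_float_float should compute, next to what the XOP build appears to compute instead. It is only a model, not Mesa or LLVM code: the function names are made up for illustration, None stands in for undef, and the contents of .LCPI0_0 are assumed to be the second shufflevector operand <0.0, 1.0, undef, undef>, which the llc output does not show.

```python
import struct

def fetch_r32_float(packed: bytes):
    """Model of the quoted IR: load i32, zext to i128, bitcast, shufflevector."""
    # zext i32 -> i128 followed by a bitcast to <4 x float>: on little-endian
    # x86, lane 0 holds the loaded bits and lanes 1..3 are zero.
    widened = packed[:4] + b"\x00" * 12
    first = list(struct.unpack("<4f", widened))
    # Second shuffle operand from the IR; None stands in for undef.
    second = [0.0, 1.0, None, None]
    # shufflevector indexes into the concatenation of both operands.
    both = first + second
    mask = [0, 4, 4, 5]
    return [both[i] for i in mask]

def vpermilps_imm8(src, imm8):
    """Immediate form of vpermilps: result lane i takes src[(imm8 >> 2*i) & 3]."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

# Packed bytes 00 00 80 bf from the failing lp_test_format case.
expected = fetch_r32_float(bytes([0x00, 0x00, 0x80, 0xBF]))
# Assumption: .LCPI0_0 holds the constant operand <0.0, 1.0, undef, undef>.
obtained = vpermilps_imm8([0.0, 1.0, None, None], 65)

print("expected:", expected)
print("obtained:", obtained)
```

Under those assumptions the model gives -1 0 0 1 for the correct shuffle and 1 0 0 1 for the vpermilps on the constants alone, matching the "expected" and "obtained" rows of the test output, and the result no longer depends on the packed input at all.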