[Mesa-dev] [PATCH] gallivm: use getHostCPUFeatures on x86/llvm-4.0+.

Wed Dec 7 15:02:27 UTC 2016

The bug in llvm has been fixed. Can you confirm that lp_test_format passes again?

Roland

On 06.12.2016 at 19:00, Roland Scheidegger wrote:
> Ok, here is the bug:
> https://llvm.org/bugs/show_bug.cgi?id=31296
> 
> Roland
> 
> On 06.12.2016 at 18:47, Roland Scheidegger wrote:
>> Actually I've verified this quickly with llc.
>> With -mattr=xop, it produces
>>
>> fetch_r32_float_float:                  # @fetch_r32_float_float
>>         .cfi_startproc
>> # BB#0:                                 # %entry
>>         vpermilps       $65, .LCPI0_0(%rip), %xmm0 # xmm0 = mem[1,0,0,1]
>>         vmovaps %xmm0, (%rdi)
>>         retq
>>
>> which is very obviously garbage (it even managed to optimize out the
>> actual load, just the constants are left...). So this is a llvm bug with
>> xop indeed. I'm going to file a bug, but in the interim I don't know
>> what mesa should do - this is one reason why we didn't want to enable
>> features which we didn't actually test previously (that said, if we
>> don't enable them, the llvm bugs we hit will probably never get
>> fixed...). We could of course force-disable xop (albeit in theory it's
>> nice - we really can make use of that damn missing vector shift which
>> otherwise requires avx2).
>>
>> Roland
>>
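The mismatch between the IR and the XOP code quoted in this thread can be modeled in a few lines of Python. This is only a sketch of the instruction semantics, not anything taken from Mesa or LLVM: the function names are made up for illustration, and it assumes that the constant pool entry .LCPI0_0 holds the shuffle's second operand <0.0, 1.0, undef, undef>, which the llc output does not actually show.

```python
UNDEF = None  # stand-in for an undef lane


def shufflevector(a, b, mask):
    """Model LLVM shufflevector: indices 0..n-1 select from a, n..2n-1 from b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


def vpermilps_imm(src, imm):
    """Model the immediate form of vpermilps on a single 128-bit lane.

    Each 2-bit field of imm selects the source element for one result lane.
    """
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]


def expected_fetch(r):
    """What the IR asks for: load one float, then shuffle in 0 and 1."""
    # load i32, zext to i128, bitcast to <4 x float>: upper lanes are zero
    loaded = [r, 0.0, 0.0, 0.0]
    consts = [0.0, 1.0, UNDEF, UNDEF]
    return shufflevector(loaded, consts, [0, 4, 4, 5])


def xop_miscompiled_fetch(r):
    """What the -mattr=xop code does: permute the constants, ignore r."""
    # ASSUMPTION: .LCPI0_0 contains <0.0, 1.0, undef, undef>
    consts = [0.0, 1.0, UNDEF, UNDEF]
    return vpermilps_imm(consts, 65)
```

Under that assumption, the immediate 65 (binary 01 00 00 01) selects elements [1, 0, 0, 1] of the constant vector, so the miscompiled routine yields 1 0 0 1 regardless of the input, which is consistent with the "obtained" values in the lp_test_format failure, while the IR semantics give r 0 0 1.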
>>
>> On 06.12.2016 at 17:34, Roland Scheidegger wrote:
>>> Interesting. Can you show the IR / assembly? I don't get any failures here.
>>> I'm wondering if it's trying to use XOP and there's some bug there (or
>>> we're relying on undefined behavior which doesn't happen to work with
>>> it). Albeit since there's not actually any conversion involved in this
>>> case (float 1 channel -> float 4 channel) the assembly here looks
>>> trivial and I can't see how it could go wrong.
>>>
>>> I get (with a couple days old llvm):
>>> define void @fetch_r32_float_float(<4 x float>*, i8*, i32, i32, { [2048
>>> x i32], [128 x i64] }*) {
>>> entry:
>>>   %5 = getelementptr i8, i8* %1, i32 0
>>>   %6 = bitcast i8* %5 to i32*
>>>   %7 = load i32, i32* %6
>>>   %8 = zext i32 %7 to i128
>>>   %9 = bitcast i128 %8 to <4 x float>
>>>   %10 = shufflevector <4 x float> %9, <4 x float> <float 0.000000e+00,
>>> float 1.000000e+00, float undef, float undef>, <4 x i32> <i32 0, i32 4,
>>> i32 4, i32 5>
>>>   store <4 x float> %10, <4 x float>* %0
>>>   ret void
>>> }
>>>
>>> fetch_r32_float_float:
>>>      0:         pushq   %rbp
>>>      1:         movq    %rsp, %rbp
>>>      4:         movl    (%rsi), %eax
>>>      6:         vmovq   %rax, %xmm0
>>>     11:         movabsq $140375561531392, %rax
>>>     21:         vmovaps (%rax), %xmm1
>>>     25:         vshufps $0, %xmm1, %xmm0, %xmm0
>>>     30:         vshufps $72, %xmm1, %xmm0, %xmm0
>>>     35:         vmovaps %xmm0, (%rdi)
>>>     39:         popq    %rbp
>>>     40:         retq
>>>
>>> The only thing I can think of is maybe the load/zext in combination with
>>> the shuffle going wrong - the shuffle combiner in llvm has a couple xop
>>> cases.
>>>
>>> fwiw printing of the values is a bit suboptimal, the "packed" 00 00 80
>>> bf value really is a float 0xbf800000 and you don't see the other
>>> channels at all albeit in this case there aren't any...
>>>
>>> Roland
>>>
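To make the remark about the "packed" bytes concrete, here is a small Python sketch that decodes the four bytes printed by the test as a little-endian 32-bit float and expands them the way the test output expects for a single-channel float format (r, 0, 0, 1). The helper name is hypothetical and is not part of Mesa; it only mirrors the expected values shown in the failure report.

```python
import struct


def unpack_r32_float(packed):
    """Decode 4 little-endian bytes as one float and expand to (r, 0, 0, 1)."""
    (r,) = struct.unpack("<f", bytes(packed))
    return (r, 0.0, 0.0, 1.0)


def as_hex_word(packed):
    """Show the same 4 bytes as a single little-endian 32-bit hex value."""
    (word,) = struct.unpack("<I", bytes(packed))
    return "0x%08x" % word
```

For the second failing case, the bytes 00 00 80 bf read as the word 0xbf800000, which is the float -1, so the expected unpacked result is -1 0 0 1.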
>>> On 06.12.2016 at 07:27, Michel Dänzer wrote:
>>>> On 06/12/16 02:39 AM, Tim Rowley wrote:
>>>>> Use llvm provided API based on cpuid rather than our own
>>>>> manually maintained list of mattr enabling/disabling.
>>>>
>>>> This change broke the llvmpipe unit test lp_test_format for me:
>>>>
>>>> Testing PIPE_FORMAT_R32_FLOAT (float) ...
>>>> FAILED
>>>>   Packed: 00 00 00 00
>>>>   Unpacked (0,0): 1 0 0 1 obtained
>>>>                   0 0 0 1 expected
>>>> FAILED
>>>>   Packed: 00 00 80 bf
>>>>   Unpacked (0,0): 1 0 0 1 obtained
>>>>                   -1 0 0 1 expected
>>>>
>>>>
>>>> This is on:
>>>>
>>>> processor	: 0
>>>> vendor_id	: AuthenticAMD
>>>> cpu family	: 21
>>>> model		: 48
>>>> model name	: AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G
>>>> stepping	: 1
>>>> microcode	: 0x6003106
>>>> cpu MHz		: 4100.000
>>>> cache size	: 2048 KB
>>>> physical id	: 0
>>>> siblings	: 4
>>>> core id		: 0
>>>> cpu cores	: 2
>>>> apicid		: 16
>>>> initial apicid	: 0
>>>> fpu		: yes
>>>> fpu_exception	: yes
>>>> cpuid level	: 13
>>>> wp		: yes
>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb bpext ptsc cpb hw_pstate vmmcall fsgsbase bmi1 xsaveopt arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov
>>>> bugs		: fxsave_leak sysret_ss_attrs null_seg
>>>> bogomips	: 8200.42
>>>> TLB size	: 1536 4K pages
>>>> clflush size	: 64
>>>> cache_alignment	: 64
>>>> address sizes	: 48 bits physical, 48 bits virtual
>>>> power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro [13]
>>>>
>>>>
>>>>
>>>
>>
>
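Since the failure only showed up on a CPU that advertises XOP, it may help to check whether a given Linux machine reports that flag. The following Python sketch parses text in the /proc/cpuinfo format quoted in this thread and looks for "xop" in the flags line. The function names are made up for illustration, and this is not how LLVM's getHostCPUFeatures detects features (the patch description says that API is based on cpuid).

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_xop(cpuinfo_text):
    """True if the cpuinfo text lists the xop feature flag."""
    return "xop" in cpu_flags(cpuinfo_text)
```

On a Linux system this could be used as has_xop(open("/proc/cpuinfo").read()); for the AMD A10-7850K output quoted above it would report that XOP is present.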