[PATCH xf86-video-ati] Replace loop with clz to calculate log base 2 on non-x86 platforms in radeon.h

Fri Dec 2 09:31:20 UTC 2016

Am 01.12.2016 um 01:57 schrieb Michel Dänzer:
> On 01/12/16 02:52 AM, Jochen Rollwagen wrote:
>> Am 29.11.2016 um 08:32 schrieb Michel Dänzer:
>>> On 29/11/16 03:18 AM, Jochen Rollwagen wrote:
>>>> This commit replaces the loop for calculating log base 2 for
>>>> non-x86-platforms in radeon.h with a clz (count leading zeroes)-based
>>>> version to simplify the code and, well, eliminate the loop.
>>>> Note: There’s no check for val=0 case, since x86-bsr is undefined for
>>>> that case too, that should be okay.
>>>> ---
>>>>   src/radeon.h |    7 +++----
>>>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/src/radeon.h b/src/radeon.h
>>>> index cbc7866..b1a1ce0 100644
>>>> --- a/src/radeon.h
>>>> +++ b/src/radeon.h
>>>> @@ -933,17 +933,16 @@ enum {
>>>>   static __inline__ int
>>>>   RADEONLog2(int val)
>>>>   {
>>>> -    int bits;
>>>>   #if (defined __i386__ || defined __x86_64__) && (defined __GNUC__)
>>>> +    int bits;
>>>> +
>>>>       __asm volatile("bsrl    %1, %0"
>>>>           : "=r" (bits)
>>>>           : "c" (val)
>>>>       );
>>>>       return bits;
>>>>   #else
>>>> -    for (bits = 0; val != 0; val >>= 1, ++bits)
>>>> -        ;
>>>> -    return bits - 1;
>>>> +    return (31 - __builtin_clz(val));
>>>>   #endif
>>>>   }
>>> Any reason for not using __builtin_clz on x86 as well? AFAICT both gcc
>>> and clang seem to generate more or less the same code with that as with
>>> the inline assembly.
>>>
>>>
>> I guess not. According to
>> http://stackoverflow.com/questions/9353973/implementation-of-builtin-clz
>> "bsr and clz are related but different.
>>
>> On x86 for clz gcc (-O2) generates:
>>
>> bsrl %edi, %eax
>> xorl $31, %eax
>> ret "
> That's not what I'm seeing. Have you compared the code generated by your
> compiler in both cases?
>
>
Well, I'm on a powerpc platform, the generated code looks like this:

RADEONLog2:
     stwu %r1,-16(%r1)
     cntlzw %r3,%r3
     subfic %r3,%r3,31
     addi %r1,%r1,16
     blr

Straightforward enough.

But i guess you’re right, the generated code is similar enough to 
justify a general approach without platform-specific inline assembler 
and leaving the rest to the compiler.

  I'll post V2 of the patch.