[Mesa-dev] [PATCH] llvmpipe: add cc clobber to inline asm

Mon Aug 20 23:11:48 UTC 2018

On 20.08.2018 at 23:31, Grazvydas Ignotas wrote:
> The bsr instruction modifies flags, so that needs to be indicated to the
> compiler. No effect on generated code, but still needed for correctness.
> ---
>  src/gallium/drivers/llvmpipe/lp_setup_tri.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/src/gallium/drivers/llvmpipe/lp_setup_tri.c b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
> index cec6198ec63..1852ec05d56 100644
> --- a/src/gallium/drivers/llvmpipe/lp_setup_tri.c
> +++ b/src/gallium/drivers/llvmpipe/lp_setup_tri.c
> @@ -732,11 +732,12 @@ floor_pot(uint32_t n)
>     if (n == 0)
>        return 0;
>  
>     __asm__("bsr %1,%0"
>            : "=r" (n)
> -          : "rm" (n));
> +          : "rm" (n)
> +          : "cc");
>     return 1 << n;
>  #else
>     n |= (n >>  1);
>     n |= (n >>  2);
>     n |= (n >>  4);
> 

Looks alright (although my inline asm is a bit rusty). I do wonder
whether floor_pot() should use util_logbase2 instead, though it's not
quite an exact fit.

Or we could use __builtin_clz directly there based on HAVE___BUILTIN_CLZ.
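
For illustration, a minimal sketch of what the __builtin_clz variant
could look like. The name floor_pot_clz is made up here and is not part
of the patch; in-tree this would presumably sit behind the
HAVE___BUILTIN_CLZ check, and it assumes a 32-bit unsigned int.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): largest power of two <= n,
 * or 0 for n == 0, using the GCC/Clang builtin instead of inline asm. */
static inline uint32_t
floor_pot_clz(uint32_t n)
{
   /* __builtin_clz(0) is undefined, so the explicit zero check stays. */
   if (n == 0)
      return 0;

   /* 31 - clz(n) is the index of the highest set bit. */
   return UINT32_C(1) << (31 - __builtin_clz(n));
}
```

The zero check mirrors the one in the existing floor_pot(), since bsr
likewise leaves its destination undefined for a zero source.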

As a side note, it actually seems tricky to get gcc to emit the
"correct" trivial sequence (tested with version 7.3.1).
If you do
int val = 1 << (31 - __builtin_clz(in));
it emits (-O3)
  bsr    %eax,%eax
  mov    $0x1f,%ecx
  xor    $0x1f,%eax
  sub    %eax,%ecx
  mov    $0x1,%eax
  shl    %cl,%eax
which isn't the end of the world, but it is quite an optimization failure.

with -O3 -march=haswell it will figure it out:
  bsr    %eax,%edx
  mov    $0x1,%eax
  shlx   %edx,%eax,%eax

If you think you're clever and instead do
int val = 1 << (__builtin_clz(in) ^ 31);
(which is really the same thing)
gcc now is happy with -O3
  bsr    %eax,%ecx
  mov    $0x1,%eax
  shl    %cl,%eax
Naturally, the sub is gone, and gcc recognized that the xor 31 and its
own xor 31 from the lzcnt emulation cancel each other out.
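
To back up the "really the same thing" claim: for any x in 0..31,
31 - x never borrows (31 is all ones in five bits), so it equals x ^ 31.
A small sketch that checks both spellings against each other; the
function names are hypothetical and only for illustration, and inputs
must be nonzero because __builtin_clz(0) is undefined.

```c
#include <stdint.h>

/* Hypothetical helpers: the two spellings discussed above. */
static inline uint32_t
pot_sub(uint32_t in)
{
   return UINT32_C(1) << (31 - __builtin_clz(in));
}

static inline uint32_t
pot_xor(uint32_t in)
{
   return UINT32_C(1) << (__builtin_clz(in) ^ 31);
}

/* Returns 1 if both variants agree for every possible position of the
 * highest set bit, with and without lower bits set; 0 otherwise. */
static int
pot_variants_agree(void)
{
   for (int bit = 0; bit < 32; bit++) {
      uint32_t top = UINT32_C(1) << bit;
      uint32_t all_lower = top | (top - 1);   /* top bit plus all bits below */

      if (pot_sub(top) != top || pot_xor(top) != top)
         return 0;
      if (pot_sub(all_lower) != top || pot_xor(all_lower) != top)
         return 0;
   }
   return 1;
}
```

So any difference in the generated code comes purely from how gcc
pattern-matches the expression, not from the arithmetic.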

but with -O3 -march=haswell it's a bit suboptimal now:
  mov    $0x1,%edx
  lzcnt  %eax,%eax
  xor    $0x1f,%eax
  shlx   %eax,%edx,%eax

So optimization is quite funny here, depending on whether the cpu can do
lzcnt or just bsr. Fun stuff...

In any case,
Reviewed-by: Roland Scheidegger <sroland at vmware.com>