[Mesa-dev] [Mesa-stable] [PATCH 3/4] i965: Fix execution size of scalar TCS barrier setup code.

Wed Aug 17 20:31:13 UTC 2016

Kenneth Graunke <kenneth at whitecape.org> writes:

> Previously, the scalar TCS backend was generating:
>
> mov(8)   g17<1>UD     0x00000000UD    { align1 WE_all 1Q compacted };
> and(8)   g17.2<1>UD   g0.2<0,1,0>UD   0x0001e000UD  { align1 WE_all 1Q };
> shl(8)   g17.2<1>UD   g17.2<8,8,1>UD  0x0000000bUD  { align1 WE_all 1Q };
> or(8)    g17.2<1>UD   g17.2<8,8,1>UD  0x00008200UD  { align1 WE_all 1Q };
> send(8)  null<1>UW    g17<8,8,1>UD
>          gateway (barrier msg) mlen 1 rlen 0 { align1 WE_all 1Q };
>
> This is rubbish - g17.2<8,8,1>UD spans two registers, and is an illegal
> region.  Not to mention it clobbers 8 channels of data when we only
> wanted to touch m0.2.
>
> Instead, we want:
>
> mov(8)   g17<1>UD     0x00000000UD    { align1 WE_all 1Q compacted };
> and(1)   g17.2<1>UD   g0.2<0,1,0>UD   0x0001e000UD  { align1 WE_all };
> shl(1)   g17.2<1>UD   g17.2<0,1,0>UD  0x0000000bUD  { align1 WE_all };
> or(1)    g17.2<1>UD   g17.2<0,1,0>UD  0x00008200UD  { align1 WE_all };
> send(8)  null<1>UW    g17<8,8,1>UD
>          gateway (barrier msg) mlen 1 rlen 0 { align1 WE_all 1Q };
>
> Using component() accomplishes this.
>
> Fixes GL44-CTS.tessellation_shader.tessellation_shader_tc_barriers.
> barrier_guarded_read_write_calls on Skylake.  Probably fixes other
> barrier issues on Gen8+.
>
> Cc: mesa-stable at lists.freedesktop.org
> Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
> ---
>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
> index df033e1..bb254d6 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
> @@ -2420,7 +2420,7 @@ fs_visitor::nir_emit_tcs_intrinsic(const fs_builder &bld,
>           break;
>  
>        fs_reg m0 = bld.vgrf(BRW_REGISTER_TYPE_UD, 1);
> -      fs_reg m0_2 = byte_offset(m0, 2 * sizeof(uint32_t));
> +      fs_reg m0_2 = component(m0, 2);
>  
I guess this works because component() sets the stride of the
destination to zero (which is not a real destination stride supported by
the hardware), which in turns causes brw_reg_from_fs_reg() to emit a
1-wide region for the destination, which in turn causes the exec size
hack in brw_set_dest() to kick in and override the execution size of the
instruction, ignoring the original execution size coming in from the IR.

Even though that may give you the assembly you want, it seems rather
fragile to rely on a number non-obvious things for the assembly to be
correct, and this still leaves the exec_size field of the IR instruction
equal to the shader's dispatch width (which isn't going to be the real
execution size of the instruction), so the rest of the optimizer may not
treat this instruction the way you expected even if the final assembly
looks correct.

>        const fs_builder fwa_bld = bld.exec_all();
>
How about you make this a scalar builder in addition (like
bld.exec_all().group(1, 0))?  (using component() instead of
byte_offset() above still seems like a positive change for the sake of
readability)

> -- 
> 2.9.3
>
> _______________________________________________
> mesa-stable mailing list
> mesa-stable at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-stable
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160817/68e0272a/attachment.sig>