[Mesa-dev] [PATCH 5/6] intel/fs: Handle surface opcode sample masks via predication.

Fri Mar 2 09:15:50 UTC 2018

Hi,

On 27.02.2018 23:38, Francisco Jerez wrote:
> The main motivation is to enable HDC surface opcodes on ICL which no
> longer allows the sample mask to be provided in a message header, but
> this is enabled all the way back to IVB when possible because it
> decreases the instruction count of some shaders using HDC messages
> significantly, e.g. one of the SynMark2 CSDof compute shaders
> decreases instruction count by about 40% due to the removal of header
> setup boilerplate which in turn makes a number of send message
> payloads more easily CSE-able.  Shader-db results on SKL:
> 
>   total instructions in shared programs: 15325319 -> 15314384 (-0.07%)
>   instructions in affected programs: 311532 -> 300597 (-3.51%)
>   helped: 491
>   HURT: 1
> 
> Shader-db results on BDW where the optimization needs to be disabled
> in some cases due to hardware restrictions:
> 
>   total instructions in shared programs: 15604794 -> 15598028 (-0.04%)
>   instructions in affected programs: 220863 -> 214097 (-3.06%)
>   helped: 351
>   HURT: 0
> 
> The FPS of SynMark2 CSDof improves by 5.09% ±0.36% (n=10) on my SKL
> laptop with this change.

I tested the series with our full benchmark set, on BYT, HSW GT2, BXT 
and SKL GT2.

CSDof improved by:
* 9% on BYT
* 7-8% on BXT J4205, and on SKL GT2 desktop

(Variance on HSW was too large in this test, to conclude anything.)

Changes in all the other tests were within daily variance.

Tested-By: Eero Tamminen <eero.t.tamminen at intel.com>

	- Eero
> ---
>   src/intel/compiler/brw_fs.cpp | 42 +++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
> index 0b87d8ab14e..639432b4f49 100644
> --- a/src/intel/compiler/brw_fs.cpp
> +++ b/src/intel/compiler/brw_fs.cpp
> @@ -4432,6 +4432,8 @@ static void
>   lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>                              const fs_reg &sample_mask)
>   {
> +   const gen_device_info *devinfo = bld.shader->devinfo;
> +
>      /* Get the logical send arguments. */
>      const fs_reg &addr = inst->src[0];
>      const fs_reg &src = inst->src[1];
> @@ -4442,7 +4444,20 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>      /* Calculate the total number of components of the payload. */
>      const unsigned addr_sz = inst->components_read(0);
>      const unsigned src_sz = inst->components_read(1);
> -   const unsigned header_sz = (sample_mask.file == BAD_FILE ? 0 : 1);
> +   /* From the BDW PRM Volume 7, page 147:
> +    *
> +    *  "For the Data Cache Data Port*, the header must be present for the
> +    *   following message types: [...] Typed read/write/atomics"
> +    *
> +    * Earlier generations have a similar wording.  Because of this restriction
> +    * we don't attempt to implement sample masks via predication for such
> +    * messages prior to Gen9, since we have to provide a header anyway.  On
> +    * Gen11+ the header has been removed so we can only use predication.
> +    */
> +   const unsigned header_sz = devinfo->gen < 9 &&
> +                              (op == SHADER_OPCODE_TYPED_SURFACE_READ ||
> +                               op == SHADER_OPCODE_TYPED_SURFACE_WRITE ||
> +                               op == SHADER_OPCODE_TYPED_ATOMIC) ? 1 : 0;
>      const unsigned sz = header_sz + addr_sz + src_sz;
>   
>      /* Allocate space for the payload. */
> @@ -4462,6 +4477,31 @@ lower_surface_logical_send(const fs_builder &bld, fs_inst *inst, opcode op,
>   
>      bld.LOAD_PAYLOAD(payload, components, sz, header_sz);
>   
> +   /* Predicate the instruction on the sample mask if no header is
> +    * provided.
> +    */
> +   if (!header_sz && sample_mask.file != BAD_FILE &&
> +       sample_mask.file != IMM) {
> +      const fs_builder ubld = bld.group(1, 0).exec_all();
> +      if (inst->predicate) {
> +         assert(inst->predicate == BRW_PREDICATE_NORMAL);
> +         assert(!inst->predicate_inverse);
> +         assert(inst->flag_subreg < 2);
> +         /* Combine the sample mask with the existing predicate by using a
> +          * vertical predication mode.
> +           */
> +         inst->predicate = BRW_PREDICATE_ALIGN1_ALLV;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg + 2),
> +                         sample_mask.type),
> +                  sample_mask);
> +      } else {
> +         inst->flag_subreg = 2;
> +         inst->predicate = BRW_PREDICATE_NORMAL;
> +         ubld.MOV(retype(brw_flag_subreg(inst->flag_subreg), sample_mask.type),
> +                  sample_mask);
> +      }
> +   }
> +
>      /* Update the original instruction. */
>      inst->opcode = op;
>      inst->mlen = header_sz + (addr_sz + src_sz) * inst->exec_size / 8;
>