[Mesa-dev] [PATCH 3/3 v5] i965/fs: Combine tex/fb_write operations (opt)

Neil Roberts neil at linux.intel.com
Wed May 6 10:00:24 PDT 2015


Hi,

This optimisation doesn't seem to work with textureGather so a bunch of
Piglit tests are failing for me. I'm not sure why it didn't get picked
up by your Jenkins run.

I can't find anything in the bspec nor a known workaround to suggest
that this shouldn't work so I'm not really sure what to do. I guess we
could just disable the optimisation for textureGather but it's not great
to do that if we don't know why it doesn't work.

Originally I thought that maybe it wasn't working because textureGather
adds its own message header so it would skip out half of the
opt_sample_eot function. However that is the same case with
textureOffset and there are Piglit tests for that that also hit the
optimisation and they don't fail.

Any suggestions for what to do about this? Does anyone want to sanity
check that it's not just me that's seeing this? An example of a test
that's failing is:

 textureGather fs r 0 unorm 2D repeat -auto -fbo

If I comment out the call to opt_sample_eot then it works.

Regards,
- Neil

Ben Widawsky <benjamin.widawsky at intel.com> writes:

> Certain platforms support the ability to sample from a texture, and write it out
> to the file RT - thus saving a costly send instructions (note that this is a
> potnential win if one wanted to backport to a tag that didn't have the patch
> from Topi which removed excess MOVs from LOAD_PAYLOAD - 97caf5fa04dbd2),
>
> v2: Modify the algorithm. Instead of iterating in reverse through blocks and
> insts, since the last block/inst is the only thing which can benefit. Rebased
> on top of Ken's patching modifying is_last_send
>
> v3: Rebased over almost 2 months, and Incorporated feedback from Matt:
> Some comment typo fixes and rewordings.
> Whitespace
> Move the optimization pass outside of the optimize loop
>
> v4: Some cosmetic changes requested from Ken. These changes ensured that the
> optimization function always returned true when an optimization occurred, and
> false when one did not. This behavior did not exist with the original patch. As
> a result, having the separate helper function which Matt did not like no longer
> made sense, and so now I believe everyone should be happy.
>
> Benchmark (n=20)   %diff
> *OglBatch5         -1.4
> *OglBatch7         -1.79
> OglFillTexMulti    5.57
> OglFillTexSingle   1.16
> OglShMapPcf        0.05
> OglTexFilterAniso  3.01
> OglTexFilterTri    1.94
>
> No piglit regressions:
> (http://otc-gfxtest-01.jf.intel.com:8080/view/dev/job/bwidawsk/112/)
>
> [*] I believe my measurements are incorrect for Batch5-7. If I add this new
> optimization, but never emit the new instruction I see similar results.
>
> v5: Remove declaration of combine_tex_header since v4 dropped that function
> (Ben)
> Remove check for impossible case of an empty block (Matt)
> Set dest earlier to avoid extra special-casing in generate_tex (Matt)
>
> Cc: Kenneth Graunke <kenneth at whitecape.org>
> Cc: Jason Ekstrand <jason at jlekstrand.net>
> Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> Reviewed-by: Matt Turner <mattst88 at gmail.com>
> ---
>  src/mesa/drivers/dri/i965/brw_fs.cpp           | 90 ++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/brw_fs.h             |  2 +
>  src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 11 ++++
>  3 files changed, 103 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index eb2b57a..5bbcccb 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -2555,6 +2555,94 @@ fs_visitor::opt_algebraic()
>     return progress;
>  }
>  
> +/**
> + * Optimize sample messages which are followed by the final RT write.
> + *
> + * CHV, and GEN9+ can mark a texturing SEND instruction with EOT to have its
> + * results sent directly to the framebuffer, bypassing the EU.  Recognize the
> + * final texturing results copied to the framebuffer write payload and modify
> + * them to write to the framebuffer directly.
> + */
> +bool
> +fs_visitor::opt_sampler_eot()
> +{
> +   brw_wm_prog_key *key = (brw_wm_prog_key*) this->key;
> +
> +   if (brw->gen < 9 && !brw->is_cherryview)
> +      return false;
> +
> +   /* FINISHME: It should be possible to implement this optimization when there
> +    * are multiple drawbuffers.
> +    */
> +   if (key->nr_color_regions != 1)
> +      return false;
> +
> +   /* Look for a texturing instruction immediately before the final FB_WRITE. */
> +   fs_inst *fb_write = (fs_inst *) cfg->blocks[cfg->num_blocks - 1]->end();
> +   assert(fb_write->eot);
> +   assert(fb_write->opcode == FS_OPCODE_FB_WRITE);
> +
> +   fs_inst *tex_inst = (fs_inst *) fb_write->prev;
> +
> +   /* There wasn't one; nothing to do. */
> +   if (unlikely(tex_inst->is_head_sentinel()) || !tex_inst->is_tex())
> +      return false;
> +
> +   /* If there's no header present, we need to munge the LOAD_PAYLOAD as well.
> +    * It's very likely to be the previous instruction.
> +    */
> +   fs_inst *load_payload = (fs_inst *) tex_inst->prev;
> +   if (load_payload->is_head_sentinel() ||
> +       load_payload->opcode != SHADER_OPCODE_LOAD_PAYLOAD)
> +      return false;
> +
> +   assert(!tex_inst->eot); /* We can't get here twice */
> +   assert((tex_inst->offset & (0xff << 24)) == 0);
> +
> +   tex_inst->offset |= fb_write->target << 24;
> +   tex_inst->eot = true;
> +   fb_write->remove(cfg->blocks[cfg->num_blocks - 1]);
> +
> +   /* If a header is present, marking the eot is sufficient. Otherwise, we need
> +    * to create a new LOAD_PAYLOAD command with the same sources and a space
> +    * saved for the header. Using a new destination register not only makes sure
> +    * we have enough space, but it will make sure the dead code eliminator kills
> +    * the instruction that this will replace.
> +    */
> +   if (tex_inst->header_present)
> +      return true;
> +
> +   fs_reg send_header = vgrf(load_payload->sources + 1);
> +   fs_reg *new_sources =
> +      ralloc_array(mem_ctx, fs_reg, load_payload->sources + 1);
> +
> +   new_sources[0] = fs_reg();
> +   for (int i = 0; i < load_payload->sources; i++)
> +      new_sources[i+1] = load_payload->src[i];
> +
> +   /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, it
> +    * requires a lot of information about the sources to appropriately figure
> +    * out the number of registers needed to be used. Given this stage in our
> +    * optimization, we may not have the appropriate GRFs required by
> +    * LOAD_PAYLOAD at this point (copy propagation). Therefore, we need to
> +    * manually emit the instruction.
> +    */
> +   fs_inst *new_load_payload = new(mem_ctx) fs_inst(SHADER_OPCODE_LOAD_PAYLOAD,
> +                                                    load_payload->exec_size,
> +                                                    send_header,
> +                                                    new_sources,
> +                                                    load_payload->sources + 1);
> +
> +   new_load_payload->regs_written = load_payload->regs_written + 1;
> +   tex_inst->mlen++;
> +   tex_inst->header_present = true;
> +   tex_inst->insert_before(cfg->blocks[cfg->num_blocks - 1], new_load_payload);
> +   tex_inst->src[0] = send_header;
> +   tex_inst->dst = reg_null_ud;
> +
> +   return true;
> +}
> +
>  bool
>  fs_visitor::opt_register_renaming()
>  {
> @@ -3761,6 +3849,8 @@ fs_visitor::optimize()
>  
>     pass_num = 0;
>  
> +   OPT(opt_sampler_eot);
> +
>     if (OPT(lower_load_payload)) {
>        split_virtual_grfs();
>        OPT(register_coalesce);
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
> index 278a8ee..43e4115 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.h
> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
> @@ -231,6 +231,8 @@ public:
>     bool compute_to_mrf();
>     bool dead_code_eliminate();
>     bool remove_duplicate_mrf_writes();
> +
> +   bool opt_sampler_eot();
>     bool virtual_grf_interferes(int a, int b);
>     void schedule_instructions(instruction_scheduler_mode mode);
>     void insert_gen4_send_dependency_workarounds();
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> index 40e51aa..023ff17 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> @@ -517,6 +517,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>     int rlen = 4;
>     uint32_t simd_mode;
>     uint32_t return_format;
> +   bool is_combined_send = inst->eot;
>  
>     switch (dst.type) {
>     case BRW_REGISTER_TYPE_D:
> @@ -676,6 +677,11 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>        dst = vec16(dst);
>     }
>  
> +   if (is_combined_send) {
> +      assert(brw->gen >= 9 || brw->is_cherryview);
> +      rlen = 0;
> +   }
> +
>     assert(brw->gen < 7 || !inst->header_present ||
>            src.file == BRW_GENERAL_REGISTER_FILE);
>  
> @@ -781,6 +787,11 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>         * so has already done marking.
>         */
>     }
> +
> +   if (is_combined_send) {
> +      brw_inst_set_eot(brw, brw_last_inst, true);
> +      brw_inst_set_opcode(brw, brw_last_inst, BRW_OPCODE_SENDC);
> +   }
>  }
>  
>  
> -- 
> 2.3.5
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev


More information about the mesa-dev mailing list