[Mesa-dev] [PATCH] i965/fs: Combine tex/fb_write operations (opt)

Sun Feb 22 15:06:28 PST 2015

On Sun, Feb 08, 2015 at 02:48:02PM -0800, Matt Turner wrote:
> On Sun, Feb 8, 2015 at 1:59 PM, Ben Widawsky
> <benjamin.widawsky at intel.com> wrote:
> > Certain platforms support the ability to sample from a texture, and write it out
> > to the file RT - thus saving a costly send instructions (and before Topi's
> > recent patch, excess MOVs),
> >
> > On BSW, the performance data isn't quite what I was hoping for. Statistically,
> > it is an improvement. Several microbenchmarks improve between 0-2%, and Egypt
> 
> I wouldn't be disappointed. That's pretty nice actually. Additionally,
> I read some stuff that said this decreases energy consumption as well.
> 

Yeah - the disappointing part was it had absolutely no impact on the benchmark I
was looking to improve.

> > improves by similar amounts. No benchmarks regress with n=10. BDW and SKL also
> > support this, testing would be welcome.
> 
> SKL does, but I don't think BDW does. In fact, there's a couple of
> assertions that say the same thing in this patch. :)
> 

Oops. Yes, BDW does not support it. I'm glad I made this mistake though and that
you caught it, because it made me find the parts in the spec again. I'll leave
out the details because I am not sure what's safe to publish, but the first part
is:
"Note: The use of sampler messages with response length of zero is not
recommended unless large polygons are being drawn, such as would be found in a
GUI workload."

> > The ministat with n=10
> > Egypt:
> >    1.94611% +/- 1.35665%
> >
> > Egpyt Offscreen:
> >    1.2429% +/- 0.834759%
> >
> > v2: Modify the algorithm. Instead of iterating in reverse through blocks and
> > insts, since the last block/inst is the only thing which can benefit.
> > Rebased on top of Ken's patching modifying is_last_send
> >
> > Cc: Kenneth Graunke <kenneth at whitecape.org>
> > Cc: Jason Ekstrand <jason.ekstrand at intel.com>
> > Signed-off-by: Ben Widawsky <ben at bwidawsk.net>
> > ---
> >  src/mesa/drivers/dri/i965/brw_fs.cpp               | 101 +++++++++++++++++++++
> >  src/mesa/drivers/dri/i965/brw_fs.h                 |   3 +
> >  .../dri/i965/brw_fs_dead_code_eliminate.cpp        |   2 +-
> >  src/mesa/drivers/dri/i965/brw_fs_generator.cpp     |  12 +++
> >  4 files changed, 117 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > index 78feee2..4b8fb4a 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > @@ -2452,6 +2452,106 @@ fs_visitor::opt_algebraic()
> >     return progress;
> >  }
> >
> > +/* Set up the sampler message to also have EOT set */
> > +void
> > +fs_visitor::combine_send_tex(struct bblock_t *block,
> > +                             fs_inst *tex_inst,
> > +                             fs_inst *fb_write)
> > +{
> > +   assert((tex_inst->offset & (0xff << 24)) == 0);
> > +   assert(!tex_inst->eot); /* We can't get here twice */
> > +
> > +   fs_inst *old_load_payload = (fs_inst *) tex_inst->prev;
> > +   /* It's unlikely yet possible the load_payload is further above */
> > +   if (old_load_payload->opcode != SHADER_OPCODE_LOAD_PAYLOAD) {
> > +      perf_debug("Potential send collapse skipped: Where's LOAD_PAYLOAD?");
> 
> We've never used perf_debug() to note that individual optimization
> passes didn't work. I'm not sure how I feel about it.

I can/will scrap this.

> 
> > +      return;
> > +   }
> > +
> > +   tex_inst->offset |= fb_write->target << 24;
> > +   tex_inst->eot = true;
> > +   fb_write->remove(block);
> > +
> > +   /* The texture instruction with EOT requires a header. If one is already
> > +    * present, there is nothing to do.
> > +    */
> > +   if (tex_inst->header_present)
> > +      return;
> > +
> > +   /* If a header is not present, we need to create a new LOAD_PAYLOAD command
> > +    * with the same sources and a space saved for the header. Using a new
> > +    * destination register not only makes sure we have enough space, but it will
> > +    * make sure the dead code eliminator kills the old command.
> 
> s/command/instruction/
> 
> > +    */
> > +   tex_inst->mlen++;
> > +   tex_inst->header_present = true;
> > +
> > +   fs_reg send_header = vgrf(old_load_payload->sources + 1);
> > +   fs_reg *new_sources =
> > +      ralloc_array(mem_ctx, fs_reg, old_load_payload->sources + 1);
> > +
> > +   new_sources[0] = fs_reg();
> > +   for (int i = 0; i < old_load_payload->sources; i++)
> > +      new_sources[i+1] = old_load_payload->src[i];
> > +
> > +   /* The LOAD_PAYLOAD helper seems like the obvious choice here. However, it
> > +    * requires a lot of information about the sources to appropriately figure
> > +    * out the number of registers needed to be used. Given this stage in our
> > +    * optimization, we may not have the appropriate GRFs required by
> > +    * LOAD_PAYLOAD at this point (copy propogation). Therefore, we need to
> 
> typo: propagation
> 
> I'm not sure what "w e may not have the appropriate GRFs ..." means?

Here is the relevant part of the original IRC conversation:
jekstrand       [08:52:30] No, the problem is uniforms and immediates.
jekstrand       [08:52:58] They can't go in a LOAD_PAYLOAD directly because we don't know how many destination registers they take up.
jekstrand       [08:54:16] for LOAD_PAYLOAD to work, we need more information than a regular instruction.  We need to know how many destination registers a given source takes up, we need to know whether it needs to use the second-half quarter control for the MOV that gets generated, etc.
jekstrand       [08:54:34] Using GRF sources more-or-less gives us this.  Immediates don't.
bwidawks        [08:54:55] right - this is what confuses me though... the immediates seem to already be there.
jekstrand       [08:55:38] Right.  The immediates can get there through copy-propagation and that's fine.  However, they're not there when it's created.
jekstrand       [08:55:43] It's all a mess

Do you have a preferred way to state this concisely?

> 
> > +    * manually emit the instruction.
> > +    *
> > +    * jekstrand: "The helper is too automagic for what you're doing."
> 
> I'd leave this out.
> 
> > +    */
> > +   fs_inst *new_load_payload = new(mem_ctx) fs_inst(SHADER_OPCODE_LOAD_PAYLOAD,
> > +                                                    old_load_payload->exec_size,
> > +                                                    send_header,
> > +                                                    new_sources,
> > +                                                    old_load_payload->sources + 1);
> > +
> > +   new_load_payload->regs_written = old_load_payload->regs_written + 1;
> > +
> > +   tex_inst->insert_before(block, new_load_payload);
> > +   tex_inst->src[0] = send_header;
> > +
> 
> Extra line here.
> 
> > +}
> > +
> > +/* Try to collapse two sends into one.
> > + *
> > + * The first send should be sampling from a texture, the second send should be
> > + * writing out the returned information to the RT.
> 
> It's nice to give some information about hardware requirements.
> Instead how about something like:
> 
> CHV, SKL, and newer can mark a texturing SEND instruction with EOT to
> have its results sent directly to the framebuffer, bypassing the EU.
> 
> Recognize texturing results copied to the framebuffer write payload
> and modify them to write to the framebuffer directly.
> 
> > + */
> > +bool
> > +fs_visitor::opt_sampler_eot()
> > +{
> > +   brw_wm_prog_key *key = (brw_wm_prog_key*) this->key;
> > +
> > +   if (brw->gen < 9 && !brw->is_cherryview)
> > +      return false;
> > +
> > +   if (key->nr_color_regions != 1)
> > +      return false;
> 
> To confirm, there's not anything that actually prevents us from doing
> this with the last write, is there? If not, let's put a FINISHME
> comment here.
> 

I am not actually sure if it is, or isn't possible - though I believe it is.
I'll add this and mention that the finisher needs to verify it's possible first.

> > +
> > +   fs_inst *fb_write = (fs_inst *) cfg->blocks[cfg->num_blocks - 1]->end();
> > +   fs_inst *tex_inst = (fs_inst *) fb_write->prev;
> > +
> > +   if (tex_inst->is_head_sentinel())
> > +      return false;
> > +
> > +   /* It will potentially be multiple iterations of optimizations before we get
> > +    * the condition we wish to optimize for.
> > +    */
> > +   if (!tex_inst->is_tex())
> > +      return false;
> > +
> 
> Can probably remove this if we move this optimization outside of the
> loop (as mentioned below).
> 
> > +   assert(fb_write->eot);
> > +   assert(fb_write->opcode == FS_OPCODE_FB_WRITE);
> > +
> > +   combine_send_tex(cfg->blocks[cfg->num_blocks - 1], tex_inst, fb_write);
> 
> I'm not sure I see a lot of value in the separate combine_send_tex()
> function. It's not that big and it's not reused. Inline it?
> 
> > +
> > +   return true;
> > +}
> > +
> >  bool
> >  fs_visitor::opt_register_renaming()
> >  {
> > @@ -3609,6 +3709,7 @@ fs_visitor::optimize()
> >        OPT(opt_peephole_predicated_break);
> >        OPT(opt_cmod_propagation);
> >        OPT(dead_code_eliminate);
> > +      OPT(opt_sampler_eot);
> 
> Do you think we really need to do this in the optimization loop?
> 
> I don't expect this to allow other optimization passes to make
> additional progress, and we can obviously do it successfully only
> once. I suspect we can do it after the optimization loop.
> 

It's possible I didn't quite spot where you want me to put the optimization. I
think that the way the code works right now, that will not work. The
optimization is depending on DCE to kill off the only LOAD_PAYLOAD and it's
corresponding MOVs. I agree that it is an optimization that can only occur once,
and generally it doesn't belong in the progress loop though.

> >        OPT(opt_peephole_sel);
> >        OPT(dead_control_flow_eliminate, this);
> >        OPT(opt_register_renaming);
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.h b/src/mesa/drivers/dri/i965/brw_fs.h
> > index b95e2c0..c9f5ab0 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.h
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.h
> > @@ -459,6 +459,9 @@ public:
> >     bool compute_to_mrf();
> >     bool dead_code_eliminate();
> >     bool remove_duplicate_mrf_writes();
> > +
> > +   void combine_send_tex(bblock_t *block, fs_inst *tex_inst, fs_inst *fb_write);
> > +   bool opt_sampler_eot();
> >     bool virtual_grf_interferes(int a, int b);
> >     void schedule_instructions(instruction_scheduler_mode mode);
> >     void insert_gen4_send_dependency_workarounds();
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp b/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp
> > index d66808b..e5e52c8 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs_dead_code_eliminate.cpp
> > @@ -52,7 +52,7 @@ fs_visitor::dead_code_eliminate()
> >               sizeof(BITSET_WORD));
> >
> >        foreach_inst_in_block_reverse(fs_inst, inst, block) {
> > -         if (inst->dst.file == GRF && !inst->has_side_effects()) {
> > +         if (!inst->eot && inst->dst.file == GRF && !inst->has_side_effects()) {
> 
> I think it'd probably make sense to just put inst->eot in has_side_effects().
> 
> >              bool result_live = false;
> >
> >              if (inst->regs_written == 1) {
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> > index 8cd36f8..9f2ebaf 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
> > @@ -503,6 +503,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
> >     int rlen = 4;
> >     uint32_t simd_mode;
> >     uint32_t return_format;
> > +   bool is_combined_send = inst->eot;
> >
> >     switch (dst.type) {
> >     case BRW_REGISTER_TYPE_D:
> > @@ -662,6 +663,11 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
> >        dst = vec16(dst);
> >     }
> >
> > +   if (is_combined_send) {
> > +      assert(brw->gen >= 9 || brw->is_cherryview);
> > +      rlen = 0;
> > +   }
> > +
> >     assert(brw->gen < 7 || !inst->header_present ||
> >            src.file == BRW_GENERAL_REGISTER_FILE);
> >
> > @@ -777,6 +783,12 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
> >         * so has already done marking.
> >         */
> >     }
> > +
> > +   if (is_combined_send) {
> > +      brw_inst_set_eot(brw, brw_last_inst, true);
> > +      brw_set_dest(p, brw_last_inst, brw_null_reg());
> 
> I think you should set the destination of the SEND to the null
> register in your optimization and remove this line.
> 
> That would also mean you didn't need to modify dead code elimination
> (since the SEND wouldn't have a GRF destination).
> 
> > +      brw_inst_set_opcode(brw, brw_last_inst, BRW_OPCODE_SENDC);
> > +   }
> >  }

Everything I didn't comment on sounds good to me, and I'll give it a shot when I
find some free time. Unfortunately, this kind of optimization I think does
require real benchmarks as opposed to shader-db, since it won't be eliminating
many instructions. Right now, benchmarking on BSW and SKL is both painfully slow
:-(