[Mesa-dev] [PATCH] i965 : Performance Improvement

Marathe, Yogesh yogesh.marathe at intel.com
Fri Jul 14 06:38:22 UTC 2017


> -----Original Message-----
> From: Kenneth Graunke [mailto:kenneth at whitecape.org]
> Sent: Friday, July 14, 2017 11:20 AM
> To: Marathe, Yogesh <yogesh.marathe at intel.com>
> Cc: mesa-dev at lists.freedesktop.org; Muthukumar, Aravindan
> <aravindan.muthukumar at intel.com>
> Subject: Re: [Mesa-dev] [PATCH] i965 : Performance Improvement
> 
> On Thursday, July 13, 2017 9:49:40 PM PDT Marathe, Yogesh wrote:
> > Kenneth,
> >
> > > -----Original Message-----
> > > From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On
> > > Behalf Of Kenneth Graunke
> > > Sent: Friday, July 14, 2017 10:05 AM
> > > To: mesa-dev at lists.freedesktop.org
> > > Cc: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>
> > > Subject: Re: [Mesa-dev] [PATCH] i965 : Performance Improvement
> > >
> > > On Thursday, July 13, 2017 9:09:09 PM PDT
> > > aravindan.muthukumar at intel.com
> > > wrote:
> > > > From: Aravindan M <aravindan.muthukumar at intel.com>
> > >
> > > The commit title should be something like, "i965: Optimize atom
> > > state flag checks" rather than a generic "Performance Improvement"
> > >
> > > > This patch improves CPI Rate(Cycles per Instruction) and CPU time
> > > > utilization for i965. The functions check_state and
> > > > brw_pipeline_state_finished was found poor CPU utilization from
> > > > performance analysis.
> > >
> > > Need actual data here, or show assembly quality improvements.
> > >
> > > > Change-Id: I17c7e719a16e222764217a0e67b4482748537b67
> > > > Signed-off-by: Aravindan M <aravindan.muthukumar at intel.com>
> > > > Reviewed-by: Yogesh M <yogesh.marathe at intel.com>
> > > > Tested-by: Asish <asish at intel.com>
> > > > ---
> > > >  src/mesa/drivers/dri/i965/brw_defines.h      |  3 +++
> > > >  src/mesa/drivers/dri/i965/brw_state_upload.c | 14 +++++++++++---
> > > >  2 files changed, 14 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_defines.h
> > > > b/src/mesa/drivers/dri/i965/brw_defines.h
> > > > index a4794c6..60f88ca 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_defines.h
> > > > +++ b/src/mesa/drivers/dri/i965/brw_defines.h
> > > > @@ -1681,3 +1681,6 @@ enum brw_pixel_shader_coverage_mask_mode
> {
> > > >  # define GEN8_L3CNTLREG_ALL_ALLOC_MASK     INTEL_MASK(31, 25)
> > > >
> > > >  #endif
> > > > +
> > > > +/* Checking the state of mesa and brw before emitting atoms */
> > > > +#define CHECK_BRW_STATE(a,b) ((a.mesa & b.mesa) | (a.brw &
> > > > +b.brw))
> > > > diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
> > > > b/src/mesa/drivers/dri/i965/brw_state_upload.c
> > > > index 5e82c1b..434decf 100644
> > > > --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> > > > +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> > > > @@ -515,7 +515,10 @@ brw_upload_pipeline_state(struct brw_context
> *brw,
> > > >  	 const struct brw_tracked_state *atom = &atoms[i];
> > > >  	 struct brw_state_flags generated;
> > > >
> > > > -         check_and_emit_atom(brw, &state, atom);
> > > > +         /* Checking the state and emitting the atoms */
> > > > +         if (CHECK_BRW_STATE(state, atom->dirty)) {
> > > > +            check_and_emit_atom(brw, &state, atom);
> > > > +         }
> > > >
> > > >  	 accumulate_state(&examined, &atom->dirty);
> > > >
> > > > @@ -532,7 +535,10 @@ brw_upload_pipeline_state(struct brw_context
> *brw,
> > > >        for (i = 0; i < num_atoms; i++) {
> > > >  	 const struct brw_tracked_state *atom = &atoms[i];
> > > >
> > > > -         check_and_emit_atom(brw, &state, atom);
> > > > +         /* Checking the state and emitting the atoms */
> > > > +         if (CHECK_BRW_STATE(state, atom->dirty)) {
> > > > +            check_and_emit_atom(brw, &state, atom);
> > > > +         }
> > >
> > > This doesn't make any sense...the very first thing
> > > check_and_emit_atom() does is call check_state(), which does the
> > > exact thing your CHECK_BRW_STATE macro does.  So you're just needlessly
> checking the same thing twice.
> > >
> >
> > Sorry Kenneth, This is incomplete patch. The original patch that I
> > reviewed also had if check removed as below
> >
> > --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> > +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> > @@ -417,10 +417,8 @@ check_and_emit_atom(struct brw_context *brw,
> >                      struct brw_state_flags *state,
> >                      const struct brw_tracked_state *atom)  {
> > -   if (check_state(state, &atom->dirty)) {
> >     atom->emit(brw);
> >     merge_ctx_state(brw, state);
> > -   }
> >  }
> >
> > Aravindan will push another set.
> >
> > > The only reason I could see this helping is if check_state() wasn't
> > > inlined, but a release build with -O2 definitely inlines both
> > > check_and_emit_atom() and check_state().
> > >
> > > Are you using GCC?  What are your CFLAGS?  -O2?  I hope you're not
> > > trying to optimize a debug build...
> > >
> >
> > Yes we are using O2 and its clang on android and it's not debug.
> 
> Okay.  I just built with Clang 4.0.1 and -O2 and both check_state and
> check_and_emit_atom() are inlined into the atom loop in
> brw_upload_pipeline_state().
> 
> So I'm still not sure how this would improve anything.

Yes, the improvement is not huge per say but we essentially see CPI and 
cpu utilization is coming down with this. We also see slightly improved scores 
on graphics benchmarks, particularly 3dmark with the patch. If this was 
optimized out by compiler we shouldn't have seen the difference on same
build with and without patch. We'll confirm the clang version.

I think this removes branch instructions least and being in busy path this will
have an impact, provided compiler doesn't do it, as you rightly mentioned.

> 
> > > >        }
> > > >     }
> > > >
> > > > @@ -567,7 +573,9 @@ brw_pipeline_state_finished(struct brw_context
> *brw,
> > > >           brw->state.pipelines[i].mesa |= brw->NewGLState;
> > > >           brw->state.pipelines[i].brw |= brw->ctx.NewDriverState;
> > > >        } else {
> > > > -         memset(&brw->state.pipelines[i], 0, sizeof(struct brw_state_flags));
> > > > +         /* Avoiding the memset with initialization */
> > > > +         brw->state.pipelines[i].mesa = 0;
> > > > +         brw->state.pipelines[i].brw = 0ull;
> > > >        }
> > > >     }
> > >
> 
> Clang 4.0.1 optimizes away the memset as well.


More information about the mesa-dev mailing list