[Mesa-dev] [PATCH] i965 : Performance Improvement

Marathe, Yogesh yogesh.marathe at intel.com
Fri Jul 14 04:49:40 UTC 2017


Kenneth,

> -----Original Message-----
> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On Behalf
> Of Kenneth Graunke
> Sent: Friday, July 14, 2017 10:05 AM
> To: mesa-dev at lists.freedesktop.org
> Cc: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>
> Subject: Re: [Mesa-dev] [PATCH] i965 : Performance Improvement
> 
> On Thursday, July 13, 2017 9:09:09 PM PDT aravindan.muthukumar at intel.com
> wrote:
> > From: Aravindan M <aravindan.muthukumar at intel.com>
> 
> The commit title should be something like, "i965: Optimize atom state flag
> checks" rather than a generic "Performance Improvement"
> 
> > This patch improves CPI Rate(Cycles per Instruction) and CPU time
> > utilization for i965. The functions check_state and
> > brw_pipeline_state_finished was found poor CPU utilization from
> > performance analysis.
> 
> Need actual data here, or show assembly quality improvements.
> 
> > Change-Id: I17c7e719a16e222764217a0e67b4482748537b67
> > Signed-off-by: Aravindan M <aravindan.muthukumar at intel.com>
> > Reviewed-by: Yogesh M <yogesh.marathe at intel.com>
> > Tested-by: Asish <asish at intel.com>
> > ---
> >  src/mesa/drivers/dri/i965/brw_defines.h      |  3 +++
> >  src/mesa/drivers/dri/i965/brw_state_upload.c | 14 +++++++++++---
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_defines.h
> > b/src/mesa/drivers/dri/i965/brw_defines.h
> > index a4794c6..60f88ca 100644
> > --- a/src/mesa/drivers/dri/i965/brw_defines.h
> > +++ b/src/mesa/drivers/dri/i965/brw_defines.h
> > @@ -1681,3 +1681,6 @@ enum brw_pixel_shader_coverage_mask_mode {
> >  # define GEN8_L3CNTLREG_ALL_ALLOC_MASK     INTEL_MASK(31, 25)
> >
> >  #endif
> > +
> > +/* Checking the state of mesa and brw before emitting atoms */
> > +#define CHECK_BRW_STATE(a,b) ((a.mesa & b.mesa) | (a.brw & b.brw))
> > diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
> > b/src/mesa/drivers/dri/i965/brw_state_upload.c
> > index 5e82c1b..434decf 100644
> > --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> > +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> > @@ -515,7 +515,10 @@ brw_upload_pipeline_state(struct brw_context *brw,
> >  	 const struct brw_tracked_state *atom = &atoms[i];
> >  	 struct brw_state_flags generated;
> >
> > -         check_and_emit_atom(brw, &state, atom);
> > +         /* Checking the state and emitting the atoms */
> > +         if (CHECK_BRW_STATE(state, atom->dirty)) {
> > +            check_and_emit_atom(brw, &state, atom);
> > +         }
> >
> >  	 accumulate_state(&examined, &atom->dirty);
> >
> > @@ -532,7 +535,10 @@ brw_upload_pipeline_state(struct brw_context *brw,
> >        for (i = 0; i < num_atoms; i++) {
> >  	 const struct brw_tracked_state *atom = &atoms[i];
> >
> > -         check_and_emit_atom(brw, &state, atom);
> > +         /* Checking the state and emitting the atoms */
> > +         if (CHECK_BRW_STATE(state, atom->dirty)) {
> > +            check_and_emit_atom(brw, &state, atom);
> > +         }
> 
> This doesn't make any sense...the very first thing check_and_emit_atom() does
> is call check_state(), which does the exact thing your CHECK_BRW_STATE macro
> does.  So you're just needlessly checking the same thing twice.
> 

Sorry Kenneth, This is incomplete patch. The original patch that I reviewed also had 
if check removed as below

--- a/src/mesa/drivers/dri/i965/brw_state_upload.c
+++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
@@ -417,10 +417,8 @@ check_and_emit_atom(struct brw_context *brw,
                     struct brw_state_flags *state,
                     const struct brw_tracked_state *atom)
 {
-   if (check_state(state, &atom->dirty)) {
    atom->emit(brw);
    merge_ctx_state(brw, state);
-   }
 }

Aravindan will push another set. 

> The only reason I could see this helping is if check_state() wasn't inlined, but a
> release build with -O2 definitely inlines both check_and_emit_atom() and
> check_state().
> 
> Are you using GCC?  What are your CFLAGS?  -O2?  I hope you're not trying to
> optimize a debug build...
> 

Yes we are using O2 and its clang on android and it's not debug.

> >        }
> >     }
> >
> > @@ -567,7 +573,9 @@ brw_pipeline_state_finished(struct brw_context *brw,
> >           brw->state.pipelines[i].mesa |= brw->NewGLState;
> >           brw->state.pipelines[i].brw |= brw->ctx.NewDriverState;
> >        } else {
> > -         memset(&brw->state.pipelines[i], 0, sizeof(struct brw_state_flags));
> > +         /* Avoiding the memset with initialization */
> > +         brw->state.pipelines[i].mesa = 0;
> > +         brw->state.pipelines[i].brw = 0ull;
> >        }
> >     }
> 
> This is a separate change.
> 
> I'm also not seeing why it's useful:
> 
> Assembly before (GCC 7.1.1, x86_64, -O2 -fno-omit-frame-pointer):
> 
> 00000000003e0380 <brw_render_state_finished>:
>   3e0380:       66 0f ef c0             pxor   %xmm0,%xmm0
>   3e0384:       55                      push   %rbp
>   3e0385:       8b 87 20 52 02 00       mov    0x25220(%rdi),%eax
>   3e038b:       c7 87 20 52 02 00 00    movl   $0x0,0x25220(%rdi)
>   3e0392:       00 00 00
>   3e0395:       48 89 e5                mov    %rsp,%rbp
>   3e0398:       09 87 38 52 02 00       or     %eax,0x25238(%rdi)
>   3e039e:       48 8b 87 38 4d 02 00    mov    0x24d38(%rdi),%rax
>   3e03a5:       5d                      pop    %rbp
>   3e03a6:       48 09 87 40 52 02 00    or     %rax,0x25240(%rdi)
>   3e03ad:       48 c7 87 38 4d 02 00    movq   $0x0,0x24d38(%rdi)
>   3e03b4:       00 00 00 00
>   3e03b8:       0f 11 87 28 52 02 00    movups %xmm0,0x25228(%rdi)
>   3e03bf:       c3                      retq
> 
> Assembly after:
> 
> 00000000003e0380 <brw_render_state_finished>:
>   3e0380:       55                      push   %rbp
>   3e0381:       8b 87 20 52 02 00       mov    0x25220(%rdi),%eax
>   3e0387:       c7 87 28 52 02 00 00    movl   $0x0,0x25228(%rdi)
>   3e038e:       00 00 00
>   3e0391:       09 87 38 52 02 00       or     %eax,0x25238(%rdi)
>   3e0397:       48 89 e5                mov    %rsp,%rbp
>   3e039a:       48 8b 87 38 4d 02 00    mov    0x24d38(%rdi),%rax
>   3e03a1:       48 c7 87 30 52 02 00    movq   $0x0,0x25230(%rdi)
>   3e03a8:       00 00 00 00
>   3e03ac:       48 09 87 40 52 02 00    or     %rax,0x25240(%rdi)
>   3e03b3:       c7 87 20 52 02 00 00    movl   $0x0,0x25220(%rdi)
>   3e03ba:       00 00 00
>   3e03bd:       48 c7 87 38 4d 02 00    movq   $0x0,0x24d38(%rdi)
>   3e03c4:       00 00 00 00
>   3e03c8:       5d                      pop    %rbp
>   3e03c9:       c3                      retq
>   3e03ca:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
> 
> That looks like more instructions to me...
> 
> Thanks for looking at ways to optimize our CPU overhead.  I'm sure there are
> some improvements we can make...


More information about the mesa-dev mailing list