[Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks

Tue Jul 25 14:17:48 UTC 2017

Hi Matt, Sorry for late reply, please see below.

> -----Original Message-----
> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On Behalf
> Of Matt Turner
> Sent: Saturday, July 22, 2017 12:12 AM
> To: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>
> Cc: mesa-dev at lists.freedesktop.org; Marathe, Yogesh
> <yogesh.marathe at intel.com>
> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
> 
> On 07/21, aravindan.muthukumar at intel.com wrote:
> >From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
> >
> >This patch improves CPI Rate(Cycles per Instruction) and branch miss
> >predict for i965. The function check_state() was showing CPI retired rate.
> >
> >Performance stats with android:
> >- CPI retired lowered by 28% (lower is better)
> >- Branch missprediction lowered by 13% (lower is better)
> >- 3DMark improved by 2%
> >
> >The dissassembly doesn't show difference, although above results were
> >observed with patch.
> 

Yes this is true for V3 where we removed the function based on a review comment.

> If there's no difference in the assembly then whatever measurements you have
> taken must be noise.
>

No that's not guaranteed either. Lot of things still depend on how instructions are
aligned, sometimes even changing linking order gives different results where
disassemblies of individual functions remain same.

> I applied the patch and inspected brw_state_upload.o. There are assembly
> differences. I can produce the same assembly as this patch by just pulling the if
> statement out of check_and_emit_atom() and into the caller. The replacement
> of check_state() with a macro is completely unnecessary.
> 
> 
> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
> b/src/mesa/drivers/dri/i965/brw_state_upload.c
> index acaa97ee7d..b163e1490d 100644
> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw,  }
> 
>  static inline void
> -check_and_emit_atom(struct brw_context *brw,
> -                    struct brw_state_flags *state,
> -                    const struct brw_tracked_state *atom)
> +emit_atom(struct brw_context *brw,
> +          struct brw_state_flags *state,
> +          const struct brw_tracked_state *atom)
>  {
> -   if (check_state(state, &atom->dirty)) {
> -      atom->emit(brw);
> -      merge_ctx_state(brw, state);
> -   }
> +   atom->emit(brw);
> +   merge_ctx_state(brw, state);
>  }
> 
>  static inline void
> @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>  	 const struct brw_tracked_state *atom = &atoms[i];
>  	 struct brw_state_flags generated;
> 
> -         check_and_emit_atom(brw, &state, atom);
> +         if (check_state(&state, &atom->dirty)) {
> +            emit_atom(brw, &state, atom);
> +         }
> 
>  	 accumulate_state(&examined, &atom->dirty);
> 
> @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>        for (i = 0; i < num_atoms; i++) {
>  	 const struct brw_tracked_state *atom = &atoms[i];
> 
> -         check_and_emit_atom(brw, &state, atom);
> +         if (check_state(&state, &atom->dirty)) {
> +            emit_atom(brw, &state, atom);
> +         }
>        }
>     }
> 
> 
> With that said, the assembly differences are not understandable to me.
> Why should extracting an if statement from a static inline function into the caller
> of that function cause any difference whatsoever?

Agreed, it shouldn't in case of static inline.

> 
> I'm viewing the assembly differences with:
> 
> wdiff -n \
>     -w $'\033[30;41m' -x $'\033[0m' \
>     -y $'\033[30;42m' -z $'\033[0m' \
>     <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \
>     <(objdump -d brw_state_upload.o.wtf    | sed -e 's/^.*\t//') | less -R
> 
> and the only real difference is the movement of some call and jmp instructions.
> 
> I cannot take the patch without some reasonable explanation for the change.

Ok I think this has been discussed already and we agree that there is no big visible difference
in disassembly which can be pointed out for improvement. Although, as you said this movement
of instructions can cause this. If with this patch instructions get cache aligned that too can show
improvement.  This is a busy function with bad CPI. Hence chosen for optimization. Branch miss
predict is another metric.  Do we want to consider all these or just disassembly?

Let me make one more attempt, we clearly see icache misses for brw_upload_pipeline_state also
reduced with patch against without patch. Do you think that too is noise with *.o.wtf?
I would be happy to learn if that's the case to understand all this better.

I believe there are two reasons why this should go in
1. It consistently benefits android (we are ok to rename it), helps reduce overhead
2. It doesn't harm other platforms

We are ok to drop it if above two claims can be negated with someone else other than us testing
this. Otherwise can you please reconsider?