[Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks

Tue Aug 15 06:26:22 UTC 2017

On 08/15/2017 08:08 AM, Tapani Pälli wrote:
> 
> 
> On 08/14/2017 08:58 PM, Scott D Phillips wrote:
>> Tapani Pälli <tapani.palli at intel.com> writes:
>>
>>> Hello;
>>>
>>> On 07/25/2017 05:17 PM, Marathe, Yogesh wrote:
>>>> Hi Matt, Sorry for late reply, please see below.
>>>>
>>>>> -----Original Message-----
>>>>> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On 
>>>>> Behalf
>>>>> Of Matt Turner
>>>>> Sent: Saturday, July 22, 2017 12:12 AM
>>>>> To: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>
>>>>> Cc: mesa-dev at lists.freedesktop.org; Marathe, Yogesh
>>>>> <yogesh.marathe at intel.com>
>>>>> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag 
>>>>> checks
>>>>>
>>>>> On 07/21, aravindan.muthukumar at intel.com wrote:
>>>>>> From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
>>>>>>
>>>>>> This patch improves CPI Rate(Cycles per Instruction) and branch miss
>>>>>> predict for i965. The function check_state() was showing CPI 
>>>>>> retired rate.
>>>>>>
>>>>>> Performance stats with android:
>>>>>> - CPI retired lowered by 28% (lower is better)
>>>>>> - Branch missprediction lowered by 13% (lower is better)
>>>>>> - 3DMark improved by 2%
>>>>>>
>>>>>> The dissassembly doesn't show difference, although above results were
>>>>>> observed with patch.
>>>>>
>>>>
>>>> Yes this is true for V3 where we removed the function based on a 
>>>> review comment.
>>>>
>>>>> If there's no difference in the assembly then whatever measurements 
>>>>> you have
>>>>> taken must be noise.
>>>>>
>>>>
>>>> No that's not guaranteed either. Lot of things still depend on how 
>>>> instructions are
>>>> aligned, sometimes even changing linking order gives different 
>>>> results where
>>>> disassemblies of individual functions remain same.
>>>>> I applied the patch and inspected brw_state_upload.o. There are 
>>>>> assembly
>>>>> differences. I can produce the same assembly as this patch by just 
>>>>> pulling the if
>>>>> statement out of check_and_emit_atom() and into the caller. The 
>>>>> replacement
>>>>> of check_state() with a macro is completely unnecessary.
>>>>>
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>>> b/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>>> index acaa97ee7d..b163e1490d 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>>> @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw,  }
>>>>>
>>>>>    static inline void
>>>>> -check_and_emit_atom(struct brw_context *brw,
>>>>> -                    struct brw_state_flags *state,
>>>>> -                    const struct brw_tracked_state *atom)
>>>>> +emit_atom(struct brw_context *brw,
>>>>> +          struct brw_state_flags *state,
>>>>> +          const struct brw_tracked_state *atom)
>>>>>    {
>>>>> -   if (check_state(state, &atom->dirty)) {
>>>>> -      atom->emit(brw);
>>>>> -      merge_ctx_state(brw, state);
>>>>> -   }
>>>>> +   atom->emit(brw);
>>>>> +   merge_ctx_state(brw, state);
>>>>>    }
>>>>>
>>>>>    static inline void
>>>>> @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>>>>>         const struct brw_tracked_state *atom = &atoms[i];
>>>>>         struct brw_state_flags generated;
>>>>>
>>>>> -         check_and_emit_atom(brw, &state, atom);
>>>>> +         if (check_state(&state, &atom->dirty)) {
>>>>> +            emit_atom(brw, &state, atom);
>>>>> +         }
>>>>>
>>>>>         accumulate_state(&examined, &atom->dirty);
>>>>>
>>>>> @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>>>>>          for (i = 0; i < num_atoms; i++) {
>>>>>         const struct brw_tracked_state *atom = &atoms[i];
>>>>>
>>>>> -         check_and_emit_atom(brw, &state, atom);
>>>>> +         if (check_state(&state, &atom->dirty)) {
>>>>> +            emit_atom(brw, &state, atom);
>>>>> +         }
>>>>>          }
>>>>>       }
>>>>>
>>>>>
>>>>> With that said, the assembly differences are not understandable to me.
>>>>> Why should extracting an if statement from a static inline function 
>>>>> into the caller
>>>>> of that function cause any difference whatsoever?
>>>>
>>>> Agreed, it shouldn't in case of static inline.
>>>>
>>>>>
>>>>> I'm viewing the assembly differences with:
>>>>>
>>>>> wdiff -n \
>>>>>       -w $'\033[30;41m' -x $'\033[0m' \
>>>>>       -y $'\033[30;42m' -z $'\033[0m' \
>>>>>       <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \
>>>>>       <(objdump -d brw_state_upload.o.wtf    | sed -e 's/^.*\t//') 
>>>>> | less -R
>>>>>
>>>>> and the only real difference is the movement of some call and jmp 
>>>>> instructions.
>>>>>
>>>>> I cannot take the patch without some reasonable explanation for the 
>>>>> change.
>>>>
>>>> Ok I think this has been discussed already and we agree that there 
>>>> is no big visible difference
>>>> in disassembly which can be pointed out for improvement. Although, 
>>>> as you said this movement
>>>> of instructions can cause this. If with this patch instructions get 
>>>> cache aligned that too can show
>>>> improvement.  This is a busy function with bad CPI. Hence chosen for 
>>>> optimization. Branch miss
>>>> predict is another metric.  Do we want to consider all these or just 
>>>> disassembly?
>>>
>>> I don't have a reasonable explanation to give but I made some benchmark
>>> runs today (3DMark "Ice Storm Unlimited" test ) and I'm getting
>>> consistently 1-2% better results with the patch. I also tried the
>>> mentioned modification "pulling the if statement out of
>>> check_and_emit_atom() and into the caller" and it has same performance
>>> benefits. What I can tell from assembly dump is that
>>> brw_upload_render_state function becomes slightly shorter, there's
>>> movement with call and jmp and dump with the patch has less mov and nopl
>>> calls in total.
>>
>> FWIW, I see no difference in the assembly emitted with clang-4.0.1 for
>> either of the patches suggested in this thread, although I see my old
>> gcc-4.8 ignore the inline request for check_and_emit_atom. That's really
>> all that makes sense to me here, that the inlining is getting
>> ignored. Someone should check that the same performance increase comes
>> from s/inline/ALWAYS_INLINE/ for check_and_emit_atom.
> 
> Forgot to mention that for that investigation I compiled with gcc 6.4.1, 
> on Android we compile with clang and using -O3 so results might differ 
> but I wanted to see the difference on desktop.
> 
> I'll try also using ALWAYS_INLINE.

Indeed, we get the same performance increase with this. Android is 
compiled with clang 3.8 so it might not be as clever as clang-4.0.1. I 
also now tried with gcc 7.1.1 and there's no assembly difference with that.

I've sent a proposal to ALWAYS_INLINE so that we get this performance 
benefit with old compilers.

> Thanks;
> 
> 
>>>> Let me make one more attempt, we clearly see icache misses for 
>>>> brw_upload_pipeline_state also
>>>> reduced with patch against without patch. Do you think that too is 
>>>> noise with *.o.wtf?
>>>> I would be happy to learn if that's the case to understand all this 
>>>> better.
>>>>
>>>> I believe there are two reasons why this should go in
>>>> 1. It consistently benefits android (we are ok to rename it), helps 
>>>> reduce overhead
>>>> 2. It doesn't harm other platforms
>>>>
>>>> We are ok to drop it if above two claims can be negated with someone 
>>>> else other than us testing
>>>> this. Otherwise can you please reconsider?
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev