[Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks

Tapani Pälli tapani.palli at intel.com
Tue Aug 15 05:08:51 UTC 2017



On 08/14/2017 08:58 PM, Scott D Phillips wrote:
> Tapani Pälli <tapani.palli at intel.com> writes:
> 
>> Hello;
>>
>> On 07/25/2017 05:17 PM, Marathe, Yogesh wrote:
>>> Hi Matt, Sorry for late reply, please see below.
>>>
>>>> -----Original Message-----
>>>> From: mesa-dev [mailto:mesa-dev-bounces at lists.freedesktop.org] On Behalf
>>>> Of Matt Turner
>>>> Sent: Saturday, July 22, 2017 12:12 AM
>>>> To: Muthukumar, Aravindan <aravindan.muthukumar at intel.com>
>>>> Cc: mesa-dev at lists.freedesktop.org; Marathe, Yogesh
>>>> <yogesh.marathe at intel.com>
>>>> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
>>>>
>>>> On 07/21, aravindan.muthukumar at intel.com wrote:
>>>>> From: Aravindan Muthukumar <aravindan.muthukumar at intel.com>
>>>>>
>>>>> This patch improves CPI Rate(Cycles per Instruction) and branch miss
>>>>> predict for i965. The function check_state() was showing CPI retired rate.
>>>>>
>>>>> Performance stats with android:
>>>>> - CPI retired lowered by 28% (lower is better)
>>>>> - Branch missprediction lowered by 13% (lower is better)
>>>>> - 3DMark improved by 2%
>>>>>
>>>>> The dissassembly doesn't show difference, although above results were
>>>>> observed with patch.
>>>>
>>>
>>> Yes this is true for V3 where we removed the function based on a review comment.
>>>
>>>> If there's no difference in the assembly then whatever measurements you have
>>>> taken must be noise.
>>>>
>>>
>>> No that's not guaranteed either. Lot of things still depend on how instructions are
>>> aligned, sometimes even changing linking order gives different results where
>>> disassemblies of individual functions remain same.
>>>    
>>>> I applied the patch and inspected brw_state_upload.o. There are assembly
>>>> differences. I can produce the same assembly as this patch by just pulling the if
>>>> statement out of check_and_emit_atom() and into the caller. The replacement
>>>> of check_state() with a macro is completely unnecessary.
>>>>
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>> b/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>> index acaa97ee7d..b163e1490d 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
>>>> @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw,  }
>>>>
>>>>    static inline void
>>>> -check_and_emit_atom(struct brw_context *brw,
>>>> -                    struct brw_state_flags *state,
>>>> -                    const struct brw_tracked_state *atom)
>>>> +emit_atom(struct brw_context *brw,
>>>> +          struct brw_state_flags *state,
>>>> +          const struct brw_tracked_state *atom)
>>>>    {
>>>> -   if (check_state(state, &atom->dirty)) {
>>>> -      atom->emit(brw);
>>>> -      merge_ctx_state(brw, state);
>>>> -   }
>>>> +   atom->emit(brw);
>>>> +   merge_ctx_state(brw, state);
>>>>    }
>>>>
>>>>    static inline void
>>>> @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>>>>    	 const struct brw_tracked_state *atom = &atoms[i];
>>>>    	 struct brw_state_flags generated;
>>>>
>>>> -         check_and_emit_atom(brw, &state, atom);
>>>> +         if (check_state(&state, &atom->dirty)) {
>>>> +            emit_atom(brw, &state, atom);
>>>> +         }
>>>>
>>>>    	 accumulate_state(&examined, &atom->dirty);
>>>>
>>>> @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
>>>>          for (i = 0; i < num_atoms; i++) {
>>>>    	 const struct brw_tracked_state *atom = &atoms[i];
>>>>
>>>> -         check_and_emit_atom(brw, &state, atom);
>>>> +         if (check_state(&state, &atom->dirty)) {
>>>> +            emit_atom(brw, &state, atom);
>>>> +         }
>>>>          }
>>>>       }
>>>>
>>>>
>>>> With that said, the assembly differences are not understandable to me.
>>>> Why should extracting an if statement from a static inline function into the caller
>>>> of that function cause any difference whatsoever?
>>>
>>> Agreed, it shouldn't in case of static inline.
>>>
>>>>
>>>> I'm viewing the assembly differences with:
>>>>
>>>> wdiff -n \
>>>>       -w $'\033[30;41m' -x $'\033[0m' \
>>>>       -y $'\033[30;42m' -z $'\033[0m' \
>>>>       <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \
>>>>       <(objdump -d brw_state_upload.o.wtf    | sed -e 's/^.*\t//') | less -R
>>>>
>>>> and the only real difference is the movement of some call and jmp instructions.
>>>>
>>>> I cannot take the patch without some reasonable explanation for the change.
>>>
>>> Ok I think this has been discussed already and we agree that there is no big visible difference
>>> in disassembly which can be pointed out for improvement. Although, as you said this movement
>>> of instructions can cause this. If with this patch instructions get cache aligned that too can show
>>> improvement.  This is a busy function with bad CPI. Hence chosen for optimization. Branch miss
>>> predict is another metric.  Do we want to consider all these or just disassembly?
>>
>> I don't have a reasonable explanation to give but I made some benchmark
>> runs today (3DMark "Ice Storm Unlimited" test ) and I'm getting
>> consistently 1-2% better results with the patch. I also tried the
>> mentioned modification "pulling the if statement out of
>> check_and_emit_atom() and into the caller" and it has same performance
>> benefits. What I can tell from assembly dump is that
>> brw_upload_render_state function becomes slightly shorter, there's
>> movement with call and jmp and dump with the patch has less mov and nopl
>> calls in total.
> 
> FWIW, I see no difference in the assembly emitted with clang-4.0.1 for
> either of the patches suggested in this thread, although I see my old
> gcc-4.8 ignore the inline request for check_and_emit_atom. That's really
> all that makes sense to me here, that the inlining is getting
> ignored. Someone should check that the same performance increase comes
> from s/inline/ALWAYS_INLINE/ for check_and_emit_atom.

Forgot to mention that for that investigation I compiled with gcc 6.4.1, 
on Android we compile with clang and using -O3 so results might differ 
but I wanted to see the difference on desktop.

I'll try also using ALWAYS_INLINE.

Thanks;


>>> Let me make one more attempt, we clearly see icache misses for brw_upload_pipeline_state also
>>> reduced with patch against without patch. Do you think that too is noise with *.o.wtf?
>>> I would be happy to learn if that's the case to understand all this better.
>>>
>>> I believe there are two reasons why this should go in
>>> 1. It consistently benefits android (we are ok to rename it), helps reduce overhead
>>> 2. It doesn't harm other platforms
>>>
>>> We are ok to drop it if above two claims can be negated with someone else other than us testing
>>> this. Otherwise can you please reconsider?
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev


More information about the mesa-dev mailing list