[Mesa-dev] [PATCH] i965 : Performance Improvement
Kenneth Graunke
kenneth at whitecape.org
Fri Jul 14 04:35:05 UTC 2017
On Thursday, July 13, 2017 9:09:09 PM PDT aravindan.muthukumar at intel.com wrote:
> From: Aravindan M <aravindan.muthukumar at intel.com>
The commit title should be something like, "i965: Optimize atom state
flag checks" rather than a generic "Performance Improvement"
> This patch improves CPI Rate(Cycles per Instruction)
> and CPU time utilization for i965. The functions
> check_state and brw_pipeline_state_finished was found
> poor CPU utilization from performance analysis.
Need actual data here, or show assembly quality improvements.
> Change-Id: I17c7e719a16e222764217a0e67b4482748537b67
> Signed-off-by: Aravindan M <aravindan.muthukumar at intel.com>
> Reviewed-by: Yogesh M <yogesh.marathe at intel.com>
> Tested-by: Asish <asish at intel.com>
> ---
> src/mesa/drivers/dri/i965/brw_defines.h | 3 +++
> src/mesa/drivers/dri/i965/brw_state_upload.c | 14 +++++++++++---
> 2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_defines.h b/src/mesa/drivers/dri/i965/brw_defines.h
> index a4794c6..60f88ca 100644
> --- a/src/mesa/drivers/dri/i965/brw_defines.h
> +++ b/src/mesa/drivers/dri/i965/brw_defines.h
> @@ -1681,3 +1681,6 @@ enum brw_pixel_shader_coverage_mask_mode {
> # define GEN8_L3CNTLREG_ALL_ALLOC_MASK INTEL_MASK(31, 25)
>
> #endif
> +
> +/* Checking the state of mesa and brw before emitting atoms */
> +#define CHECK_BRW_STATE(a,b) ((a.mesa & b.mesa) | (a.brw & b.brw))
> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c
> index 5e82c1b..434decf 100644
> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c
> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
> @@ -515,7 +515,10 @@ brw_upload_pipeline_state(struct brw_context *brw,
> const struct brw_tracked_state *atom = &atoms[i];
> struct brw_state_flags generated;
>
> - check_and_emit_atom(brw, &state, atom);
> + /* Checking the state and emitting the atoms */
> + if (CHECK_BRW_STATE(state, atom->dirty)) {
> + check_and_emit_atom(brw, &state, atom);
> + }
>
> accumulate_state(&examined, &atom->dirty);
>
> @@ -532,7 +535,10 @@ brw_upload_pipeline_state(struct brw_context *brw,
> for (i = 0; i < num_atoms; i++) {
> const struct brw_tracked_state *atom = &atoms[i];
>
> - check_and_emit_atom(brw, &state, atom);
> + /* Checking the state and emitting the atoms */
> + if (CHECK_BRW_STATE(state, atom->dirty)) {
> + check_and_emit_atom(brw, &state, atom);
> + }
This doesn't make any sense...the very first thing check_and_emit_atom()
does is call check_state(), which does the exact thing your CHECK_BRW_STATE
macro does. So you're just needlessly checking the same thing twice.
The only reason I could see this helping is if check_state() wasn't inlined,
but a release build with -O2 definitely inlines both check_and_emit_atom()
and check_state().
Are you using GCC? What are your CFLAGS? -O2? I hope you're not trying
to optimize a debug build...
> }
> }
>
> @@ -567,7 +573,9 @@ brw_pipeline_state_finished(struct brw_context *brw,
> brw->state.pipelines[i].mesa |= brw->NewGLState;
> brw->state.pipelines[i].brw |= brw->ctx.NewDriverState;
> } else {
> - memset(&brw->state.pipelines[i], 0, sizeof(struct brw_state_flags));
> + /* Avoiding the memset with initialization */
> + brw->state.pipelines[i].mesa = 0;
> + brw->state.pipelines[i].brw = 0ull;
> }
> }
This is a separate change.
I'm also not seeing why it's useful:
Assembly before (GCC 7.1.1, x86_64, -O2 -fno-omit-frame-pointer):
00000000003e0380 <brw_render_state_finished>:
3e0380: 66 0f ef c0 pxor %xmm0,%xmm0
3e0384: 55 push %rbp
3e0385: 8b 87 20 52 02 00 mov 0x25220(%rdi),%eax
3e038b: c7 87 20 52 02 00 00 movl $0x0,0x25220(%rdi)
3e0392: 00 00 00
3e0395: 48 89 e5 mov %rsp,%rbp
3e0398: 09 87 38 52 02 00 or %eax,0x25238(%rdi)
3e039e: 48 8b 87 38 4d 02 00 mov 0x24d38(%rdi),%rax
3e03a5: 5d pop %rbp
3e03a6: 48 09 87 40 52 02 00 or %rax,0x25240(%rdi)
3e03ad: 48 c7 87 38 4d 02 00 movq $0x0,0x24d38(%rdi)
3e03b4: 00 00 00 00
3e03b8: 0f 11 87 28 52 02 00 movups %xmm0,0x25228(%rdi)
3e03bf: c3 retq
Assembly after:
00000000003e0380 <brw_render_state_finished>:
3e0380: 55 push %rbp
3e0381: 8b 87 20 52 02 00 mov 0x25220(%rdi),%eax
3e0387: c7 87 28 52 02 00 00 movl $0x0,0x25228(%rdi)
3e038e: 00 00 00
3e0391: 09 87 38 52 02 00 or %eax,0x25238(%rdi)
3e0397: 48 89 e5 mov %rsp,%rbp
3e039a: 48 8b 87 38 4d 02 00 mov 0x24d38(%rdi),%rax
3e03a1: 48 c7 87 30 52 02 00 movq $0x0,0x25230(%rdi)
3e03a8: 00 00 00 00
3e03ac: 48 09 87 40 52 02 00 or %rax,0x25240(%rdi)
3e03b3: c7 87 20 52 02 00 00 movl $0x0,0x25220(%rdi)
3e03ba: 00 00 00
3e03bd: 48 c7 87 38 4d 02 00 movq $0x0,0x24d38(%rdi)
3e03c4: 00 00 00 00
3e03c8: 5d pop %rbp
3e03c9: c3 retq
3e03ca: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
That looks like more instructions to me...
Thanks for looking at ways to optimize our CPU overhead. I'm sure there
are some improvements we can make...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part.
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170713/05235468/attachment.sig>
More information about the mesa-dev
mailing list