[Mesa-dev] [PATCH 2/2] i965: Optimize intel_batchbuffer_emit_dword().

Wed Jul 8 17:08:11 PDT 2015

On Wed, Jul 8, 2015 at 4:53 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Wed, Jul 08, 2015 at 03:33:17PM -0700, Matt Turner wrote:
>> On Wed, Jul 8, 2015 at 2:07 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > On Wed, Jul 08, 2015 at 02:00:02PM -0700, Matt Turner wrote:
>> >> By keeping a pointer to the next available location, we reduce the
>> >> number of memory accesses needed to write to the batchbuffer.
>> >>
>> >> A net ~7k reduction of .text size, 7.5k of which is from the change to
>> >> intel_batchbuffer_emit_dword().
>> >>
>> >>    text     data      bss      dec      hex  filename
>> >> 4943740   195152    26192  5165084   4ed01c  i965_dri.so before
>> >> 4936804   195152    26192  5158148   4eb504  i965_dri.so after
>> >>
>> >> Combined with the previous patch, improves performance of Synmark
>> >> OglBatch7 by 4.05914% +/- 1.49373% (n=270) on Haswell.
>> >> ---
>> >> Full disclosure: when testing on an IVB desktop, I measured a
>> >> regression in the same benchmark of -4.19005% +/- 1.15188% (n=30).
>> >> I don't have any explanation.
>> >
>> > The problem is that it seems to generate worse code with multiple
>> > adjacent emit_dwords. I have seen similar regressions when doing the
>> > same batch[index] to *batch++ elsewhere.
>> > -Chris
>>
>> That is in conflict with the data I've provided. In fact, I started by
>> noticing that if I added intel_batchbuffer_emit_dword* functions that
>> took multiple dwords that there was a reduction in .text size, so it's
>> something that I've considered.
>
> This is part of a disassembly of the index version of
> gen6_viewport_state.c::upload_viewport_state_pointers()
> (ivb with -march=native -Ofast)
>
>    0x0000000000000287 <+103>:   lea    0x1(%rax),%esi
>    0x000000000000028a <+106>:   mov    %si,0x22f24(%rdi)
>    0x0000000000000291 <+113>:   mov    %ecx,(%rdx,%rax,4)
>    0x0000000000000294 <+116>:   movzwl 0x22f24(%rdi),%eax
>    0x000000000000029b <+123>:   mov    0x24330(%rdi),%ecx
>    0x00000000000002a1 <+129>:   mov    0x22f08(%rdi),%rdx
>    0x00000000000002a8 <+136>:   lea    0x1(%rax),%esi
>    0x00000000000002ab <+139>:   mov    %si,0x22f24(%rdi)
>    0x00000000000002b2 <+146>:   mov    %ecx,(%rdx,%rax,4)
>    0x00000000000002b5 <+149>:   movzwl 0x22f24(%rdi),%eax
>    0x00000000000002bc <+156>:   mov    0x2480c(%rdi),%ecx
>    0x00000000000002c2 <+162>:   mov    0x22f08(%rdi),%rdx
>    0x00000000000002c9 <+169>:   lea    0x1(%rax),%esi
>    0x00000000000002cc <+172>:   mov    %si,0x22f24(%rdi)
>    0x00000000000002d3 <+179>:   mov    %ecx,(%rdx,%rax,4)
>    0x00000000000002d6 <+182>:   pop    %rbp
>    0x00000000000002d7 <+183>:   retq
>
> and for comparison the pointer version:
>
>    0x0000000000000280 <+96>:    lea    0x4(%rax),%rcx
>    0x0000000000000284 <+100>:   mov    %rcx,0x22f10(%rdi)
>    0x000000000000028b <+107>:   mov    %edx,(%rax)
>    0x000000000000028d <+109>:   mov    0x22f10(%rdi),%rax
>    0x0000000000000294 <+116>:   mov    0x24338(%rdi),%edx
>    0x000000000000029a <+122>:   lea    0x4(%rax),%rcx
>    0x000000000000029e <+126>:   mov    %rcx,0x22f10(%rdi)
>    0x00000000000002a5 <+133>:   mov    %edx,(%rax)
>    0x00000000000002a7 <+135>:   mov    0x22f10(%rdi),%rax
>    0x00000000000002ae <+142>:   mov    0x24814(%rdi),%edx
>    0x00000000000002b4 <+148>:   lea    0x4(%rax),%rcx
>    0x00000000000002b8 <+152>:   mov    %rcx,0x22f10(%rdi)
>    0x00000000000002bf <+159>:   mov    %edx,(%rax)
>    0x00000000000002c1 <+161>:   pop    %rbp
>    0x00000000000002c2 <+162>:   retq
>
> So in neither case does gcc avoid incrementing either the index or the
> pointer in the struct after emit_dword, and then reloads it for the
> next.
>
> This is what I expected to see
>    0x000000000000025e <+62>:    movl   $0x780d1c02,(%rcx)
>    0x0000000000000264 <+68>:    mov    0x22f08(%rdi),%rax
>    0x000000000000026b <+75>:    mov    0x24320(%rdi),%edx
>    0x0000000000000271 <+81>:    mov    %edx,0x4(%rax)
>    0x0000000000000274 <+84>:    mov    0x22f08(%rdi),%rax
>    0x000000000000027b <+91>:    mov    0x24338(%rdi),%edx
>    0x0000000000000281 <+97>:    mov    %edx,0x8(%rax)
>    0x0000000000000284 <+100>:   mov    0x22f08(%rdi),%rax
>    0x000000000000028b <+107>:   mov    0x24814(%rdi),%edx
>    0x0000000000000291 <+113>:   mov    %edx,0xc(%rax)
>    0x0000000000000294 <+116>:   addq   $0x10,0x22f08(%rdi)
>    0x000000000000029c <+124>:   pop    %rbp
>    0x000000000000029d <+125>:   retq
> with the pointer increments coalesced to the end. Generated by
> opencoding emit_dwords as
>
> static void upload_viewport_state_pointers(struct brw_context *brw)
> {
>    BEGIN_BATCH(4);
>    brw->batch.map[0] = (_3DSTATE_VIEWPORT_STATE_POINTERS << 16 | (4 - 2) |
>                         GEN6_CC_VIEWPORT_MODIFY |
>                         GEN6_SF_VIEWPORT_MODIFY |
>                         GEN6_CLIP_VIEWPORT_MODIFY);
>    brw->batch.map[1] = (brw->clip.vp_offset);
>    brw->batch.map[2] = (brw->sf.vp_offset);
>    brw->batch.map[3] = (brw->cc.vp_offset);
>    brw->batch.map += 4;
>    ADVANCE_BATCH();
> }
> -Chris

Ah, thanks. I see.

I'll give it another shot.