[Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Sep 9 13:58:00 UTC 2016


On 09/09/16 14:20, Dave Gordon wrote:
> On 09/09/16 09:32, Tvrtko Ursulin wrote:
>>
>> On 08/09/16 17:40, Chris Wilson wrote:
>>> On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>
>>>> This removes the usage of intel_ring_emit in favour of
>>>> directly writing to the ring buffer.
>>>
>>> I have the same patch! But I called it out, for historical reasons.
>>
>> Yes, I know we talked about it in the past, but I did not think you would
>> find time to actually write it amongst all the other things.
>>
>>> Oh, except mine uses out[0]...out[N] because gcc prefers that over
>>> *out++ = ...
>>
>> It copes just fine with the latter here, for example:
>>
>>     *rbuf++ = cmd;
>>     *rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
>>     *rbuf++ = 0; /* upper addr */
>>     *rbuf++ = 0; /* value */
>>
>> Is:
>>
>>      3e9:       89 10                   mov    %edx,(%rax)
>>      3eb:       c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
>>      3f2:       c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
>>      3f9:       c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)
>>
>> And for the record, before this patch, with intel_ring_emit:
>>
>>      53a:       8b 53 3c                mov    0x3c(%rbx),%edx
>>      53d:       48 8b 4b 08             mov    0x8(%rbx),%rcx
>>      541:       89 04 11                mov    %eax,(%rcx,%rdx,1)
>
>>      544:       8b 43 3c                mov    0x3c(%rbx),%eax
>>      547:       48 8b 53 08             mov    0x8(%rbx),%rdx
>>      54b:       83 c0 04                add    $0x4,%eax
>>      54e:       89 43 3c                mov    %eax,0x3c(%rbx)
>>      551:       c7 04 02 04 01 00 00    movl   $0x104,(%rdx,%rax,1)
>
>>      558:       8b 43 3c                mov    0x3c(%rbx),%eax
>>      55b:       48 8b 53 08             mov    0x8(%rbx),%rdx
>>      55f:       83 c0 04                add    $0x4,%eax
>>      562:       89 43 3c                mov    %eax,0x3c(%rbx)
>>      565:       c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)
>
>>      56c:       8b 43 3c                mov    0x3c(%rbx),%eax
>>      56f:       48 8b 53 08             mov    0x8(%rbx),%rdx
>>      573:       83 c0 04                add    $0x4,%eax
>>      576:       89 43 3c                mov    %eax,0x3c(%rbx)
>>      579:       c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)
>>
>> Yuck :) At least they are not function calls to iowrite any more. :)
>
> Curious that the inlining wasn't doing a better job, though. For
> example, it's not preserving %eax as a local cache of 0x3c(%rbx).

Yeah I don't know. Even by employing the restrict keyword in various 
ways I couldn't make it do a better job.
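
For illustration, a minimal user-space sketch of the two styles being compared above. Everything named toy_* is made up for this example and is not the driver code; the 0x104 constant is simply the immediate visible in the listings. The per-dword helper has to go through the ring struct on every write, presumably because the compiler cannot prove that the dword store does not alias the struct itself, while the second variant keeps the write pointer in a local.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the real ring structure. */
struct toy_ring {
	void *vaddr;	/* CPU mapping of the ring */
	uint32_t tail;	/* byte offset of the next write */
	uint32_t size;	/* ring size in bytes */
};

/* Old style: vaddr and tail are read, and tail written back, per dword. */
static inline void toy_ring_emit(struct toy_ring *ring, uint32_t data)
{
	*(uint32_t *)((char *)ring->vaddr + ring->tail) = data;
	ring->tail += 4;
}

/* New style: reserve once, bump tail up front, hand back a plain pointer. */
static inline uint32_t *toy_ring_begin(struct toy_ring *ring,
				       uint32_t num_dwords)
{
	uint32_t *rbuf = (uint32_t *)((char *)ring->vaddr + ring->tail);

	/* Real code would wait for space and handle wrap; the toy asserts. */
	assert(ring->tail + num_dwords * 4 <= ring->size);
	ring->tail += num_dwords * 4;
	return rbuf;
}

/* Four dwords written through the per-dword helper. */
static void toy_flush_old(struct toy_ring *ring, uint32_t cmd)
{
	toy_ring_emit(ring, cmd);
	toy_ring_emit(ring, 0x104);
	toy_ring_emit(ring, 0); /* upper addr */
	toy_ring_emit(ring, 0); /* value */
}

/* The same four dwords written through a local pointer. */
static void toy_flush_new(struct toy_ring *ring, uint32_t cmd)
{
	uint32_t *rbuf = toy_ring_begin(ring, 4);

	*rbuf++ = cmd;
	*rbuf++ = 0x104;
	*rbuf++ = 0; /* upper addr */
	*rbuf++ = 0; /* value */
}
```

Both variants leave the same four dwords in the buffer and the same tail; only the amount of bookkeeping per write differs.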

>>>> intel_ring_emit was preventing the compiler from optimising the
>>>> fetch and increment of the current ring buffer pointer and
>>>> therefore generating very verbose code for every write.
>>>>
>>>> It had no useful purpose since all ringbuffer operations
>>>> are started and ended with intel_ring_begin and
>>>> intel_ring_advance respectively, with no bail out in the
>>>> middle possible, so it is fine to increment the tail in
>>>> intel_ring_begin and let the code manage the pointer
>>>> itself.
>
> Or you could have intel_ring_advance() take the updated local and use
> that to update the ring->tail. It could then check that you hadn't
> exceeded your allocation, OR that you had used exactly as much as you'd
> allocated. I'm sure I had a version that did that, long ago.

Sounds good to me.
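
A rough user-space sketch of that idea, under the assumption that begin only records the reservation and advance is the one place where tail moves. All toy_* names and the reserved_end field are hypothetical and do not exist in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ring; all offsets are in dwords. */
struct toy_ring {
	uint32_t *vaddr;	/* start of the ring */
	uint32_t tail;		/* next dword to be written */
	uint32_t reserved_end;	/* one past the current reservation */
	uint32_t size;		/* ring size in dwords */
};

/* Record the reservation and return the write pointer; tail is untouched. */
static uint32_t *toy_ring_begin(struct toy_ring *ring, uint32_t num_dwords)
{
	/* Real code would wait for space and handle wrap; the toy asserts. */
	assert(ring->tail + num_dwords <= ring->size);
	ring->reserved_end = ring->tail + num_dwords;
	return ring->vaddr + ring->tail;
}

/*
 * Take the caller's updated local pointer and fold it back into tail.
 * This is the strict variant: exactly the reserved amount must have been
 * used. The relaxed variant would only reject used > reserved_end.
 * Returns 0 on success and -1 on a mismatch, leaving tail unchanged.
 */
static int toy_ring_advance(struct toy_ring *ring, const uint32_t *rbuf)
{
	ptrdiff_t used = rbuf - ring->vaddr;

	if (used != (ptrdiff_t)ring->reserved_end)
		return -1;

	ring->tail = (uint32_t)used;
	return 0;
}
```

In the driver the mismatch would more likely be a debug-only warning than a return value; an int is used here only to keep the toy easy to exercise.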

>>>> Useless instruction removal amounts to approximately
>>>> 2384 bytes of saved text on my build.
>>>>
>>>> Not sure if this has any measurable performance
>>>> implications but executing a ton of useless instructions
>>>> on fast paths cannot be good.
>>>
>>> It does show up in perf.
>>
>> Cool.
>>
>>>> Patch is not fully polished, but it compiles and runs
>>>> on Gen9 at least.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem_context.c    |  62 ++--
>>>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
>>>>   drivers/gpu/drm/i915/i915_gem_gtt.c        |  57 ++--
>>>>   drivers/gpu/drm/i915/intel_display.c       | 113 ++++---
>>>>   drivers/gpu/drm/i915/intel_lrc.c           | 223 +++++++-------
>>>>   drivers/gpu/drm/i915/intel_mocs.c          |  43 +--
>>>>   drivers/gpu/drm/i915/intel_overlay.c       |  69 ++---
>>>>   drivers/gpu/drm/i915/intel_ringbuffer.c    | 480 +++++++++++++++--------------
>>>>   drivers/gpu/drm/i915/intel_ringbuffer.h    |  19 +-
>>>>   9 files changed, 555 insertions(+), 538 deletions(-)
>>>
>>> Hmm, mine is bigger.
>>>
>>>   drivers/gpu/drm/i915/i915_gem_context.c    |  85 ++--
>>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
>>>   drivers/gpu/drm/i915/i915_gem_gtt.c        |  62 +--
>>>   drivers/gpu/drm/i915/i915_gem_request.c    | 135 ++++-
>>>   drivers/gpu/drm/i915/i915_gem_request.h    |   2 +
>>>   drivers/gpu/drm/i915/intel_display.c       | 133 +++--
>>>   drivers/gpu/drm/i915/intel_lrc.c           | 188 ++++---
>>>   drivers/gpu/drm/i915/intel_lrc.h           |   2 -
>>>   drivers/gpu/drm/i915/intel_mocs.c          |  50 +-
>>>   drivers/gpu/drm/i915/intel_overlay.c       |  77 ++-
>>>   drivers/gpu/drm/i915/intel_ringbuffer.c    | 762 ++++++++++++-----------------
>>>   drivers/gpu/drm/i915/intel_ringbuffer.h    |  36 +-
>>>   12 files changed, 721 insertions(+), 848 deletions(-)
>>>
>>> (this includes moving the intel_ring_begin to i915_gem_request)
>>>
>>> plus an earlier
>>>
>>>   drivers/gpu/drm/i915/i915_gem_request.c |  26 ++---
>>>   drivers/gpu/drm/i915/intel_lrc.c        | 121 ++++++++---------------
>>>   drivers/gpu/drm/i915/intel_ringbuffer.c | 168 +++++++++++---------------------
>>>   drivers/gpu/drm/i915/intel_ringbuffer.h |  10 +-
>>>   4 files changed, 112 insertions(+), 213 deletions(-)
>>>
>>> since I wanted parts of it for emitting timelines.
>>
>> Ok what do you want to do?
>>
>> Regards,
>> Tvrtko
>
> You are giving up the possibility of wrapping over the end of the
> buffer, but we avoid doing that at present and it doesn't work on some
> hardware, so no great loss.
>
> What about
>
>      rptr = ring_begin(req, len);
>      if (unlikely(IS_ERR(rptr)))
>          return PTR_ERR(rptr);
>
> rather than
>
>      ret = intel_ring_begin(req, len, &rbuf);
>      if (ret)
>          return ret;
>
> (because returning values through ref parameters is ugly).

I tried that first, but then got reminded how error pointers make
gcc generate more verbose code. Especially in this case, where it would
do ERR_PTR followed by PTR_ERR for nothing. So I thought, since I am
shrinking it, let's go the whole hog.
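
To make the two calling conventions concrete, here is a small user-space sketch. The toy_err_ptr/toy_is_err/toy_ptr_err helpers are local imitations of the kernel's ERR_PTR/IS_ERR/PTR_ERR, and the toy_begin_* functions and their fixed 16-dword buffer are invented for the example. It says nothing about which form gcc compiles better; it only shows the encode/decode round trip that the error-pointer form adds on the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* User-space imitations of the kernel's error pointer helpers. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long err)
{
	return (void *)err;
}

static inline long toy_ptr_err(const void *ptr)
{
	return (long)ptr;
}

static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Hypothetical backing store standing in for the ring. */
static uint32_t toy_storage[16];

/* Variant 1: error code returned, pointer passed back by reference. */
static int toy_begin_ref(unsigned int num_dwords, uint32_t **rbuf)
{
	if (num_dwords > 16)
		return -ENOSPC;

	*rbuf = toy_storage;
	return 0;
}

/* Variant 2: pointer returned, error code encoded into it. */
static uint32_t *toy_begin_ptr(unsigned int num_dwords)
{
	if (num_dwords > 16)
		return toy_err_ptr(-ENOSPC);

	return toy_storage;
}

static int toy_emit_ref(unsigned int num_dwords)
{
	uint32_t *rbuf;
	int ret;

	ret = toy_begin_ref(num_dwords, &rbuf);
	if (ret)
		return ret;

	*rbuf = 0;
	return 0;
}

static int toy_emit_ptr(unsigned int num_dwords)
{
	uint32_t *rbuf = toy_begin_ptr(num_dwords);

	/* The error is decoded again right after having been encoded. */
	if (toy_is_err(rbuf))
		return (int)toy_ptr_err(rbuf);

	*rbuf = 0;
	return 0;
}
```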

> Do you have/do you want a Coccinelle script to automate this?

I don't, and why not - the more Coccinelle examples in the public domain, the
better. :)

Regards,

Tvrtko

