[Intel-gfx] [RFC] drm/i915: Emit to ringbuffer directly

Dave Gordon david.s.gordon at intel.com
Fri Sep 9 13:20:19 UTC 2016


On 09/09/16 09:32, Tvrtko Ursulin wrote:
>
> On 08/09/16 17:40, Chris Wilson wrote:
>> On Thu, Sep 08, 2016 at 04:12:55PM +0100, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>
>>> This removes the usage of intel_ring_emit in favour of
>>> directly writing to the ring buffer.
>>
>> I have the same patch! But I called it out, for historical reasons.
>
> Yes, I know we talked about it in the past, but I did not think you would
> find time to actually write it amongst all the other things.
>
>> Oh, except mine uses out[0]...out[N] because gcc prefers that over
>> *out++ = ...
>
> It copes just fine with the latter here, for example:
>
>     *rbuf++ = cmd;
>     *rbuf++ = I915_GEM_HWS_SCRATCH_ADDR | MI_FLUSH_DW_USE_GTT;
>     *rbuf++ = 0; /* upper addr */
>     *rbuf++ = 0; /* value */
>
> Is:
>
>      3e9:       89 10                   mov    %edx,(%rax)
>      3eb:       c7 40 04 04 01 00 00    movl   $0x104,0x4(%rax)
>      3f2:       c7 40 08 00 00 00 00    movl   $0x0,0x8(%rax)
>      3f9:       c7 40 0c 00 00 00 00    movl   $0x0,0xc(%rax)
>
> And for the record, before this patch, with intel_ring_emit:
>
>      53a:       8b 53 3c                mov    0x3c(%rbx),%edx
>      53d:       48 8b 4b 08             mov    0x8(%rbx),%rcx
>      541:       89 04 11                mov    %eax,(%rcx,%rdx,1)

>      544:       8b 43 3c                mov    0x3c(%rbx),%eax
>      547:       48 8b 53 08             mov    0x8(%rbx),%rdx
>      54b:       83 c0 04                add    $0x4,%eax
>      54e:       89 43 3c                mov    %eax,0x3c(%rbx)
>      551:       c7 04 02 04 01 00 00    movl   $0x104,(%rdx,%rax,1)

>      558:       8b 43 3c                mov    0x3c(%rbx),%eax
>      55b:       48 8b 53 08             mov    0x8(%rbx),%rdx
>      55f:       83 c0 04                add    $0x4,%eax
>      562:       89 43 3c                mov    %eax,0x3c(%rbx)
>      565:       c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)

>      56c:       8b 43 3c                mov    0x3c(%rbx),%eax
>      56f:       48 8b 53 08             mov    0x8(%rbx),%rdx
>      573:       83 c0 04                add    $0x4,%eax
>      576:       89 43 3c                mov    %eax,0x3c(%rbx)
>      579:       c7 04 02 00 00 00 00    movl   $0x0,(%rdx,%rax,1)
>
> Yuck :) At least they are not function calls to iowrite any more. :)

Curious that the inlining wasn't doing a better job, though. For 
example, it's not preserving %eax as a local cache of 0x3c(%rbx).
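
My guess at the cause, as a user-space sketch (struct ring and both
helpers below are made up for illustration, they are not the driver's
code): the store in the emit helper goes through a computed u32 pointer,
and since the kernel builds with -fno-strict-aliasing the compiler has
to assume that store may have modified ring->tail or ring->vaddr
themselves, so it reloads them after every write. A local pointer cannot
be aliased that way, so it can live in a register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's ring structure. */
struct ring {
	void *vaddr;	/* CPU mapping of the ring buffer */
	uint32_t tail;	/* byte offset of the next write */
};

/* Old style: every store goes through ring->vaddr and ring->tail. */
static inline void ring_emit(struct ring *ring, uint32_t data)
{
	*(uint32_t *)((char *)ring->vaddr + ring->tail) = data;
	ring->tail += 4;
}

/* New style: write through a local pointer and hand back the new end. */
static inline uint32_t *ring_emit_local(uint32_t *rbuf, uint32_t cmd)
{
	*rbuf++ = cmd;
	*rbuf++ = 0x104;
	*rbuf++ = 0;	/* upper addr */
	*rbuf++ = 0;	/* value */
	return rbuf;
}

/* Returns 1 if both styles leave the same four dwords in the buffer. */
static int styles_match(void)
{
	uint32_t a[4] = { 0 }, b[4] = { 0 };
	struct ring ring = { .vaddr = a, .tail = 0 };
	uint32_t *end;
	int i;

	ring_emit(&ring, 0x1234);
	ring_emit(&ring, 0x104);
	ring_emit(&ring, 0);
	ring_emit(&ring, 0);

	end = ring_emit_local(b, 0x1234);
	assert(end == b + 4);

	for (i = 0; i < 4; i++)
		if (a[i] != b[i])
			return 0;
	return ring.tail == 16;
}
```

Comparing the output of "gcc -O2 -S" for the two helpers should show the
same difference as the disassembly quoted above, though I have not
checked that on every compiler version.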

>>> intel_ring_emit was preventing the compiler from optimising
>>> the fetch and increment of the current ring buffer pointer and
>>> therefore generating very verbose code for every write.
>>>
>>> It had no useful purpose since all ringbuffer operations
>>> are started and ended with intel_ring_begin and
>>> intel_ring_advance respectively, with no bail out in the
>>> middle possible, so it is fine to increment the tail in
>>> intel_ring_begin and let the code manage the pointer
>>> itself.

Or you could have intel_ring_advance() take the updated local and use 
that to update the ring->tail. It could then check that you hadn't 
exceeded your allocation, OR that you had used exactly as much as you'd 
allocated. I'm sure I had a version that did that, long ago.
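
Roughly like this. It is a simplified user-space sketch, not the old
version I mentioned: struct ring, its fields and both functions are
hypothetical, there is no free-space check and no wrap handling, and it
implements the stricter "used exactly what you allocated" variant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified ring; not the driver's actual structure. */
struct ring {
	uint32_t *vaddr;	/* start of the buffer */
	uint32_t tail;		/* dword index of the next free slot */
	uint32_t reserved;	/* dwords handed out by the last ring_begin() */
};

/* Reserve len dwords and return a pointer to the first one. */
static uint32_t *ring_begin(struct ring *ring, uint32_t len)
{
	ring->reserved = len;
	return ring->vaddr + ring->tail;
}

/*
 * Commit the dwords written so far. Returns 0 on success, or -1 (leaving
 * the tail untouched) if the caller did not use exactly what it reserved.
 */
static int ring_advance(struct ring *ring, const uint32_t *rbuf)
{
	const uint32_t *start = ring->vaddr + ring->tail;

	if (rbuf < start || (uint32_t)(rbuf - start) != ring->reserved)
		return -1;

	ring->tail += ring->reserved;
	ring->reserved = 0;
	return 0;
}

/* Reserve two dwords, try to commit after one, then after both. */
static int demo_advance(void)
{
	uint32_t buf[8] = { 0 };
	struct ring ring = { .vaddr = buf, .tail = 0, .reserved = 0 };
	uint32_t *rbuf = ring_begin(&ring, 2);

	*rbuf++ = 1;
	if (ring_advance(&ring, rbuf) == 0)	/* 1 of 2 used: must fail */
		return -1;
	assert(ring.tail == 0);

	*rbuf++ = 2;
	if (ring_advance(&ring, rbuf) != 0)	/* exact match: must pass */
		return -1;
	return ring.tail == 2 ? 0 : -1;
}
```

In the driver the mismatch case would more likely be a WARN_ON or
GEM_BUG_ON than an error return; the return code here is only to keep
the sketch testable.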

>>> Useless instruction removal amounts to approximately
>>> 2384 bytes of saved text on my build.
>>>
>>> Not sure if this has any measurable performance
>>> implications but executing a ton of useless instructions
>>> on fast paths cannot be good.
>>
>> It does show up in perf.
>
> Cool.
>
>>> Patch is not fully polished, but it compiles and runs
>>> on Gen9 at least.
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gem_context.c    |  62 ++--
>>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  27 +-
>>>   drivers/gpu/drm/i915/i915_gem_gtt.c        |  57 ++--
>>>   drivers/gpu/drm/i915/intel_display.c       | 113 ++++---
>>>   drivers/gpu/drm/i915/intel_lrc.c           | 223 +++++++-------
>>>   drivers/gpu/drm/i915/intel_mocs.c          |  43 +--
>>>   drivers/gpu/drm/i915/intel_overlay.c       |  69 ++---
>>>   drivers/gpu/drm/i915/intel_ringbuffer.c    | 480 +++++++++++++++--------------
>>>   drivers/gpu/drm/i915/intel_ringbuffer.h    |  19 +-
>>>   9 files changed, 555 insertions(+), 538 deletions(-)
>>
>> Hmm, mine is bigger.
>>
>>   drivers/gpu/drm/i915/i915_gem_context.c    |  85 ++--
>>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
>>   drivers/gpu/drm/i915/i915_gem_gtt.c        |  62 +--
>>   drivers/gpu/drm/i915/i915_gem_request.c    | 135 ++++-
>>   drivers/gpu/drm/i915/i915_gem_request.h    |   2 +
>>   drivers/gpu/drm/i915/intel_display.c       | 133 +++--
>>   drivers/gpu/drm/i915/intel_lrc.c           | 188 ++++---
>>   drivers/gpu/drm/i915/intel_lrc.h           |   2 -
>>   drivers/gpu/drm/i915/intel_mocs.c          |  50 +-
>>   drivers/gpu/drm/i915/intel_overlay.c       |  77 ++-
>>   drivers/gpu/drm/i915/intel_ringbuffer.c    | 762 ++++++++++++-----------------
>>   drivers/gpu/drm/i915/intel_ringbuffer.h    |  36 +-
>>   12 files changed, 721 insertions(+), 848 deletions(-)
>>
>> (this includes moving the intel_ring_begin to i915_gem_request)
>>
>> plus an earlier
>>
>>   drivers/gpu/drm/i915/i915_gem_request.c |  26 ++---
>>   drivers/gpu/drm/i915/intel_lrc.c        | 121 ++++++++---------------
>>   drivers/gpu/drm/i915/intel_ringbuffer.c | 168 +++++++++++---------------------
>>   drivers/gpu/drm/i915/intel_ringbuffer.h |  10 +-
>>   4 files changed, 112 insertions(+), 213 deletions(-)
>>
>> since I wanted parts of it for emitting timelines.
>
> Ok what do you want to do?
>
> Regards,
> Tvrtko

You are giving up the possibility of wrapping over the end of the 
buffer, but we avoid doing that at present and it doesn't work on some 
hardware, so no great loss.

What about

	rptr = ring_begin(req, len);
	if (unlikely(IS_ERR(rptr)))
		return PTR_ERR(rptr);

rather than

	ret = intel_ring_begin(req, len, &rbuf);
	if (ret)
		return ret;

(because returning values through ref parameters is ugly).
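
To make the first form concrete, here is a user-space sketch. The
ERR_PTR/IS_ERR/PTR_ERR definitions are stand-ins for the kernel's
<linux/err.h> helpers so that it builds outside the kernel; struct ring,
RING_DWORDS, ring_begin() and emit_flush() are hypothetical, and
ring_begin() takes the ring rather than a request to keep things short.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's <linux/err.h> helpers. */
#define MAX_ERRNO	4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#define RING_DWORDS	16

/* Hypothetical ring with a small fixed capacity. */
struct ring {
	uint32_t buf[RING_DWORDS];
	uint32_t tail;	/* dword index of the next free slot */
};

/* Return a pointer to len free dwords, or an ERR_PTR() on failure. */
static uint32_t *ring_begin(struct ring *ring, uint32_t len)
{
	uint32_t *rptr;

	if (len > RING_DWORDS - ring->tail)
		return ERR_PTR(-ENOSPC);

	rptr = ring->buf + ring->tail;
	ring->tail += len;
	return rptr;
}

/* Caller side: one value carries either the pointer or the error. */
static int emit_flush(struct ring *ring, uint32_t cmd)
{
	uint32_t *rptr = ring_begin(ring, 4);

	if (IS_ERR(rptr))
		return (int)PTR_ERR(rptr);

	*rptr++ = cmd;
	*rptr++ = 0x104;
	*rptr++ = 0;	/* upper addr */
	*rptr++ = 0;	/* value */
	return 0;
}

/* Fill the ring with four flushes; the fifth should fail with -ENOSPC. */
static int demo_begin(void)
{
	struct ring ring = { .tail = 0 };
	int i;

	for (i = 0; i < 4; i++)
		assert(emit_flush(&ring, 0x1234) == 0);
	return emit_flush(&ring, 0x1234);
}
```

As far as I recall the kernel's IS_ERR() already carries an unlikely()
hint internally, so the explicit unlikely() around it should not be
needed.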

Do you have/do you want a Coccinelle script to automate this?

.Dave.

