[Intel-gfx] [PATCH 9/9] drm/i915/gt: Schedule request retirement when timeline idles
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Wed Nov 20 14:16:17 UTC 2019
On 20/11/2019 13:39, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-11-20 13:16:51)
>>
>> On 20/11/2019 09:33, Chris Wilson wrote:
>>> The major drawback of commit 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX
>>> corruption WA") is that it disables RC6 while Skylake (and friends) is
>>> active, and we do not consider the GPU idle until all outstanding
>>> requests have been retired and the engine switched over to the kernel
>>> context. If userspace is idle, this task falls onto our background idle
>>> worker, which only runs roughly once a second, meaning that userspace has
>>> to have been idle for a couple of seconds before we enable RC6 again.
>>> Naturally, this causes us to consume considerably more energy than
>>> before as powersaving is effectively disabled while a display server
>>> (here's looking at you Xorg) is running.
>>>
>>> As execlists will get a completion event as each context is completed,
>>> we can use this interrupt to queue a retire worker bound to this engine
>>> to cleanup idle timelines. We will then immediately notice the idle
>>> engine (without userspace intervention or the aid of the background
>>> retire worker) and start parking the GPU. Thus during light workloads,
>>> we will do much more work to idle the GPU faster... Hopefully with
>>> commensurate power saving!
>>>
>>> v2: Watch context completions and only look at those local to the engine
>>> when retiring to reduce the amount of excess work we perform.
>>>
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=112315
>>> References: 7e34f4e4aad3 ("drm/i915/gen8+: Add RC6 CTX corruption WA")
>>> References: 2248a28384fe ("drm/i915/gen8+: Add RC6 CTX corruption WA")
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> ---
>>> drivers/gpu/drm/i915/gt/intel_engine_cs.c | 8 +-
>>> drivers/gpu/drm/i915/gt/intel_engine_types.h | 8 ++
>>> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 74 +++++++++++++++++++
>>> drivers/gpu/drm/i915/gt/intel_gt_requests.h | 17 ++++-
>>> drivers/gpu/drm/i915/gt/intel_lrc.c | 9 +++
>>> drivers/gpu/drm/i915/gt/intel_timeline.c | 1 +
>>> .../gpu/drm/i915/gt/intel_timeline_types.h | 3 +
>>> 7 files changed, 116 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> index b9613d044393..8f6e353caa66 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>> @@ -28,13 +28,13 @@
>>>
>>> #include "i915_drv.h"
>>>
>>> -#include "gt/intel_gt.h"
>>> -
>>> +#include "intel_context.h"
>>> #include "intel_engine.h"
>>> #include "intel_engine_pm.h"
>>> #include "intel_engine_pool.h"
>>> #include "intel_engine_user.h"
>>> -#include "intel_context.h"
>>> +#include "intel_gt.h"
>>> +#include "intel_gt_requests.h"
>>> #include "intel_lrc.h"
>>> #include "intel_reset.h"
>>> #include "intel_ring.h"
>>> @@ -617,6 +617,7 @@ static int intel_engine_setup_common(struct intel_engine_cs *engine)
>>> intel_engine_init_execlists(engine);
>>> intel_engine_init_cmd_parser(engine);
>>> intel_engine_init__pm(engine);
>>> + intel_engine_init_retire(engine);
>>>
>>> intel_engine_pool_init(&engine->pool);
>>>
>>> @@ -839,6 +840,7 @@ void intel_engine_cleanup_common(struct intel_engine_cs *engine)
>>>
>>> cleanup_status_page(engine);
>>>
>>> + intel_engine_fini_retire(engine);
>>> intel_engine_pool_fini(&engine->pool);
>>> intel_engine_fini_breadcrumbs(engine);
>>> intel_engine_cleanup_cmd_parser(engine);
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> index 758f0e8ec672..17f1f1441efc 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
>>> @@ -451,6 +451,14 @@ struct intel_engine_cs {
>>>
>>> struct intel_engine_execlists execlists;
>>>
>>> + /*
>>> + * Keep track of completed timelines on this engine for early
>>> + * retirement with the goal of quickly enabling powersaving as
>>> + * soon as the engine is idle.
>>> + */
>>> + struct intel_timeline *retire;
>>> + struct work_struct retire_work;
>>> +
>>> /* status_notifier: list of callbacks for context-switch changes */
>>> struct atomic_notifier_head context_status_notifier;
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>> index 4dc3cbeb1b36..4a98fefdf915 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
>>> @@ -29,6 +29,80 @@ static void flush_submission(struct intel_gt *gt)
>>> intel_engine_flush_submission(engine);
>>> }
>>>
>>> +static void engine_retire(struct work_struct *work)
>>> +{
>>> + struct intel_engine_cs *engine =
>>> + container_of(work, typeof(*engine), retire_work);
>>> + struct intel_timeline *tl = xchg(&engine->retire, NULL);
>>
>> Shouldn't this be atomic_xchg to avoid racing with add_retire?
>>
>>> +
>>> + do {
>>> + struct intel_timeline *next = xchg(&tl->retire, NULL);
>>
>> Here as well?
>
> xchg() are always locked.
>
> atomic_xchg() operates on atomic_t; xchg() works on any variable, like
> cmpxchg().
>
>>> +
>>> + /*
>>> + * Our goal here is to retire _idle_ timelines as soon as
>>> + * possible (as they are idle, we do not expect userspace
>>> + * to be cleaning up anytime soon).
>>> + *
>>> + * If the tl->active_count is already zero, someone else
>>> + * should have retired the timeline. Equally if the timeline
>>> + * is currently locked, either it is being retired elsewhere
>>> + * or about to be!
>>> + */
>>> + if (atomic_read(&tl->active_count) &&
>>> + mutex_trylock(&tl->mutex)) {
>>> + retire_requests(tl);
>>> + mutex_unlock(&tl->mutex);
>>> + }
>>> + intel_timeline_put(tl);
>>> +
>>> + GEM_BUG_ON(!next);
>>> + tl = ptr_mask_bits(next, 1);
>>
>> You sometimes expect engine->retire to contain 0x1?
>
> Yes, imagine that we are submitting very fast such that we schedule_out
> the same context before the worker ran, we would then try to
> add_retire() the same timeline again. So I was using BIT(0) to tag an
> active element in the retirement list.
If we get a schedule_out on the same context before the retire worker has run, then add_retire() is a no-op because tl->retire is already set.
It could be a race between engine_retire() and add_retire(), where
engine->retire has been set to NULL, so the latter would set tl->retire
to NULL | BIT(0).
But BIT(0) is only ever filtered out and never acted upon anywhere, so I
am still lost.
So, from the opposite angle: what goes wrong if you drop all the BIT(0)
business?
>
>>> + } while (tl);
>>> +}
>>> +
>>> +static bool add_retire(struct intel_engine_cs *engine,
>>> + struct intel_timeline *tl)
>>> +{
>>> + struct intel_timeline *first = READ_ONCE(engine->retire);
>>> +
>>> + /*
>>> + * We open-code a llist here to include the additional tag [BIT(0)]
>>> + * so that we know when the timeline is already on a
>>> + * retirement queue: either this engine or another.
>>> + *
>>> + * However, we rely on that a timeline can only be active on a single
>>> + * engine at any one time and that add_retire() is called before the
>>> + * engine releases the timeline and transferred to another to retire.
>>> + */
>>> +
>>> + if (READ_ONCE(tl->retire)) /* already queued */
>>> + return false;
>>
>> Can't this go first in the function?
>
> Conceptually it is. And I made it so because I also decided against
> having the READ_ONCE() at the top.
But a READ_ONCE() is already at the top, just a different one, whose
result is discarded if this one is NULL.
>
>>> +
>>> + intel_timeline_get(tl);
>>> + do
>>> + tl->retire = ptr_pack_bits(first, 1, 1);
>>
>> Here you rely on assignment being atomic right?
>
> Ish. Here we rely on the timeline being owned by the engine so it cannot
> be submitted by another (and so schedule_out called) until this engine
> has released it.
>
> It is a weak point for generality, but the ordering is strong in
> execlists.
>
>>> + while (!try_cmpxchg(&engine->retire, &first, tl));
>>
>> So the loop is effectively creating a chain of timelines to retire on
>> this engine.
>>
>> What happens with virtual engines when a timeline goes to different
>> engine before (well or any single timeline context) the retire worker
>> runs? Ah okay, it gets re-assigned to the most recent engine.
>
> Right. The engine_retire() doesn't care which engine the timelines were
> run on, it's just a list of suspected idle timelines.
>
>> I am not sure about the BIT(0) business. It's always set on write so I
>> am not getting why it is useful.
>
> It's also set to 0 on consumption :)
Another thing: asserting that engine->retire is NULL after the
flush_work() in intel_engine_fini_retire() could be useful.
Regards,
Tvrtko