[Intel-gfx] [PATCH] i915/pmu: Wire GuC backend to per-client busyness

Thu Aug 4 07:25:12 UTC 2022

On 04/08/2022 02:21, Umesh Nerlige Ramappa wrote:
> On Tue, Aug 02, 2022 at 04:38:45PM -0700, Umesh Nerlige Ramappa wrote:
>> On Tue, Aug 02, 2022 at 09:41:38AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 01/08/2022 20:02, Umesh Nerlige Ramappa wrote:
>>>> On Wed, Jul 27, 2022 at 09:48:18AM +0100, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 27/07/2022 07:01, Umesh Nerlige Ramappa wrote:
>>>>>> On Fri, Jun 17, 2022 at 09:00:06AM +0100, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 16/06/2022 23:13, Nerlige Ramappa, Umesh wrote:
>>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>
>>>>>>>> GuC provides engine_id and last_switch_in ticks for an active 
>>>>>>>> context in
>>>>>>>> the pphwsp. The context image provides a 32 bit total ticks 
>>>>>>>> which is the
>>>>>>>> accumulated by the context (a.k.a. context[CTX_TIMESTAMP]). This
>>>>>>>> information is used to calculate the context busyness as follows:
>>>>>>>>
>>>>>>>> If the engine_id is valid, then busyness is the sum of 
>>>>>>>> accumulated total
>>>>>>>> ticks and active ticks. Active ticks is calculated with current 
>>>>>>>> gt time
>>>>>>>> as reference.
>>>>>>>>
>>>>>>>> If engine_id is invalid, busyness is equal to accumulated total 
>>>>>>>> ticks.
>>>>>>>>
>>>>>>>> Since KMD (CPU) retrieves busyness data from 2 sources - GPU and 
>>>>>>>> GuC, a
>>>>>>>> potential race was highlighted in an earlier review that can 
>>>>>>>> lead to
>>>>>>>> double accounting of busyness. While the solution to this is a wip,
>>>>>>>> busyness is still usable for platforms running GuC submission.
>>>>>>>>
>>>>>>>> v2: (Tvrtko)
>>>>>>>> - Use COPS_RUNTIME_ACTIVE_TOTAL
>>>>>>>> - Add code comment for the race
>>>>>>>> - Undo local variables initializations
>>>>>>>>
>>>>>>>> v3:
>>>>>>>> - Add support for virtual engines based on
>>>>>>>>   https://patchwork.freedesktop.org/series/105227/
>>>>>>>>
>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>> Signed-off-by: Umesh Nerlige Ramappa 
>>>>>>>> <umesh.nerlige.ramappa at intel.com>
>>>>>>>> ---
>>>>>>>>  drivers/gpu/drm/i915/gt/intel_context.c       | 12 +++-
>>>>>>>>  drivers/gpu/drm/i915/gt/intel_context.h       |  6 +-
>>>>>>>>  drivers/gpu/drm/i915/gt/intel_context_types.h |  6 ++
>>>>>>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  5 ++
>>>>>>>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 65 
>>>>>>>> ++++++++++++++++++-
>>>>>>>>  drivers/gpu/drm/i915/i915_drm_client.c        |  6 +-
>>>>>>>>  6 files changed, 89 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_context.c
>>>>>>>> index 4070cb5711d8..4a84146710e0 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>>>>>>>> @@ -576,16 +576,24 @@ void 
>>>>>>>> intel_context_bind_parent_child(struct intel_context *parent,
>>>>>>>>      child->parallel.parent = parent;
>>>>>>>>  }
>>>>>>>> -u64 intel_context_get_total_runtime_ns(const struct 
>>>>>>>> intel_context *ce)
>>>>>>>> +u64 intel_context_get_total_runtime_ns(struct intel_context *ce)
>>>>>>>>  {
>>>>>>>>      u64 total, active;
>>>>>>>> +    if (ce->ops->update_stats)
>>>>>>>> +        ce->ops->update_stats(ce);
>>>>>>>> +
>>>>>>>>      total = ce->stats.runtime.total;
>>>>>>>>      if (ce->ops->flags & COPS_RUNTIME_CYCLES)
>>>>>>>>          total *= ce->engine->gt->clock_period_ns;
>>>>>>>>      active = READ_ONCE(ce->stats.active);
>>>>>>>> -    if (active)
>>>>>>>> +    /*
>>>>>>>> +     * When COPS_RUNTIME_ACTIVE_TOTAL is set for ce->cops, the 
>>>>>>>> backend
>>>>>>>> +     * already provides the total active time of the context, 
>>>>>>>> so skip this
>>>>>>>> +     * calculation when this flag is set.
>>>>>>>> +     */
>>>>>>>> +    if (active && !(ce->ops->flags & COPS_RUNTIME_ACTIVE_TOTAL))
>>>>>>>>          active = intel_context_clock() - active;
>>>>>>>>      return total + active;
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h 
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_context.h
>>>>>>>> index b7d3214d2cdd..5fc7c19ab29b 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_context.h
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
>>>>>>>> @@ -56,7 +56,7 @@ static inline bool 
>>>>>>>> intel_context_is_parent(struct intel_context *ce)
>>>>>>>>      return !!ce->parallel.number_children;
>>>>>>>>  }
>>>>>>>> -static inline bool intel_context_is_pinned(struct intel_context 
>>>>>>>> *ce);
>>>>>>>> +static inline bool intel_context_is_pinned(const struct 
>>>>>>>> intel_context *ce);
>>>>>>>>  static inline struct intel_context *
>>>>>>>>  intel_context_to_parent(struct intel_context *ce)
>>>>>>>> @@ -116,7 +116,7 @@ static inline int 
>>>>>>>> intel_context_lock_pinned(struct intel_context *ce)
>>>>>>>>   * Returns: true if the context is currently pinned for use by 
>>>>>>>> the GPU.
>>>>>>>>   */
>>>>>>>>  static inline bool
>>>>>>>> -intel_context_is_pinned(struct intel_context *ce)
>>>>>>>> +intel_context_is_pinned(const struct intel_context *ce)
>>>>>>>>  {
>>>>>>>>      return atomic_read(&ce->pin_count);
>>>>>>>>  }
>>>>>>>> @@ -351,7 +351,7 @@ intel_context_clear_nopreempt(struct 
>>>>>>>> intel_context *ce)
>>>>>>>>      clear_bit(CONTEXT_NOPREEMPT, &ce->flags);
>>>>>>>>  }
>>>>>>>> -u64 intel_context_get_total_runtime_ns(const struct 
>>>>>>>> intel_context *ce);
>>>>>>>> +u64 intel_context_get_total_runtime_ns(struct intel_context *ce);
>>>>>>>>  u64 intel_context_get_avg_runtime_ns(struct intel_context *ce);
>>>>>>>>  static inline u64 intel_context_clock(void)
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h 
>>>>>>>> b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>>>>>>> index 09f82545789f..797bb4242c18 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_context_types.h
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
>>>>>>>> @@ -38,6 +38,9 @@ struct intel_context_ops {
>>>>>>>>  #define COPS_RUNTIME_CYCLES_BIT 1
>>>>>>>>  #define COPS_RUNTIME_CYCLES BIT(COPS_RUNTIME_CYCLES_BIT)
>>>>>>>> +#define COPS_RUNTIME_ACTIVE_TOTAL_BIT 2
>>>>>>>> +#define COPS_RUNTIME_ACTIVE_TOTAL 
>>>>>>>> BIT(COPS_RUNTIME_ACTIVE_TOTAL_BIT)
>>>>>>>> +
>>>>>>>>      int (*alloc)(struct intel_context *ce);
>>>>>>>>      void (*ban)(struct intel_context *ce, struct i915_request 
>>>>>>>> *rq);
>>>>>>>> @@ -55,6 +58,8 @@ struct intel_context_ops {
>>>>>>>>      void (*sched_disable)(struct intel_context *ce);
>>>>>>>> +    void (*update_stats)(struct intel_context *ce);
>>>>>>>> +
>>>>>>>>      void (*reset)(struct intel_context *ce);
>>>>>>>>      void (*destroy)(struct kref *kref);
>>>>>>>> @@ -146,6 +151,7 @@ struct intel_context {
>>>>>>>>              struct ewma_runtime avg;
>>>>>>>>              u64 total;
>>>>>>>>              u32 last;
>>>>>>>> +            u64 start_gt_clk;
>>>>>>>>              I915_SELFTEST_DECLARE(u32 num_underflow);
>>>>>>>>              I915_SELFTEST_DECLARE(u32 max_underflow);
>>>>>>>>          } runtime;
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>>>>>>> index b3c9a9327f76..6231ad03e4eb 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>>>>>>>> @@ -196,6 +196,11 @@ static inline u8 
>>>>>>>> guc_class_to_engine_class(u8 guc_class)
>>>>>>>>      return guc_class_engine_class_map[guc_class];
>>>>>>>>  }
>>>>>>>> +/* Per context engine usage stats: */
>>>>>>>> +#define PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO    (0x500 / sizeof(u32))
>>>>>>>> +#define PPHWSP_GUC_CONTEXT_USAGE_STAMP_HI 
>>>>>>>> (PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO + 1)
>>>>>>>> +#define PPHWSP_GUC_CONTEXT_USAGE_ENGINE_ID 
>>>>>>>> (PPHWSP_GUC_CONTEXT_USAGE_STAMP_HI + 1)
>>>>>>>> +
>>>>>>>>  /* Work item for submitting workloads into work queue of GuC. */
>>>>>>>>  struct guc_wq_item {
>>>>>>>>      u32 header;
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> index 5a1dfacf24ea..cbf3cbb983ce 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> @@ -378,7 +378,7 @@ static inline void 
>>>>>>>> set_context_guc_id_invalid(struct intel_context *ce)
>>>>>>>>      ce->guc_id.id = GUC_INVALID_CONTEXT_ID;
>>>>>>>>  }
>>>>>>>> -static inline struct intel_guc *ce_to_guc(struct intel_context 
>>>>>>>> *ce)
>>>>>>>> +static inline struct intel_guc *ce_to_guc(const struct 
>>>>>>>> intel_context *ce)
>>>>>>>>  {
>>>>>>>>      return &ce->engine->gt->uc.guc;
>>>>>>>>  }
>>>>>>>> @@ -1323,13 +1323,16 @@ static void 
>>>>>>>> __update_guc_busyness_stats(struct intel_guc *guc)
>>>>>>>>      spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>>>>>>>  }
>>>>>>>> +static void __guc_context_update_clks(struct intel_context *ce);
>>>>>>>>  static void guc_timestamp_ping(struct work_struct *wrk)
>>>>>>>>  {
>>>>>>>>      struct intel_guc *guc = container_of(wrk, typeof(*guc),
>>>>>>>>                           timestamp.work.work);
>>>>>>>>      struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
>>>>>>>>      struct intel_gt *gt = guc_to_gt(guc);
>>>>>>>> +    struct intel_context *ce;
>>>>>>>>      intel_wakeref_t wakeref;
>>>>>>>> +    unsigned long index;
>>>>>>>>      int srcu, ret;
>>>>>>>>      /*
>>>>>>>> @@ -1343,6 +1346,10 @@ static void guc_timestamp_ping(struct 
>>>>>>>> work_struct *wrk)
>>>>>>>>      with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
>>>>>>>>          __update_guc_busyness_stats(guc);
>>>>>>>> +    /* adjust context stats for overflow */
>>>>>>>> +    xa_for_each(&guc->context_lookup, index, ce)
>>>>>>>> +        __guc_context_update_clks(ce);
>>>>>>>> +
>>>>>>>>      intel_gt_reset_unlock(gt, srcu);
>>>>>>>>      mod_delayed_work(system_highpri_wq, &guc->timestamp.work,
>>>>>>>> @@ -1405,6 +1412,56 @@ void intel_guc_busyness_unpark(struct 
>>>>>>>> intel_gt *gt)
>>>>>>>>               guc->timestamp.ping_delay);
>>>>>>>>  }
>>>>>>>> +static void __guc_context_update_clks(struct intel_context *ce)
>>>>>>>> +{
>>>>>>>> +    struct intel_guc *guc = ce_to_guc(ce);
>>>>>>>> +    struct intel_gt *gt = ce->engine->gt;
>>>>>>>> +    u32 *pphwsp, last_switch, engine_id;
>>>>>>>> +    u64 start_gt_clk, active;
>>>>>>>> +    unsigned long flags;
>>>>>>>> +    ktime_t unused;
>>>>>>>> +
>>>>>>>> +    spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * GPU updates ce->lrc_reg_state[CTX_TIMESTAMP] when 
>>>>>>>> context is switched
>>>>>>>> +     * out, however GuC updates PPHWSP offsets below. Hence KMD 
>>>>>>>> (CPU)
>>>>>>>> +     * relies on GuC and GPU for busyness calculations. Due to 
>>>>>>>> this, A
>>>>>>>> +     * potential race was highlighted in an earlier review that 
>>>>>>>> can lead to
>>>>>>>> +     * double accounting of busyness. While the solution to 
>>>>>>>> this is a wip,
>>>>>>>> +     * busyness is still usable for platforms running GuC 
>>>>>>>> submission.
>>>>>>>> +     */
>>>>>>>> +    pphwsp = ((void *)ce->lrc_reg_state) - LRC_STATE_OFFSET;
>>>>>>>> +    last_switch = 
>>>>>>>> READ_ONCE(pphwsp[PPHWSP_GUC_CONTEXT_USAGE_STAMP_LO]);
>>>>>>>> +    engine_id = 
>>>>>>>> READ_ONCE(pphwsp[PPHWSP_GUC_CONTEXT_USAGE_ENGINE_ID]);
>>>>>>>> +
>>>>>>>> +    guc_update_pm_timestamp(guc, &unused);
>>>>>>>> +
>>>>>>>> +    if (engine_id != 0xffffffff && last_switch) {
>>>>>>>> +        start_gt_clk = READ_ONCE(ce->stats.runtime.start_gt_clk);
>>>>>>>> +        __extend_last_switch(guc, &start_gt_clk, last_switch);
>>>>>>>> +        active = intel_gt_clock_interval_to_ns(gt, 
>>>>>>>> guc->timestamp.gt_stamp - start_gt_clk);
>>>>>>>> +        WRITE_ONCE(ce->stats.runtime.start_gt_clk, start_gt_clk);
>>>>>>>> +        WRITE_ONCE(ce->stats.active, active);
>>>>>>>> +    } else {
>>>>>>>> +        lrc_update_runtime(ce);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void guc_context_update_stats(struct intel_context *ce)
>>>>>>>> +{
>>>>>>>> +    if (!intel_context_pin_if_active(ce)) {
>>>>>>>> +        WRITE_ONCE(ce->stats.runtime.start_gt_clk, 0);
>>>>>>>> +        WRITE_ONCE(ce->stats.active, 0);
>>>>>>>> +        return;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    __guc_context_update_clks(ce);
>>>>>>>> +    intel_context_unpin(ce);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static inline bool
>>>>>>>>  submission_disabled(struct intel_guc *guc)
>>>>>>>>  {
>>>>>>>> @@ -2585,6 +2642,7 @@ static void guc_context_unpin(struct 
>>>>>>>> intel_context *ce)
>>>>>>>>  {
>>>>>>>>      struct intel_guc *guc = ce_to_guc(ce);
>>>>>>>> +    lrc_update_runtime(ce);
>>>>>>>>      unpin_guc_id(guc, ce);
>>>>>>>>      lrc_unpin(ce);
>>>>>>>> @@ -3183,6 +3241,7 @@ static void remove_from_context(struct 
>>>>>>>> i915_request *rq)
>>>>>>>>  }
>>>>>>>>  static const struct intel_context_ops guc_context_ops = {
>>>>>>>> +    .flags = COPS_RUNTIME_CYCLES | COPS_RUNTIME_ACTIVE_TOTAL,
>>>>>>>>      .alloc = guc_context_alloc,
>>>>>>>>      .pre_pin = guc_context_pre_pin,
>>>>>>>> @@ -3199,6 +3258,8 @@ static const struct intel_context_ops 
>>>>>>>> guc_context_ops = {
>>>>>>>>      .sched_disable = guc_context_sched_disable,
>>>>>>>> +    .update_stats = guc_context_update_stats,
>>>>>>>> +
>>>>>>>>      .reset = lrc_reset,
>>>>>>>>      .destroy = guc_context_destroy,
>>>>>>>> @@ -3432,6 +3493,7 @@ static int 
>>>>>>>> guc_virtual_context_alloc(struct intel_context *ce)
>>>>>>>>  }
>>>>>>>>  static const struct intel_context_ops virtual_guc_context_ops = {
>>>>>>>> +    .flags = COPS_RUNTIME_CYCLES | COPS_RUNTIME_ACTIVE_TOTAL,
>>>>>>>>      .alloc = guc_virtual_context_alloc,
>>>>>>>>      .pre_pin = guc_virtual_context_pre_pin,
>>>>>>>> @@ -3447,6 +3509,7 @@ static const struct intel_context_ops 
>>>>>>>> virtual_guc_context_ops = {
>>>>>>>>      .exit = guc_virtual_context_exit,
>>>>>>>>      .sched_disable = guc_context_sched_disable,
>>>>>>>> +    .update_stats = guc_context_update_stats,
>>>>>>>
>>>>>>> There are also virtual_parent_context_ops and 
>>>>>>> virtual_child_context_ops - which means more test coverage is 
>>>>>>> needed..
>>>>>>
>>>>>> Trying to come back to this... The 
>>>>>> virtual_parent_context_ops/virtual_child_context_ops are used for 
>>>>>> parallel engines. GuC would only update the pphwsp of the parent 
>>>>>> context with the last_switched_in_time.
>>>>>>
>>>>>> In general, how should I report the busyness for a parallel engine?
>>>>>>
>>>>>> I would think it is busyness reported by parent context multiplied 
>>>>>> by width.
>>>>>
>>>>> That could a reasonable approximation but I can't say for certain. 
>>>>> Depends on the GuC scheduler implementation a bit. Like is anything 
>>>>> preventing child contexts from finishing their useful work ahead of 
>>>>> the parent context, or they are always strictly scheduled as one 
>>>>> entity and child engines are blocked from taking other workloads 
>>>>> until the parent is scheduled out?
>>>>
>>>> Correct, if a child finishes the work before parent/siblings for 
>>>> some reason, it cannot take up other work until all siblings are done.
>>>
>>> The only problem is that I guess one day that assumption might break 
>>> and we will "never" now. If you have some spare time it would be best 
>>> to add an IGT to verify this assumption, or at least put that work as 
>>> TODO in the backlog?
>>
>> I added some tests to IGT for parallel engine, but something is 
>> missing in the way I am submitting the batches to the parallel engine. 
>> I see some hangs, haven't had a chance to debug that. Will try to get 
>> to it and then post the updated i915 patches.
> 
> I think I may have to do the parallel engine testing later. Do you think 
> this patch alone is good enough for now? It does not enable context 
> busyness for parallel execution (which is just adding this 
> COPS_RUNTIME_CYCLES | COPS_RUNTIME_ACTIVE_TOTAL in the parent/child 
> context ops)
> 
> If so, okay to post a rebased version?

I think so. Just please file a jira for the outstanding work.

Thanks,

Tvrtko