[Intel-gfx] [PATCH 02/12] drm/i915/guc: Don't call ring_is_idle in GuC submission

Tue Jul 19 10:14:51 UTC 2022

On 19/07/2022 10:49, Tvrtko Ursulin wrote:
> 
> On 19/07/2022 01:09, John Harrison wrote:
>> On 7/18/2022 05:26, Tvrtko Ursulin wrote:
>>>
>>> On 13/07/2022 00:31, John.C.Harrison at Intel.com wrote:
>>>> From: Matthew Brost <matthew.brost at intel.com>
>>>>
>>>> The engine registers really shouldn't be touched during GuC submission
>>>> as the GuC owns the registers. Don't call ring_is_idle and tie
>>>
>>> Touch being just read and it is somehow harmful?
>> The registers are meaningless when GuC is controlling the submission. 
>> The i915 driver has no knowledge of what context is or is not 
>> executing on any given engine at any given time. So reading reading 
>> the ring registers is incorrect - it can lead to bad assumptions about 
>> what state the hardware is in.
> 
> Same is actually true with the execlists backend. The code in 
> ring_is_idle is not concerning itself with which context is running or 
> not. Just that the head/tail/ctl appear idle.
> 
> Problem/motivation appears to be on a higher than simply ring registers.
> 
> I am not claiming it makes sense with Guc and that it has to remain but 
> just suggesting for as a minimum clearer commit message.
> 
>>>> intel_engine_is_idle strictly to the engine pm.
>>>
>>> Strictly seems wrong - it is just ring_is_idle check that is replaced 
>>> and not the whole implementation of intel_engine_is_idle.
>>>
>>>> Because intel_engine_is_idle tied to the engine pm, retire requests
>>>> before checking intel_engines_are_idle in gt_drop_caches, and lastly
>>> Why is re-ordering important? I at least can't understand it. I hope 
>>> it's not working around IGT failures.
>> If requests are physically completed but not retired then they will be 
>> holding unnecessary PM references. So we need to flush those out 
>> before checking for idle.
> 
> And if they are not as someone passes in DROP_RESET_ACTIVE? They will 
> not retire and code still enters intel_engines_are_idle so that has to 
> work, no? Something does not align for me still.

With "not retire" I meant potentially not retire within 
I915_IDLE_ENGINES_TIMEOUT. I guess hack happens to work for some or all 
IGTs which use DROP_RESET_ACTIVE.

Does it also mean patch would fix that problem without touching 
intel_engine_is_idle/ring_is_idle - with just the re-ordering in 
gt_drop_caches?

Regards,

Tvrtko

> 
>>>> increase the timeout in gt_drop_caches for the intel_engines_are_idle
>>>> check.
>>>
>>> Same here - why?
>> @Matthew Brost - do you recall which particular tests were hitting an 
>> issue? I'm guessing gem_ctx_create? I believe that's the one that 
>> creates and destroys thousands of contexts. That is much slower with 
>> GuC (GuC communication required) than with execlists (i915 internal 
>> state change only).
> 
> And if that is a logically separate change please split the patch up.
> 
> Regards,
> 
> Tvrtko
> 
>>
>> John.
>>
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c | 13 +++++++++++++
>>>>   drivers/gpu/drm/i915/i915_debugfs.c       |  6 +++---
>>>>   drivers/gpu/drm/i915/i915_drv.h           |  2 +-
>>>>   3 files changed, 17 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>>>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>> index 283870c659911..959a7c92e8f4d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>>>> @@ -1602,6 +1602,9 @@ static bool ring_is_idle(struct 
>>>> intel_engine_cs *engine)
>>>>   {
>>>>       bool idle = true;
>>>>   +    /* GuC submission shouldn't access HEAD & TAIL via MMIO */
>>>> +    GEM_BUG_ON(intel_engine_uses_guc(engine));
>>>> +
>>>>       if (I915_SELFTEST_ONLY(!engine->mmio_base))
>>>>           return true;
>>>>   @@ -1668,6 +1671,16 @@ bool intel_engine_is_idle(struct 
>>>> intel_engine_cs *engine)
>>>>       if (!i915_sched_engine_is_empty(engine->sched_engine))
>>>>           return false;
>>>>   +    /*
>>>> +     * We shouldn't touch engine registers with GuC submission as 
>>>> the GuC
>>>> +     * owns the registers. Let's tie the idle to engine pm, at 
>>>> worst this
>>>> +     * function sometimes will falsely report non-idle when idle 
>>>> during the
>>>> +     * delay to retire requests or with virtual engines and a request
>>>> +     * running on another instance within the same class / submit 
>>>> mask.
>>>> +     */
>>>> +    if (intel_engine_uses_guc(engine))
>>>> +        return false;
>>>> +
>>>>       /* Ring stopped? */
>>>>       return ring_is_idle(engine);
>>>>   }
>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> index 94e5c29d2ee3a..ee5334840e9cb 100644
>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> @@ -654,13 +654,13 @@ gt_drop_caches(struct intel_gt *gt, u64 val)
>>>>   {
>>>>       int ret;
>>>>   +    if (val & DROP_RETIRE || val & DROP_RESET_ACTIVE)
>>>> +        intel_gt_retire_requests(gt);
>>>> +
>>>>       if (val & DROP_RESET_ACTIVE &&
>>>>           wait_for(intel_engines_are_idle(gt), 
>>>> I915_IDLE_ENGINES_TIMEOUT))
>>>>           intel_gt_set_wedged(gt);
>>>>   -    if (val & DROP_RETIRE)
>>>> -        intel_gt_retire_requests(gt);
>>>> -
>>>>       if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>           ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>           if (ret)
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h 
>>>> b/drivers/gpu/drm/i915/i915_drv.h
>>>> index c22f29c3faa0e..53c7474dde495 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -278,7 +278,7 @@ struct i915_gem_mm {
>>>>       u32 shrink_count;
>>>>   };
>>>>   -#define I915_IDLE_ENGINES_TIMEOUT (200) /* in ms */
>>>> +#define I915_IDLE_ENGINES_TIMEOUT (500) /* in ms */
>>>>     unsigned long i915_fence_context_timeout(const struct 
>>>> drm_i915_private *i915,
>>>>                        u64 context);
>>