[Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

Mon Aug 28 23:01:38 UTC 2023

On 8/23/2023 10:37, John Harrison wrote:
> On 8/23/2023 09:00, Daniel Vetter wrote:
>> On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
>>> On 8/11/2023 11:20, Zhanjun Dong wrote:
>>>> This attempts to avoid circular locking dependency between flush 
>>>> delayed
>>>> work and intel_gt_reset.
>>>> When intel_gt_reset was called, task will hold a lock.
>>>> To cacel delayed work here, the _sync version will also acquire a 
>>>> lock,
>>>> which might trigger the possible cirular locking dependency warning.
>>>> When intel_gt_reset called, reset_in_progress flag will be set, add 
>>>> code
>>>> to check the flag, call async verion if reset is in progress.
>>>>
>>>> Signed-off-by: Zhanjun Dong<zhanjun.dong at intel.com>
>>>> Cc: John Harrison<John.C.Harrison at Intel.com>
>>>> Cc: Andi Shyti<andi.shyti at linux.intel.com>
>>>> Cc: Daniel Vetter<daniel at ffwll.ch>
>>>> ---
>>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++-
>>>>    1 file changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index a0e3ef1c65d2..600388c849f7 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -1359,7 +1359,16 @@ static void 
>>>> guc_enable_busyness_worker(struct intel_guc *guc)
>>>>    static void guc_cancel_busyness_worker(struct intel_guc *guc)
>>>>    {
>>>> -    cancel_delayed_work_sync(&guc->timestamp.work);
>>>> +    /*
>>>> +     * When intel_gt_reset was called, task will hold a lock.
>>>> +     * To cacel delayed work here, the _sync version will also 
>>>> acquire a lock, which might
>>>> +     * trigger the possible cirular locking dependency warning.
>>>> +     * Check the reset_in_progress flag, call async verion if 
>>>> reset is in progress.
>>>> +     */
>>> This needs to explain in much more detail what is going on and why 
>>> it is not
>>> a problem. E.g.:
>>>
>>>     The busyness worker needs to be cancelled. In general that means
>>>     using the synchronous cancel version to ensure that an in-progress
>>>     worker will not keep executing beyond whatever is happening that
>>>     needs the cancel. E.g. suspend, driver unload, etc. However, in the
>>>     case of a reset, the synchronous version is not required and can
>>>     trigger a false deadlock detection warning.
>>>
>>>     The business worker takes the reset mutex to protect against resets
>>>     interfering with it. However, it does a trylock and bails out if 
>>> the
>>>     reset lock is already acquired. Thus there is no actual deadlock or
>>>     other concern with the worker running concurrently with a reset. So
>>>     an asynchronous cancel is safe in the case of a reset rather than a
>>>     driver unload or suspend type operation. On the other hand, if the
>>>     cancel_sync version is used when a reset is in progress then the
>>>     mutex deadlock detection sees the mutex being acquired through
>>>     multiple paths and complains.
>>>
>>>     So just don't bother. That keeps the detection code happy and is
>>>     safe because of the trylock code described above.
>> So why do we even need to cancel anything if it doesn't do anything 
>> while
>> the reset is in progress?
> It still needs to be cancelled. The worker only aborts if it is 
> actively executing concurrently with the reset. It might not start to 
> execute until after the reset has completed. And there is presumably a 
> reason why the cancel is being called, a reason not necessarily 
> related to resets at all. Leaving the worker to run arbitrarily after 
> the driver is expecting it to be stopped will lead to much worse 
> things than a fake lockdep splat, e.g. a use after free pointer deref.
>
> John.
@Daniel Vetter - ping? Is this explanation sufficient? Are you okay with 
this change now?

John.

>
>>
>> Just remove the cancel from the reset path as uneeded instead, and 
>> explain
>> why that's ok? Because that's defacto what the cancel_work with a
>> potential deadlock scenario for cancel_work_sync does, you either don't
>> need it at all, or the replacement creates a bug.
>> -Daniel
>>
>>>
>>> John.
>>>
>>>
>>>> +    if (guc_to_gt(guc)->uc.reset_in_progress)
>>>> +        cancel_delayed_work(&guc->timestamp.work);
>>>> +    else
>>>> + cancel_delayed_work_sync(&guc->timestamp.work);
>>>>    }
>>>>    static void __reset_guc_busyness_stats(struct intel_guc *guc)
>