[Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
John Harrison
john.c.harrison at intel.com
Mon Aug 28 23:01:38 UTC 2023
On 8/23/2023 10:37, John Harrison wrote:
> On 8/23/2023 09:00, Daniel Vetter wrote:
>> On Tue, Aug 22, 2023 at 11:53:24AM -0700, John Harrison wrote:
>>> On 8/11/2023 11:20, Zhanjun Dong wrote:
>>>> This attempts to avoid circular locking dependency between flush
>>>> delayed
>>>> work and intel_gt_reset.
>>>> When intel_gt_reset was called, task will hold a lock.
>>>> To cacel delayed work here, the _sync version will also acquire a
>>>> lock,
>>>> which might trigger the possible cirular locking dependency warning.
>>>> When intel_gt_reset called, reset_in_progress flag will be set, add
>>>> code
>>>> to check the flag, call async verion if reset is in progress.
>>>>
>>>> Signed-off-by: Zhanjun Dong<zhanjun.dong at intel.com>
>>>> Cc: John Harrison<John.C.Harrison at Intel.com>
>>>> Cc: Andi Shyti<andi.shyti at linux.intel.com>
>>>> Cc: Daniel Vetter<daniel at ffwll.ch>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++++++++++-
>>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index a0e3ef1c65d2..600388c849f7 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -1359,7 +1359,16 @@ static void
>>>> guc_enable_busyness_worker(struct intel_guc *guc)
>>>> static void guc_cancel_busyness_worker(struct intel_guc *guc)
>>>> {
>>>> - cancel_delayed_work_sync(&guc->timestamp.work);
>>>> + /*
>>>> + * When intel_gt_reset was called, task will hold a lock.
>>>> + * To cacel delayed work here, the _sync version will also
>>>> acquire a lock, which might
>>>> + * trigger the possible cirular locking dependency warning.
>>>> + * Check the reset_in_progress flag, call async verion if
>>>> reset is in progress.
>>>> + */
>>> This needs to explain in much more detail what is going on and why
>>> it is not
>>> a problem. E.g.:
>>>
>>> The busyness worker needs to be cancelled. In general that means
>>> using the synchronous cancel version to ensure that an in-progress
>>> worker will not keep executing beyond whatever is happening that
>>> needs the cancel. E.g. suspend, driver unload, etc. However, in the
>>> case of a reset, the synchronous version is not required and can
>>> trigger a false deadlock detection warning.
>>>
>>> The business worker takes the reset mutex to protect against resets
>>> interfering with it. However, it does a trylock and bails out if
>>> the
>>> reset lock is already acquired. Thus there is no actual deadlock or
>>> other concern with the worker running concurrently with a reset. So
>>> an asynchronous cancel is safe in the case of a reset rather than a
>>> driver unload or suspend type operation. On the other hand, if the
>>> cancel_sync version is used when a reset is in progress then the
>>> mutex deadlock detection sees the mutex being acquired through
>>> multiple paths and complains.
>>>
>>> So just don't bother. That keeps the detection code happy and is
>>> safe because of the trylock code described above.
>> So why do we even need to cancel anything if it doesn't do anything
>> while
>> the reset is in progress?
> It still needs to be cancelled. The worker only aborts if it is
> actively executing concurrently with the reset. It might not start to
> execute until after the reset has completed. And there is presumably a
> reason why the cancel is being called, a reason not necessarily
> related to resets at all. Leaving the worker to run arbitrarily after
> the driver is expecting it to be stopped will lead to much worse
> things than a fake lockdep splat, e.g. a use after free pointer deref.
>
> John.
@Daniel Vetter - ping? Is this explanation sufficient? Are you okay with
this change now?
John.
>
>>
>> Just remove the cancel from the reset path as uneeded instead, and
>> explain
>> why that's ok? Because that's defacto what the cancel_work with a
>> potential deadlock scenario for cancel_work_sync does, you either don't
>> need it at all, or the replacement creates a bug.
>> -Daniel
>>
>>>
>>> John.
>>>
>>>
>>>> + if (guc_to_gt(guc)->uc.reset_in_progress)
>>>> + cancel_delayed_work(&guc->timestamp.work);
>>>> + else
>>>> + cancel_delayed_work_sync(&guc->timestamp.work);
>>>> }
>>>> static void __reset_guc_busyness_stats(struct intel_guc *guc)
>
More information about the Intel-gfx
mailing list