[PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure

Lizhi Hou lizhi.hou at amd.com
Thu Jun 5 08:51:52 UTC 2025


On 6/2/25 06:05, Jacek Lawrynowicz wrote:
> Hi,
>
> On 5/28/2025 7:53 PM, Lizhi Hou wrote:
>> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>>> From: Karol Wachowski <karol.wachowski at intel.com>
>>>
>>> Trigger full device recovery when the driver fails to restore device state
>>> via engine reset and resume operations. This is necessary because, even if
>>> submissions from a faulty context are blocked, the NPU may still process
>>> previously submitted faulty jobs if the engine reset fails to abort them.
>>> Such jobs can continue to generate faults and occupy device resources.
>>> When engine reset is ineffective, the only way to recover is to perform
>>> a full device recovery.
>>>
>>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>>> Cc: <stable at vger.kernel.org> # v6.15+
>>> Signed-off-by: Karol Wachowski <karol.wachowski at intel.com>
>>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz at linux.intel.com>
>>> ---
>>>    drivers/accel/ivpu/ivpu_job.c     | 6 ++++--
>>>    drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>>>    2 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>>> index 1c8e283ad9854..fae8351aa3309 100644
>>> --- a/drivers/accel/ivpu/ivpu_job.c
>>> +++ b/drivers/accel/ivpu/ivpu_job.c
>>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>>>            return;
>>>          if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>>> -        ivpu_jsm_reset_engine(vdev, 0);
>>> +        if (ivpu_jsm_reset_engine(vdev, 0))
>>> +            return;
>> Is it possible the context aborting is entered again before the full device recovery work is executed?
> This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ, and the first thing we do when triggering recovery is disable IRQs.
> The recovery work also flushes context_abort_work before starting to tear down everything, so we should be safe.
Reviewed-by: Lizhi Hou <lizhi.hou at amd.com>
>
> Regards,
> Jacek
>
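
For anyone following the thread, the ordering argument above can be checked
with a small userspace model. This is only a sketch under stated assumptions:
every model_* name is hypothetical and is not an ivpu driver symbol, the IRQ
and workqueue behaviour is reduced to plain flags, and it assumes (as
described in the reply above) that triggering recovery disables IRQs first and
that the recovery work flushes the abort work before tearing anything down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not actual ivpu driver code. */
struct model_dev {
	bool irq_enabled;          /* IRQ path may queue abort work */
	bool reset_fails;          /* simulate engine reset failure */
	bool abort_work_pending;   /* abort work queued, not yet run */
	bool recovery_pending;     /* recovery work queued, not yet run */
	bool torn_down;            /* recovery started tearing down state */
	int abort_runs;            /* times the abort worker executed */
	int aborts_after_teardown; /* abort runs that raced with teardown */
};

/* IRQ path: can only queue abort work while IRQs are enabled. */
static bool model_irq(struct model_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_work_pending = true;
	return true;
}

/* Assumed ordering: disable IRQs first, then queue recovery work. */
static void model_trigger_recovery(struct model_dev *d)
{
	d->irq_enabled = false;
	d->recovery_pending = true;
}

/* Abort worker: on reset failure, trigger recovery and bail out early. */
static void model_abort_work(struct model_dev *d)
{
	d->abort_work_pending = false;
	d->abort_runs++;
	if (d->torn_down)
		d->aborts_after_teardown++;
	if (d->reset_fails) {
		model_trigger_recovery(d);
		return;
	}
	/* ...remaining abort handling would go here... */
}

/* Recovery worker: flush pending abort work, then tear down. */
static void model_recovery_work(struct model_dev *d)
{
	d->recovery_pending = false;
	if (d->abort_work_pending)
		model_abort_work(d); /* stand-in for flushing the work item */
	assert(!d->abort_work_pending);
	d->torn_down = true;
}

/* Returns the number of abort runs that overlapped teardown, or -1 if a
 * late IRQ managed to queue abort work after recovery was triggered. */
static int model_scenario(void)
{
	struct model_dev d = { .irq_enabled = true, .reset_fails = true };
	bool requeued;

	model_irq(&d);            /* fault IRQ queues abort work */
	model_abort_work(&d);     /* reset fails, recovery is triggered */
	requeued = model_irq(&d); /* late IRQ arrives before recovery runs */
	model_recovery_work(&d);
	return requeued ? -1 : d.aborts_after_teardown;
}
```

In this model a late IRQ cannot queue the abort work again once recovery has
been triggered, and any abort work queued earlier runs to completion before
teardown begins. The real driver relies on kernel IRQ and workqueue
primitives, which this sketch does not attempt to reproduce.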