[PATCH] accel/ivpu: Trigger device recovery on engine reset/resume failure

Jacek Lawrynowicz jacek.lawrynowicz at linux.intel.com
Mon Jun 2 13:05:26 UTC 2025


Hi,

On 5/28/2025 7:53 PM, Lizhi Hou wrote:
> 
> On 5/28/25 08:42, Jacek Lawrynowicz wrote:
>> From: Karol Wachowski <karol.wachowski at intel.com>
>>
>> Trigger full device recovery when the driver fails to restore device state
>> via engine reset and resume operations. This is necessary because, even if
>> submissions from a faulty context are blocked, the NPU may still process
>> previously submitted faulty jobs if the engine reset fails to abort them.
>> Such jobs can continue to generate faults and occupy device resources.
>> When engine reset is ineffective, the only way to recover is to perform
>> a full device recovery.
>>
>> Fixes: dad945c27a42 ("accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW")
>> Cc: <stable at vger.kernel.org> # v6.15+
>> Signed-off-by: Karol Wachowski <karol.wachowski at intel.com>
>> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz at linux.intel.com>
>> ---
>>   drivers/accel/ivpu/ivpu_job.c     | 6 ++++--
>>   drivers/accel/ivpu/ivpu_jsm_msg.c | 9 +++++++--
>>   2 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
>> index 1c8e283ad9854..fae8351aa3309 100644
>> --- a/drivers/accel/ivpu/ivpu_job.c
>> +++ b/drivers/accel/ivpu/ivpu_job.c
>> @@ -986,7 +986,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>>           return;
>>         if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
>> -        ivpu_jsm_reset_engine(vdev, 0);
>> +        if (ivpu_jsm_reset_engine(vdev, 0))
>> +            return;
> 
> Is it possible the context aborting is entered again before the full device recovery work is executed?

This is a good point, but ivpu_context_abort_work_fn() is triggered by an IRQ, and the first thing we do when triggering recovery is disable IRQs.
The recovery work also flushes context_abort_work before starting to tear everything down, so we should be safe.
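
To make the ordering argument concrete, here is a simplified user-space model of it. This is only a sketch and not the driver code: every model_* name and field is made up for illustration, and the "flush" is reduced to running the pending work item inline. It assumes just the two properties described above, namely that IRQs are disabled when recovery is triggered and that the recovery work flushes pending abort work before tearing anything down.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the device state; not the real ivpu structures. */
struct model_dev {
	bool irq_enabled;        /* may the IRQ path queue abort work? */
	bool abort_work_pending; /* abort work queued but not yet run */
	bool recovery_pending;   /* recovery work queued but not yet run */
	bool torn_down;          /* device state already destroyed */
	int abort_runs;          /* number of times abort work executed */
};

/* IRQ path: abort work can only be queued while IRQs are enabled. */
static bool model_irq_context_violation(struct model_dev *d)
{
	if (!d->irq_enabled)
		return false;
	d->abort_work_pending = true;
	return true;
}

/* Abort work body: must never run against a torn-down device. */
static void model_abort_work(struct model_dev *d)
{
	assert(!d->torn_down);
	d->abort_work_pending = false;
	d->abort_runs++;
}

/* Triggering recovery: disable IRQs first, then queue the recovery work. */
static void model_trigger_recovery(struct model_dev *d)
{
	d->irq_enabled = false;
	d->recovery_pending = true;
}

/* Recovery work: flush pending abort work, only then tear down. */
static void model_recovery_work(struct model_dev *d)
{
	if (d->abort_work_pending)
		model_abort_work(d); /* stands in for flushing the work item */
	d->torn_down = true;
	d->recovery_pending = false;
}
```

In this model, an abort queued just before recovery is triggered still runs exactly once, before teardown, and no new abort can be queued once IRQs are off.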

Regards,
Jacek


