[PATCH] accel/ivpu: Implement heartbeat-based TDR mechanism
Jacek Lawrynowicz
jacek.lawrynowicz at linux.intel.com
Wed Apr 23 07:23:51 UTC 2025
Hi,
On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>> From: Karol Wachowski <karol.wachowski at intel.com>
>>
>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>> The enhancement aims to improve the reliability of device hang detection by
>> monitoring heartbeat updates.
>>
>> Each progressing inference will update heartbeat counter allowing driver to
>> monitor its progression. Limit maximum number of reschedules when heartbeat
>> indicates progression to 30.
>
> Code looks good. However, why 30? This would artificially limit how long a job could run, no?
Yes, we still need a time-based limit. There may be workloads that are stuck in an infinite loop, for example.
With this patch, the maximum time a job can run is extended from 2 to 60 seconds.
We are not aware of any workloads that exceed this timeout at the moment.
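For illustration, the policy can be sketched roughly as follows. This is a simplified user-space model under stated assumptions, not the driver code: every name here (tdr_state, tdr_on_timeout, tdr_simulate, the TDR_* constants) is made up for the sketch, and only the limit of 30 and the 2-second period come from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the ivpu driver's symbols. */
#define TDR_TIMEOUT_MS      2000u /* one timeout period (2 s, per this thread) */
#define TDR_MAX_RESCHEDULES 30u   /* reschedule cap from the patch description */

struct tdr_state {
	uint64_t last_heartbeat;  /* heartbeat value seen at the previous expiry */
	unsigned int reschedules; /* extensions already granted to the job */
};

enum tdr_action { TDR_RESCHEDULE, TDR_RECOVER };

/*
 * Called each time the timeout expires while a job is still running.
 * The timeout is re-armed only if the heartbeat advanced since the last
 * expiry and the reschedule cap has not been reached yet.
 */
static enum tdr_action tdr_on_timeout(struct tdr_state *s, uint64_t heartbeat)
{
	bool progressed = heartbeat > s->last_heartbeat;

	s->last_heartbeat = heartbeat;
	if (progressed && s->reschedules < TDR_MAX_RESCHEDULES) {
		s->reschedules++;
		return TDR_RESCHEDULE; /* caller re-arms for TDR_TIMEOUT_MS */
	}
	return TDR_RECOVER; /* stalled heartbeat or cap reached */
}

/*
 * Drive the policy until it requests recovery and return how many
 * reschedules were granted along the way.
 */
static unsigned int tdr_simulate(bool heartbeat_advances)
{
	struct tdr_state s = { .last_heartbeat = 0, .reschedules = 0 };
	uint64_t heartbeat = 0;

	for (;;) {
		if (heartbeat_advances)
			heartbeat++; /* job made progress during this period */
		if (tdr_on_timeout(&s, heartbeat) == TDR_RECOVER)
			return s.reschedules;
	}
}
```

In this model a job whose heartbeat never moves is recovered at the first expiry, while a job that keeps making progress gets at most TDR_MAX_RESCHEDULES extensions before recovery is forced.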
Regards,
Jacek
More information about the dri-devel mailing list