[PATCH] accel/ivpu: Implement heartbeat-based TDR mechanism

Jacek Lawrynowicz jacek.lawrynowicz at linux.intel.com
Fri Apr 25 08:22:58 UTC 2025


Hi,

On 4/23/2025 4:23 PM, Jeff Hugo wrote:
> On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:
>> Hi,
>>
>> On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
>>> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>>>> From: Karol Wachowski <karol.wachowski at intel.com>
>>>>
>>>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>>>> The enhancement aims to improve the reliability of device hang detection by
>>>> monitoring heartbeat updates.
>>>>
>>>> Each progressing inference will update heartbeat counter allowing driver to
>>>> monitor its progression. Limit maximum number of reschedules when heartbeat
>>>> indicates progression to 30.
>>>
>>> Code looks good.  However, why 30?  This would artificially limit how long a job could run, no?
>>
>> Yes, we still need a time based limit. There may be workloads that are stuck in infinite loop for example.
>> With this patch the max time the job can run is extended from 2 to 60 seconds.
>> We are not aware of any workloads that exceed this timeout at the moment.
> 
> Infinite loop vs something that just happens to be running long by design is a difficult problem.  60 seconds does not seem all that long to me.  Perhaps consider some kind of override so that if/when a workload comes along that needs more than 60 seconds, the user doesn't need to recompile their kernel to make it work?  I suspect that would be outside the scope of this change.
> 
> For this change, I think it would be good to add some info from your response.  I think the commit text would be improved by stating this increases the max runtime from 2 seconds to 60, and that this covers all known workloads.  Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that tells how long the limit is (60 seconds) would be helpful to future readers, instead of needing to parse through multiple functions and how they all interact.
> 
> With the commit text update -
> 
> Reviewed-by: Jeff Hugo <jeff.hugo at oss.qualcomm.com>
> 
> The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.

OK, I will update the commit message.

There is a module param that can be used to increase the overall timeout (tdr_timeout_ms).
The total timeout is (tdr_timeout_ms * 30) but this is not too intuitive after this change.
I will figure better solution that's easier to use.


Regards,
Jacek



More information about the dri-devel mailing list