[PATCH] accel/ivpu: Implement heartbeat-based TDR mechanism

Jeff Hugo jeff.hugo at oss.qualcomm.com
Wed Apr 23 14:23:16 UTC 2025


On 4/23/2025 1:23 AM, Jacek Lawrynowicz wrote:
> Hi,
> 
> On 4/18/2025 5:27 PM, Jeffrey Hugo wrote:
>> On 4/16/2025 4:25 AM, Maciej Falkowski wrote:
>>> From: Karol Wachowski <karol.wachowski at intel.com>
>>>
>>> Introduce a heartbeat-based Timeout Detection and Recovery (TDR) mechanism.
>>> The enhancement aims to improve the reliability of device hang detection by
>>> monitoring heartbeat updates.
>>>
>>> Each progressing inference will update heartbeat counter allowing driver to
>>> monitor its progression. Limit maximum number of reschedules when heartbeat
>>> indicates progression to 30.
>>
>> Code looks good.  However, why 30?  This would artificially limit how long a job could run, no?
> 
> Yes, we still need a time based limit. There may be workloads that are stuck in infinite loop for example.
> With this patch the max time the job can run is extended from 2 to 60 seconds.
> We are not aware of any workloads that exceed this timeout at the moment.

Infinite loop vs something that just happens to be running long by 
design is a difficult problem.  60 seconds does not seem all that long 
to me.  Perhaps consider some kind of override so that if/when a 
workload comes along that needs more than 60 seconds, the user doesn't 
need to recompile their kernel to make it work?  I suspect that would be 
outside the scope of this change.

For this change, I think it would be good to add some info from your 
response.  I think the commit text would be improved by stating this 
increases the max runtime from 2 seconds to 60, and that this covers all 
known workloads.  Also, I think a comment on PM_TDR_HEARTBEAT_LIMIT that 
tells how long the limit is (60 seconds) would be helpful to future 
readers, instead of needing to parse through multiple functions and how 
they all interact.

With the commit text update -

Reviewed-by: Jeff Hugo <jeff.hugo at oss.qualcomm.com>

The PM_TDR_HEARTBEAT_LIMIT comment is optional to me.


More information about the dri-devel mailing list