[RFC PATCH 0/5] Only timeout jobs if they run longer than timeout period
Matthew Brost
matthew.brost at intel.com
Fri Jun 7 06:52:14 UTC 2024
Debugging [1] hit a known flaw in the job timeout mechanism: jobs time
out based on how long ago they were submitted to the GuC, not on how
long they have actually been running on the hardware. This series
attempts to fix that.
The algorithm is as follows:
- Copy the ctx timestamp from the LRC to a saved location at the
beginning of every job
- On TDR, kick jobs off the hardware via schedule disable so the ctx
timestamp is updated
- Compare the ctx timestamp to the saved ctx timestamp; if the job has
been running for less than the timeout period, re-enable scheduling and
restart the TDR
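
The comparison step above can be sketched roughly as follows. This is
only an illustration, not code from the series: the struct, enum and
function names are hypothetical, and it assumes the ctx timestamp is a
free-running 32-bit counter and that the timeout has already been
converted to the same tick units.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-job state; names are illustrative, not from the xe driver. */
struct job_sample {
	/* ctx timestamp copied from the LRC when the job began (assumed 32-bit) */
	uint32_t ctx_timestamp_at_start;
};

/* What the TDR should do after sampling the updated ctx timestamp. */
enum tdr_action {
	TDR_RESTART_TIMER,	/* job has not used its budget: re-enable scheduling */
	TDR_TIMEOUT_JOB,	/* job really ran for the full timeout period */
};

/*
 * Ticks the job has actually run on the hardware. Unsigned subtraction
 * gives the right answer across a single counter wrap.
 */
static uint32_t job_ticks_run(const struct job_sample *job,
			      uint32_t ctx_timestamp_now)
{
	return ctx_timestamp_now - job->ctx_timestamp_at_start;
}

/*
 * Decide on TDR, assuming scheduling was already disabled so that
 * ctx_timestamp_now reflects the latest value written back by hardware.
 */
static enum tdr_action tdr_decide(const struct job_sample *job,
				  uint32_t ctx_timestamp_now,
				  uint32_t timeout_ticks)
{
	if (job_ticks_run(job, ctx_timestamp_now) < timeout_ticks)
		return TDR_RESTART_TIMER;

	return TDR_TIMEOUT_JOB;
}
```

With a saved start value of 100 and a timeout of 1000 ticks, a sampled
timestamp of 150 restarts the timer, while 1100 times the job out.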
The series still needs some work, documented with FIXMEs, hence the
RFC. Let's agree on whether this is the right direction before putting
in more work.
Matt
[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/799
Matthew Brost (5):
  drm/xe: Add LRC ctx timestamp support functions
  drm/xe: Add MI_COPY_MEM_MEM GPU instruction definitions
  drm/xe: Emit ctx timestamp copy in ring ops
  drm/xe: Add ctx timestamp to LRC snapshot
  drm/xe: Sample ctx timestamp to determine if jobs have timed out
.../gpu/drm/xe/instructions/xe_mi_commands.h | 4 +
drivers/gpu/drm/xe/xe_guc_submit.c | 140 +++++++++++++-----
drivers/gpu/drm/xe/xe_lrc.c | 49 ++++++
drivers/gpu/drm/xe/xe_lrc.h | 5 +
drivers/gpu/drm/xe/xe_ring_ops.c | 21 +++
5 files changed, 186 insertions(+), 33 deletions(-)
--
2.34.1