[Intel-xe] [RFC PATCH 0/1] Long running jobs ideas

Wed Mar 29 00:09:57 UTC 2023

We discussed 2 options in a meeting today for long running jobs:

1. Return NULL in run_job, find another way to flow control the ring
and clean up after engine errors.

2. Don't export dma-fences from the DRM scheduler or xe_sched_job, use
lockdep to enforce these rules (we don't export currently but do not
have lockdep asserts).

This series is an attempt to code option #1, specifically the flow
controlling of the ring. This series is not tested at all.

I've reached the conclusion that option #1 doesn't work as all it does
is move a dma-fence from run_job to prepare_job and in either case the
dma-fences from the DRM scheduler can be endless. Based on the
discussion in [1], the only proper way to throttle submission is through
dma-fences too.
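
To illustrate why option #1 only relocates the fence, here is a minimal
userspace sketch, not the code in this series. All names (toy_fence,
toy_ring, toy_prepare_job, toy_run_job, toy_job_done) are hypothetical
stand-ins for the DRM scheduler callbacks and a dma-fence. It assumes
the ring is throttled by handing the scheduler a "ring has space" fence
from prepare_job while run_job returns NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dma-fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
	bool signaled;
};

/* Toy ring with a fixed number of job slots (hypothetical model). */
struct toy_ring {
	unsigned int size;		/* total slots in the ring */
	unsigned int used;		/* slots held by in-flight jobs */
	struct toy_fence space_fence;	/* signals when a slot frees up */
};

/*
 * Models prepare_job: return a fence the scheduler must wait on when
 * the ring is full, or NULL when the job may run right away.
 */
static struct toy_fence *toy_prepare_job(struct toy_ring *r)
{
	if (r->used < r->size)
		return NULL;

	r->space_fence.signaled = false;
	return &r->space_fence;
}

/*
 * Models run_job for a long running job: take a ring slot and return
 * NULL, i.e. no hardware fence is handed back to the scheduler.
 */
static struct toy_fence *toy_run_job(struct toy_ring *r)
{
	r->used++;
	return NULL;
}

/*
 * Models job completion: release the slot and signal the space fence.
 * If a long running job never completes, this is never called and the
 * fence returned by toy_prepare_job() never signals.
 */
static void toy_job_done(struct toy_ring *r)
{
	r->used--;
	r->space_fence.signaled = true;
}
```

In this model, with a ring of size 1, the second toy_prepare_job() call
returns a fence that only signals once the first job finishes, so the
scheduler still waits on a fence whose signaling depends on a long
running job.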

Based on this, IMO the only sensible solution for long running jobs is
option #2. We likely should update the DRM scheduler with a flag so when
it creates fences it uses lockdep to enforce that these fences cannot be
exported. Likewise we should also apply these rules to job->fence in Xe.
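
A rough userspace sketch of the flag idea follows; it is only a model,
not a proposed implementation. The no_export field and
toy_fence_export() are hypothetical names, and the error return stands
in for what would be a lockdep-style assert in the kernel.

```c
#include <stdbool.h>

/* Toy stand-in for a scheduler fence; NOT the kernel's struct dma_fence. */
struct toy_sched_fence {
	bool no_export;	/* hypothetical flag: set for long running jobs */
};

/*
 * Hypothetical export path. Returns 0 when the fence may leave the
 * driver, -1 when it is marked non-exportable. In the kernel the
 * failing branch would be an assert rather than a return code.
 */
static int toy_fence_export(const struct toy_sched_fence *f)
{
	if (f->no_export)
		return -1;

	return 0;
}
```

The point of the flag in this sketch is that the check lives in one
place, at the export boundary, rather than in every caller that
creates a fence.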

Matt

[1] https://patchwork.freedesktop.org/patch/525461/?series=114772&rev=2

Matthew Brost (1):
  drm/xe: Return NULL in run_job for long running jobs

 drivers/gpu/drm/xe/xe_engine.c          | 54 ++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_engine.h          |  6 +++
 drivers/gpu/drm/xe/xe_engine_types.h    |  3 ++
 drivers/gpu/drm/xe/xe_guc_submit.c      | 31 ++++++++++++--
 drivers/gpu/drm/xe/xe_sched_job.c       |  5 +++
 drivers/gpu/drm/xe/xe_sched_job_types.h |  2 +
 6 files changed, 96 insertions(+), 5 deletions(-)

-- 
2.34.1