✗ CI.checkpatch: warning for drm/xe/sched: stop re-submitting signalled jobs

Fri May 23 12:34:25 UTC 2025

== Series Details ==

Series: drm/xe/sched: stop re-submitting signalled jobs
URL   : https://patchwork.freedesktop.org/series/149426/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
202708c00696422fd217223bb679a353a5936e23
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit d7625d16d7650e67acaabbe35d6010f7af3a3490
Author: Matthew Auld <matthew.auld at intel.com>
Date:   Fri May 23 12:36:09 2025 +0100

    drm/xe/sched: stop re-submitting signalled jobs
    
    Customer is reporting a really subtle issue where we get random DMAR
    faults, hangs and other nasties for kernel migration jobs when stressing
    stuff like s2idle/s3/s4. The explosions seem to happen somewhere
    after resuming the system with splats looking something like:
    
    PM: suspend exit
    rfkill: input handler disabled
    xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0
    xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1]
    xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out
    
    The likely cause appears to be a race between suspend cancelling the
    worker that processes the free_job()'s, such that we still have pending
    jobs to be freed after the cancel. Following from this, on resume the
    pending_list will now contain at least one already complete job, but it
    looks like we call drm_sched_resubmit_jobs(), which will then call
    run_job() on everything still on the pending_list. But if the job was
    already complete, then all the resources tied to the job, like the bb
    itself, any memory that is being accessed, the iommu mappings etc. might
    be long gone since those are usually tied to the fence signalling.
    
    This scenario can be seen in ftrace when running a slightly modified
    xe_pm (kernel was only modified to inject artificial latency into
    free_job to make the race easier to hit):
    
    xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
    xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
    xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4
    xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3
    xe_exec_queue_stop:   dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3
    xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3
    xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
    xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
    .....
    xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13
    
    So the job_run() is clearly triggered twice, even though the first must
    have already signalled to completion during suspend. We can also see a
    CAT error after the re-submit.
    
    To prevent this, try to call xe_sched_stop() to forcefully remove
    anything on the pending_list that has already signalled, before we
    re-submit.
    
    Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856
    Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
    Signed-off-by: Matthew Auld <matthew.auld at intel.com>
    Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
    Cc: Matthew Brost <matthew.brost at intel.com>
    Cc: William Tseng <william.tseng at intel.com>
    Cc: <stable at vger.kernel.org> # v6.8+
+ /mt/dim checkpatch b93f7d5c722241ec6d23a3f03c99a844227c50db drm-intel
d7625d16d765 drm/xe/sched: stop re-submitting signalled jobs
-:16: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#16: 
xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0

total: 0 errors, 1 warnings, 0 checks, 7 lines checked
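
== Illustration (not part of the patch) ==

The idea in the commit message (drop anything on the pending_list that
has already signalled, then re-submit only what is left) can be sketched
in plain userspace C. This is a minimal sketch under stated assumptions,
not the actual xe or drm_sched code: struct job, prune_signalled() and
resubmit_pending() are hypothetical names invented for illustration, and
a simple singly linked list stands in for the scheduler's pending_list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scheduler job; not the real xe/drm_sched type. */
struct job {
	int seqno;
	bool signalled;   /* true once the job's fence has signalled */
	int run_count;    /* how many times the job has been run */
	struct job *next; /* next entry on the pending list */
};

/*
 * Unlink every already-signalled job from the pending list.
 * Returns the number of jobs removed.
 */
size_t prune_signalled(struct job **pending)
{
	size_t removed = 0;

	while (*pending) {
		struct job *j = *pending;

		if (j->signalled) {
			*pending = j->next; /* skip over the completed job */
			j->next = NULL;
			removed++;
		} else {
			pending = &j->next;
		}
	}
	return removed;
}

/*
 * Re-run every job still on the pending list.
 * Returns the number of jobs resubmitted.
 */
size_t resubmit_pending(struct job *pending)
{
	size_t resubmitted = 0;

	for (struct job *j = pending; j; j = j->next) {
		j->run_count++; /* models a second run of the job */
		resubmitted++;
	}
	return resubmitted;
}
```

With a list of three jobs where only the middle one has signalled,
calling prune_signalled() before resubmit_pending() leaves the completed
job's run_count untouched, which is the property the fix is after: a job
whose resources may already be gone is never run a second time.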