✗ CI.checkpatch: warning for drm/xe/sched: stop re-submitting signalled jobs (rev5)

Fri May 30 12:02:15 UTC 2025

== Series Details ==

Series: drm/xe/sched: stop re-submitting signalled jobs (rev5)
URL   : https://patchwork.freedesktop.org/series/149426/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
202708c00696422fd217223bb679a353a5936e23
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 3fca606b15d96965f3212939eec147ac039587ac
Author: Matthew Auld <matthew.auld at intel.com>
Date:   Wed May 28 12:33:29 2025 +0100

    drm/xe/sched: stop re-submitting signalled jobs
    
    Customer is reporting a really subtle issue where we get random DMAR
    faults, hangs and other nasties for kernel migration jobs when stressing
    stuff like s2idle/s3/s4. The explosions seems to happen somewhere
    after resuming the system with splats looking something like:
    
    PM: suspend exit
    rfkill: input handler disabled
    xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0
    xe 0000:00:02.0: [drm] GT0: Timedout job: seqno=24496, lrc_seqno=24496, guc_id=0, flags=0x13 in no process [-1]
    xe 0000:00:02.0: [drm] GT0: Kernel-submitted job timed out
    
    The likely cause appears to be a race between suspend cancelling the
    worker that processes the free_job()'s, such that we still have pending
    jobs to be freed after the cancel. Following from this, on resume the
    pending_list will now contain at least one already complete job, but it
    looks like we call drm_sched_resubmit_jobs(), which will then call
    run_job() on everything still on the pending_list. But if the job was
    already complete, then all the resources tied to the job, like the bb
    itself, any memory that is being accessed, the iommu mappings etc. might
    be long gone since those are usually tied to the fence signalling.
    
    This scenario can be seen in ftrace when running a slightly modified
    xe_pm (kernel was only modified to inject artificial latency into
    free_job to make the race easier to hit):
    
    xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
    xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
    xe_exec_queue_stop:   dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=1, guc_state=0x0, flags=0x4
    xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=0, guc_state=0x0, flags=0x3
    xe_exec_queue_stop:   dev=0000:00:02.0, 1:0x1, gt=1, width=1, guc_id=1, guc_state=0x0, flags=0x3
    xe_exec_queue_stop:   dev=0000:00:02.0, 4:0x1, gt=1, width=1, guc_id=2, guc_state=0x0, flags=0x3
    xe_exec_queue_resubmit: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x0, flags=0x13
    xe_sched_job_run: dev=0000:00:02.0, fence=0xffff888276cc8540, seqno=0, lrc_seqno=0, gt=0, guc_id=0, batch_addr=0x000000146910 ...
    .....
    xe_exec_queue_memory_cat_error: dev=0000:00:02.0, 3:0x2, gt=0, width=1, guc_id=0, guc_state=0x3, flags=0x13
    
    So the job_run() is clearly triggered twice for the same job, even
    though the first must have already signalled to completion during
    suspend. We can also see a CAT error after the re-submit.
    
    To prevent this try to call xe_sched_stop() to forcefully remove
    anything on the pending_list that has already signalled, before we
    re-submit.
    
    v2:
      - Make sure to re-arm the fence callbacks with sched_start().
    v3 (Matt B):
      - Stop using drm_sched_resubmit_jobs(), which appears to be deprecated
        and just open-code a simple loop such that we skip calling run_job()
        and anything already signalled.
    
    Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4856
    Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
    Signed-off-by: Matthew Auld <matthew.auld at intel.com>
    Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
    Cc: Matthew Brost <matthew.brost at intel.com>
    Cc: William Tseng <william.tseng at intel.com>
    Cc: <stable at vger.kernel.org> # v6.8+
    Reviewed-by: Matthew Brost <matthew.brost at intel.com>
    Reviewed-by: Tejas Upadhyay <tejas.upadhyay at intel.com>
+ /mt/dim checkpatch 5ee21c4d445a914860e59678f036866f80ebade4 drm-intel
3fca606b15d9 drm/xe/sched: stop re-submitting signalled jobs
-:16: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#16: 
xe 0000:00:02.0: [drm] GT0: Engine reset: engine_class=bcs, logical_mask: 0x2, guc_id=0

total: 0 errors, 1 warnings, 0 checks, 16 lines checked