[PATCH v8 0/8] drm/xe/vf: Post-migration recovery of queues and jobs
Matthew Brost
matthew.brost at intel.com
Tue Aug 12 04:47:31 UTC 2025
On Fri, Aug 01, 2025 at 03:50:37AM +0200, Tomasz Lis wrote:
+Maintainers, public, consolidating threads.
IMO this series needs to be fully reverted, or at minimum the
problematic subset of patches, along with [1].
To start, it was brought to my attention that this code was not tested
due to a lack of internal infrastructure. I believe it is completely
unacceptable for untested code to be merged, especially for a complex
feature like this. After spending some time reviewing the feature, I
found clear bugs, and the overall software architecture appears
ill-conceived.
Quick overview of bugs / issues I have spotted:
- xe_exec_queue_jobs_ring_restore does not modify existing ring jobs’ GGTT
addresses; it instead populates new parts of the ring buffer. This not
only fails to accomplish what it intends, it could also overflow the
ring buffer for existing jobs.
- xe_exec_queue_jobs_ring_restore has clear layering violations by
reaching into q->guc state.
- Upon VF resume, nothing prevents new submissions from writing to GGTT
addresses that have been moved. I was looped into this via internal
patches proposing a locking model which IMO is non-viable.
Tomasz, please post those publicly for reference.
- During queue pause/unpause [3], nothing (i.e., a missing mutex) prevents
a UAF during the queue lookup in submission_state.exec_queue_lookup.
- xe_sched_submission_stop_async in [3] is not actually asynchronous (it
synchronously stops the scheduler workitems). It also doesn’t prevent
queued scheduler messages (driver side workitems) from being
processed; those may send GuC H2Gs over an inactive interface, or wait
on G2Hs that will never arrive because the CT interface/GuC is not live.
- xe_sched_submission_stop_async relies on GuC CTs being alive to make
forward progress; otherwise GuC CT timeouts can trigger a GT reset,
which by nature loses state, defeating this feature's goal.
- Even if xe_sched_submission_stop_async were actually asynchronous, GuC
H2Gs have bounded timeouts, so H2Gs can be lost. The submission model
assumes H2G/G2H are lossless or that losses are recoverable by
inspecting stored GuC submission-queue state (as with GT resets).
- Modifying GGTT addresses via CTs [4] or GuC WQs [5] is fragile. Any
interface change would likely break this, and any new interfaces that
use GGTT addresses would also need updates.
I propose this flow instead; the TL;DR is a modified GT reset flow
which understands that GuC state is not lost, but CT messages can be.
- Before calling xe_guc_submit_pause(), stop the CTs and drain/parse all
pending G2H messages. Parsing G2H stabilizes the GuC submission state;
stopping CTs prevents waits on disabled CTs and avoids timeouts due to
lack of space.
- xe_guc_submit_{pause,unpause}() should take submission_state.lock and
flip a bit in the global GuC submission state. This bit blocks all new
submissions and causes any waiters in the submission/message paths to
exit gracefully.
- xe_guc_submit_pause() stops everything synchronously, including TDRs.
This is required here because TDRs are scheduled on the same
single-threaded, ordered workqueue. We may need DRM scheduler hooks to
facilitate pausing/resuming TDRs.
- xe_guc_submit_pause() scans the GuC submission state and makes any
modifications needed for replay (e.g., adjust internal state, reset the
WQ head and tail to match, adjust LRC head/tail pointers to match,
etc.).
- Scan the CT buffer; convert anything that touches submission state into
a NOP and adjust G2H credits as needed. If the GuC lacks a NOP, we’ll
need to add one; alternatively, if feasible, we can discard the entire
H2G queue, which would be simpler.
- Adjust all GGTT addresses of non-submission-related objects.
- Restart the CTs and enable the VF in the GuC.
- xe_guc_submit_unpause() resubmits everything through the scheduler, with
ring-head pointer adjustment similar to [6]. (It’s possible the head
adjustment belongs in pause; I need a complete picture of how the GuC
replays preempted jobs—e.g., if the return address from BB start is an
exact ring address rather than relying on the ring head, the algorithm
would change.) The resubmission will re-register any queues whose
registration was lost, re-enable any queues whose scheduler-enable state
was lost, and resubmit the correct WQ items.
- xe_guc_submit_unpause() also replays any messages that were in flight
but lost due to VF resume (e.g., teardown queue, suspend/resume of
queues).
- xe_guc_submit_unpause() restarts all scheduler work items, including
TDRs.
Matt
[1] https://patchwork.freedesktop.org/series/148780/
[2] https://patchwork.freedesktop.org/patch/666681/?series=148982&rev=10
[3] https://patchwork.freedesktop.org/patch/666679/?series=148982&rev=10
[4] https://patchwork.freedesktop.org/patch/652446/?series=148780&rev=1
[5] https://patchwork.freedesktop.org/patch/666683/?series=148982&rev=10
[6] https://patchwork.freedesktop.org/series/152715/
> To support VF Migration, it is necessary to do fixups to any
> non-virtualized resources. These fixups need to be applied within
> VM, on the KMD working with VF.
>
> This series adds two fixup functions to the recovery worker:
> * for fixing xe_lrc structs within queues
> * for fixing xe_job structs and the commands they emit
> It also provides some performance and stability fixes - blocking
> submissions and resets while the fixups are being applied.
> In case of sub-allocator, it removes the cached GGTT addresses
> instead of implementing fixups for them.
>
> v2: Switched to update of addresses by xe_lrc_write_ctx_reg()
>     to avoid kzalloc(), renamed or moved few functions
> v3: Renamed and reordered parameters, added kerneldocs
> v4: Take job_list_lock, introduce a new atomic for reset
> blocking, add "refresh utilization buffer" patch
> v5: Replaced "Finish RESFIX by reset" patch with "Skip fixups
> before getting GGTT info", rebased "Refresh utilization buffer"
> patch
> v6: Rebased to changes in "Make multi-GT migration less error prone",
> used a scratch buffer, and added one more ring recovery patch
> v7: Used better matching atomic functions, fixed noop item at end of
>     WQ ring, added more exit conditions in wq ring
> v8: Improve error/warn logging, add propagation of errors,
> make enum for offsets
>
> Tomasz Lis (8):
> drm/xe/sa: Avoid caching GGTT address within the manager
> drm/xe/vf: Pause submissions during RESFIX fixups
> drm/xe: Block reset while recovering from VF migration
> drm/xe/vf: Rebase HWSP of all contexts after migration
> drm/xe/vf: Rebase MEMIRQ structures for all contexts after migration
> drm/xe/vf: Post migration, repopulate ring area for pending request
> drm/xe/vf: Refresh utilization buffer during migration recovery
> drm/xe/vf: Rebase exec queue parallel commands during migration
> recovery
>
> drivers/gpu/drm/xe/abi/guc_actions_abi.h | 8 ++
> drivers/gpu/drm/xe/xe_exec_queue.c | 48 +++++++
> drivers/gpu/drm/xe/xe_exec_queue.h | 4 +
> drivers/gpu/drm/xe/xe_gpu_scheduler.c | 13 ++
> drivers/gpu/drm/xe/xe_gpu_scheduler.h | 1 +
> drivers/gpu/drm/xe/xe_gt.c | 10 ++
> drivers/gpu/drm/xe/xe_gt_debugfs.c | 5 +-
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 14 ++
> drivers/gpu/drm/xe/xe_gt_sriov_vf.h | 1 +
> drivers/gpu/drm/xe/xe_guc_buf.c | 2 +-
> drivers/gpu/drm/xe/xe_guc_submit.c | 175 +++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_guc_submit.h | 9 ++
> drivers/gpu/drm/xe/xe_guc_types.h | 6 +
> drivers/gpu/drm/xe/xe_lrc.c | 107 ++++++++++++--
> drivers/gpu/drm/xe/xe_lrc.h | 9 ++
> drivers/gpu/drm/xe/xe_sa.c | 1 -
> drivers/gpu/drm/xe/xe_sa.h | 15 +-
> drivers/gpu/drm/xe/xe_sa_types.h | 1 -
> drivers/gpu/drm/xe/xe_sriov_vf.c | 78 +++++++++-
> drivers/gpu/drm/xe/xe_sriov_vf_ccs.c | 2 +-
> 20 files changed, 490 insertions(+), 19 deletions(-)
>
> --
> 2.25.1
>