[PATCH v8 0/8] drm/xe/vf: Post-migration recovery of queues and jobs
Matthew Brost
matthew.brost at intel.com
Tue Aug 12 04:47:31 UTC 2025
On Fri, Aug 01, 2025 at 03:50:37AM +0200, Tomasz Lis wrote:
+Maintainers, public, consolidating threads.
IMO this series needs to be fully reverted, or at minimum the
problematic subset of patches, along with [1].
To start, it was brought to my attention that this code was not tested
due to a lack of internal infrastructure. I believe it is completely
unacceptable for untested code to be merged, especially for a complex
feature like this. After spending some time reviewing the feature, I
found clear bugs, and the overall software architecture appears
ill-conceived.
Quick overview of bugs / issues I have spotted:
- xe_exec_queue_jobs_ring_restore does not modify existing ring jobs’ GGTT
addresses; it instead populates new parts of the ring buffer. This not
only fails to accomplish what it intends, it could also overflow the
ring buffer for existing jobs.
- xe_exec_queue_jobs_ring_restore has clear layering violations by
reaching into q->guc state.
- Upon VF resume, nothing prevents new submissions from writing to GGTT
addresses that have been moved. I was looped into this via internal
patches proposing a locking model which IMO is non-viable.
Tomasz, please post those publicly for reference.
- During queue pause/unpause [3], nothing (i.e., a missing mutex) prevents
a UAF during the queue lookup in submission_state.exec_queue_lookup.
- xe_sched_submission_stop_async in [3] is not actually asynchronous (it
synchronously stops the scheduler workitems). It also doesn’t prevent
queued scheduler messages (driver side workitems) from being
processed; those may send GuC H2Gs over an inactive interface, or wait
on G2Hs that will never arrive because the CT interface/GuC is not live.
- xe_sched_submission_stop_async relies on GuC CTs being alive to make
forward progress; otherwise GuC CT timeouts can trigger a GT reset,
which by nature loses state, defeating this feature's goal.
- Even if xe_sched_submission_stop_async were actually asynchronous, GuC
H2Gs have bounded timeouts, so H2Gs can be lost. The submission model
assumes H2G/G2H are lossless or that losses are recoverable by
inspecting stored GuC submission-queue state (as with GT resets).
- Modifying GGTT addresses via CTs [4] or GuC WQs [5] is fragile. Any
interface change would likely break this, and any new interfaces that
use GGTT addresses would also need updates.
I propose this flow instead; the TL;DR is a modified GT reset flow
which understands that GuC state is not lost, but CT messages can be.
- Before calling xe_guc_submit_pause(), stop the CTs and drain/parse all
pending G2H messages. Parsing G2H stabilizes the GuC submission state;
stopping CTs prevents waits on disabled CTs and avoids timeouts due to
lack of space.
- xe_guc_submit_{pause,unpause}() should take submission_state.lock and
flip a bit in the global GuC submission state. This bit blocks all new
submissions and causes any waiters in the submission/message paths to
exit gracefully.
- xe_guc_submit_pause() stops everything synchronously, including TDRs.
This is required here because TDRs are scheduled on the same
single-threaded, ordered workqueue. We may need DRM scheduler hooks to
facilitate pausing/resuming TDRs.
- xe_guc_submit_pause() scans the GuC submission state and makes any
modifications needed for replay (e.g., adjust internal state, reset the
WQ head and tail to match, adjust LRC head/tail pointers to match,
etc.).
- Scan the CT buffer; convert anything that touches submission state into
a NOP and adjust G2H credits as needed. If the GuC lacks a NOP, we’ll
need to add one; alternatively, if feasible, we can discard the entire
H2G queue, which would be simpler.
- Adjust all GGTT addresses of non-submission-related objects.
- Restart the CTs and enable the VF in the GuC.
- xe_guc_submit_unpause() resubmits everything through the scheduler, with
ring-head pointer adjustment similar to [6]. (It’s possible the head
adjustment belongs in pause; I need a complete picture of how the GuC
replays preempted jobs—e.g., if the return address from BB start is an
exact ring address rather than relying on the ring head, the algorithm
would change.) The resubmission will re-register any queues whose
registration was lost, re-enable any queues whose scheduler-enable state
was lost, and resubmit the correct WQ items.
- xe_guc_submit_unpause() also replays any messages that were in flight
but lost due to VF resume (e.g., teardown queue, suspend/resume of
queues).
- xe_guc_submit_unpause() restarts all scheduler work items, including
TDRs.
Matt
[1] https://patchwork.freedesktop.org/series/148780/
[2] https://patchwork.freedesktop.org/patch/666681/?series=148982&rev=10
[3] https://patchwork.freedesktop.org/patch/666679/?series=148982&rev=10
[4] https://patchwork.freedesktop.org/patch/652446/?series=148780&rev=1
[5] https://patchwork.freedesktop.org/patch/666683/?series=148982&rev=10
[6] https://patchwork.freedesktop.org/series/152715/
> To support VF Migration, it is necessary to do fixups to any
> non-virtualized resources. These fixups need to be applied within
> VM, on the KMD working with VF.
>
> This series adds two fixup functions to the recovery worker:
> * for fixing xe_lrc structs within queues
> * for fixing xe_job structs and the commands they emit
> It also provides some performance and stability fixes - blocking
> submissions and resets while the fixups are being applied.
> In case of sub-allocator, it removes the cached GGTT addresses
> instead of implementing fixups for them.
>
> v2: Switched to update of addresses by xe_lrc_write_ctx_reg()
>     to avoid kzalloc(), renamed or moved few functions
> v3: Renamed and reordered parameters, added kerneldocs
> v4: Take job_list_lock, introduce a new atomic for reset
> blocking, add "refresh utilization buffer" patch
> v5: Replaced "Finish RESFIX by reset" patch with "Skip fixups
> before getting GGTT info", rebased "Refresh utilization buffer"
> patch
> v6: Rebased to changes in "Make multi-GT migration less error prone",
> used a scratch buffer, and added one more ring recovery patch
> v7: Used better matching atomic functions, fixed noop item at end of
>     WQ ring, added more exit conditions in wq ring
> v8: Improve error/warn logging, add propagation of errors,
> make enum for offsets
>
> Tomasz Lis (8):
> drm/xe/sa: Avoid caching GGTT address within the manager
> drm/xe/vf: Pause submissions during RESFIX fixups
> drm/xe: Block reset while recovering from VF migration
> drm/xe/vf: Rebase HWSP of all contexts after migration
> drm/xe/vf: Rebase MEMIRQ structures for all contexts after migration
> drm/xe/vf: Post migration, repopulate ring area for pending request
> drm/xe/vf: Refresh utilization buffer during migration recovery
> drm/xe/vf: Rebase exec queue parallel commands during migration
> recovery
>
> drivers/gpu/drm/xe/abi/guc_actions_abi.h | 8 ++
> drivers/gpu/drm/xe/xe_exec_queue.c | 48 +++++++
> drivers/gpu/drm/xe/xe_exec_queue.h | 4 +
> drivers/gpu/drm/xe/xe_gpu_scheduler.c | 13 ++
> drivers/gpu/drm/xe/xe_gpu_scheduler.h | 1 +
> drivers/gpu/drm/xe/xe_gt.c | 10 ++
> drivers/gpu/drm/xe/xe_gt_debugfs.c | 5 +-
> drivers/gpu/drm/xe/xe_gt_sriov_vf.c | 14 ++
> drivers/gpu/drm/xe/xe_gt_sriov_vf.h | 1 +
> drivers/gpu/drm/xe/xe_guc_buf.c | 2 +-
> drivers/gpu/drm/xe/xe_guc_submit.c | 175 +++++++++++++++++++++++
> drivers/gpu/drm/xe/xe_guc_submit.h | 9 ++
> drivers/gpu/drm/xe/xe_guc_types.h | 6 +
> drivers/gpu/drm/xe/xe_lrc.c | 107 ++++++++++++--
> drivers/gpu/drm/xe/xe_lrc.h | 9 ++
> drivers/gpu/drm/xe/xe_sa.c | 1 -
> drivers/gpu/drm/xe/xe_sa.h | 15 +-
> drivers/gpu/drm/xe/xe_sa_types.h | 1 -
> drivers/gpu/drm/xe/xe_sriov_vf.c | 78 +++++++++-
> drivers/gpu/drm/xe/xe_sriov_vf_ccs.c | 2 +-
> 20 files changed, 490 insertions(+), 19 deletions(-)
>
> --
> 2.25.1
>