[PATCH v4 0/7] Use DRM scheduler for delayed GT TLB invalidations

Matthew Brost matthew.brost at intel.com
Sun Aug 24 17:58:12 UTC 2025


On Fri, Aug 22, 2025 at 01:36:26PM -0700, Matthew Brost wrote:
> On Fri, Aug 22, 2025 at 01:39:18PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 11/08/2025 20:25, Matthew Brost wrote:
> > > On Mon, Aug 11, 2025 at 09:31:08AM +0100, Tvrtko Ursulin wrote:
> > > > 
> > > > On 24/07/2025 20:12, Matthew Brost wrote:
> > > > > 
> > > > > Use the DRM scheduler for delayed GT TLB invalidations, which properly
> > > > > fixes the issue raised in [1]. GT TLB fences have their own dma-fence
> > > > > context, so even if the invalidations are ordered, the dma-resv/DRM
> > > > > scheduler cannot squash the fences. This results in O(M*N*N) complexity
> > > > > in the garbage collector, where M is the number of ranges in the garbage
> > > > > collector and N is the number of pending GT TLB invalidations. After
> > > > > this change, the resulting complexity is O(M*C), where C is the number of
> > > > > TLB invalidation contexts (i.e., the number of (exec queue, GT) tuples)
> > > > > with an invalidation in flight.
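
As an aside, the folding that gives the O(M*C) bound is simply dma-resv and
drm_sched keeping at most one fence per dma-fence context when collecting
dependencies. A toy model of that collection step (plain C, not the Xe code,
names made up for illustration):

/* Toy model (not kernel code): dependency collection folds fences per context. */
#include <stddef.h>
#include <stdint.h>

struct fence { uint64_t context; uint64_t seqno; };

/* Keep at most one (the latest) fence per context, as the scheduler does. */
static size_t add_dep(struct fence *deps, size_t ndeps, struct fence f)
{
	for (size_t i = 0; i < ndeps; i++) {
		if (deps[i].context == f.context) {
			if (f.seqno > deps[i].seqno)
				deps[i] = f;	/* later fence supersedes earlier */
			return ndeps;		/* folded: dependency count unchanged */
		}
	}
	deps[ndeps] = f;			/* new context: one more dependency */
	return ndeps + 1;
}

With every invalidation on its own context nothing folds, so each of the M
ranges ends up scanning all N pending fences (quadratic in N); with one shared
context per (exec queue, GT) the list collapses to C entries.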
> > > > 
> > > > Does this series improve the performance of the TLB invalidations in
> > > > general?
> > > > 
> > > > I am asking because we are currently investigating a problem with Google
> > > > Chrome which apparently suffers a lot from latencies caused by TLB
> > > > invalidations piling up after a lot of tab switching causes a lot of vm bind
> > > > ioctl activity.
> > > 
> > > It doesn't improve the performance of the actual invalidation, rather
> > > the time to create a VM unbind job, as it avoids the O(M*N*N)
> > > complexity explosion. I originally saw this in a case of many SVM
> > > unbinds (i.e. a free / unmap is called on a large piece of memory), but
> > > many VM unbinds in a short period of time could cause this complexity
> > > explosion too.
> > 
> > You are saying it is just CPU utilisation and does not show up as
> > submission latency, or can it?
> > 
> > The soft lockup splat from
> > https://patchwork.freedesktop.org/patch/658370/?series=150188&rev=1 is due
> > to drm_sched_job_add_resv_dependencies() literally taking 26 seconds
> > because of a huge number of fences in the resv, or something else?
> > 
> 
> Yes—there were a huge number of fences, each with a different context.
> 
> The initial patch addressed the lockup but had issues; this series
> properly resolves the complexity problem.
> 
> I also greatly reduced the number of invalidations in the SVM case [1].
> That alone likely fixed the lockup, but I went ahead with this to
> address the underlying time-complexity issue.
> 
> [1] https://patchwork.freedesktop.org/series/150257/ 
> 
> > > By making all invalidations from the same (queue, GT) share a dma_fence
> > > context, dma-resv and drm_sched can coalesce them into a single
> > > dependency per (queue, GT). That keeps the dependency count bounded and
> > > eliminates the SVM panics I was seeing under heavy unbind stress.
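
Roughly, the per-(queue, GT) context sharing looks like this (a simplified
sketch, not the actual patch; the struct and function names are made up, and
locking of the seqno is elided):

#include <linux/dma-fence.h>
#include <linux/spinlock.h>

/* Hypothetical per-(exec queue, GT) invalidation state */
struct xe_tlb_inval_ctx {
	u64 fence_context;	/* shared dma-fence context */
	u64 next_seqno;
	spinlock_t lock;
};

static void xe_tlb_inval_ctx_init(struct xe_tlb_inval_ctx *ctx)
{
	ctx->fence_context = dma_fence_context_alloc(1);
	ctx->next_seqno = 1;
	spin_lock_init(&ctx->lock);
}

static void xe_tlb_inval_fence_init(struct xe_tlb_inval_ctx *ctx,
				    struct dma_fence *fence,
				    const struct dma_fence_ops *ops)
{
	/*
	 * Same context + monotonic seqno: dma_resv_add_fence() and
	 * drm_sched_job_add_dependency() then keep only the latest fence
	 * per (queue, GT), bounding the dependency count at one per context.
	 */
	dma_fence_init(fence, ops, &ctx->lock,
		       ctx->fence_context, ctx->next_seqno++);
}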
> > > 
> > > > 
> > > > Also, is it related to
> > > > https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2162 ?
> > > > 
> > > 
> > > Maybe. In this case, a flood of TLB invalidations overwhelmed the GuC,
> > > leaving it no time to do anything else (e.g., schedule contexts). It’s a
> > > multi-process test, so the driver-side complexity explosion likely
> > > doesn’t apply as this explosion is seen within a single VM. What we need
> > > is to hold TLB invalidations in the driver and coalesce them to issue
> > > fewer operations. I have patches for this, but I haven’t cleaned them up
> > > for posting yet—we’re in the middle of other TLB-invalidation refactors.
> > 
> > If you would be able to send something out, or share a branch, even if not
> > completely clean that would be cool. Because I should soon have a reproducer
> > for the Chrome tab switching issue and could report back on the real world
> > effects of both.
> > 
> 
> The last six patches in this branch [1] defer TLB invalidations in the
> driver when a watermark is exceeded, which should prevent overwhelming
> the GuC with excessive invalidation requests. They’re built on top of
> [2], which will hopefully merge shortly. It would be interesting to see
> whether this helps with the issue you’re seeing.
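
For reference, the deferral in that branch is conceptually along these lines
(a rough sketch, not the actual code; the watermark value, the structs, and
send_to_guc() are placeholders):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define TLB_INVAL_WATERMARK	32	/* placeholder threshold */

struct tlb_inval_req {
	struct list_head link;
	u64 start, end;		/* address range to invalidate */
};

void send_to_guc(struct tlb_inval_req *req);	/* placeholder for the real H2G send */

struct tlb_inval_queue {
	spinlock_t lock;
	unsigned int inflight;		/* invalidations the GuC is working on */
	struct list_head deferred;	/* held back until inflight drops */
};

static void tlb_inval_submit_or_defer(struct tlb_inval_queue *q,
				      struct tlb_inval_req *req)
{
	bool send;

	spin_lock(&q->lock);
	send = q->inflight < TLB_INVAL_WATERMARK;
	if (send)
		q->inflight++;
	else
		list_add_tail(&req->link, &q->deferred);
	spin_unlock(&q->lock);

	if (send)
		send_to_guc(req);
}

The completion path would then drain the deferred list, which is also the
natural place to merge overlapping ranges so the GuC sees fewer requests.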

My links are screwed up here...

Let's try this again. The branch is [4]; it is built on top of [3].

[4] https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-svn-perf-6-15-2025/-/tree/tlb_inval_rework?ref_type=heads

Matt

> 
> Matt
> 
> [2] https://patchwork.freedesktop.org/patch/658370/?series=150188&rev=1
> [3] https://patchwork.freedesktop.org/series/152022/
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > > > Admittedly, it's quite a lot of code, but the series includes extensive
> > > > > kernel documentation and clear code comments. It introduces a generic
> > > > > dependency scheduler that can be reused in the future and is logically
> > > > > much cleaner than the previous open-coded solution for delaying GT TLB
> > > > > invalidations until a bind job completes.
> > > > > 
> > > > > v2:
> > > > >    - Various cleanup covered in detail in change logs
> > > > >    - Use a per-GT ordered workqueue as DRM scheduler workqueue
> > > > >    - Remove unused ftrace points
> > > > > v3:
> > > > >    - Address Stuart's feedback
> > > > >    - Fix kernel doc, minor cleanups
> > > > > v4:
> > > > >    - Rebase
> > > > > 
> > > > > Matt
> > > > > 
> > > > > [1] https://patchwork.freedesktop.org/patch/658370/?series=150188&rev=1
> > > > > 
> > > > > Matthew Brost (7):
> > > > >     drm/xe: Explicitly mark migration queues with flag
> > > > >     drm/xe: Add generic dependency jobs / scheduler
> > > > >     drm/xe: Create ordered workqueue for GT TLB invalidation jobs
> > > > >     drm/xe: Add dependency scheduler for GT TLB invalidations to bind
> > > > >       queues
> > > > >     drm/xe: Add GT TLB invalidation jobs
> > > > >     drm/xe: Use GT TLB invalidation jobs in PT layer
> > > > >     drm/xe: Remove unused GT TLB invalidation trace points
> > > > > 
> > > > >    drivers/gpu/drm/xe/Makefile                 |   2 +
> > > > >    drivers/gpu/drm/xe/xe_dep_job_types.h       |  29 +++
> > > > >    drivers/gpu/drm/xe/xe_dep_scheduler.c       | 143 ++++++++++
> > > > >    drivers/gpu/drm/xe/xe_dep_scheduler.h       |  21 ++
> > > > >    drivers/gpu/drm/xe/xe_exec_queue.c          |  48 ++++
> > > > >    drivers/gpu/drm/xe/xe_exec_queue_types.h    |  15 ++
> > > > >    drivers/gpu/drm/xe/xe_gt_tlb_inval_job.c    | 274 ++++++++++++++++++++
> > > > >    drivers/gpu/drm/xe/xe_gt_tlb_inval_job.h    |  34 +++
> > > > >    drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   8 +
> > > > >    drivers/gpu/drm/xe/xe_gt_types.h            |   2 +
> > > > >    drivers/gpu/drm/xe/xe_migrate.c             |  42 ++-
> > > > >    drivers/gpu/drm/xe/xe_migrate.h             |  13 +
> > > > >    drivers/gpu/drm/xe/xe_pt.c                  | 178 +++++--------
> > > > >    drivers/gpu/drm/xe/xe_trace.h               |  16 --
> > > > >    14 files changed, 700 insertions(+), 125 deletions(-)
> > > > >    create mode 100644 drivers/gpu/drm/xe/xe_dep_job_types.h
> > > > >    create mode 100644 drivers/gpu/drm/xe/xe_dep_scheduler.c
> > > > >    create mode 100644 drivers/gpu/drm/xe/xe_dep_scheduler.h
> > > > >    create mode 100644 drivers/gpu/drm/xe/xe_gt_tlb_inval_job.c
> > > > >    create mode 100644 drivers/gpu/drm/xe/xe_gt_tlb_inval_job.h
> > > > > 
> > > > 
> > 

