[RFC PATCH 0/8] ULLS for kernel submission of migration jobs

Matthew Brost matthew.brost at intel.com
Mon Aug 12 17:26:26 UTC 2024


On Mon, Aug 12, 2024 at 10:53:01AM +0200, Thomas Hellström wrote:
> Hi, Matt,
> 
> On Sun, 2024-08-11 at 19:47 -0700, Matthew Brost wrote:
> > Ultra low latency for kernel submission of migration jobs.
> > 
> > The basic idea is that faults (CPU or GPU) typically depend on
> > migration jobs. Faults should be addressed as quickly as possible,
> > but context switches via GuC on hardware are slow. To avoid context
> > switches, perform ULLS in the kernel for migration jobs on discrete
> > faulting devices with an LR VM open.
> > 
> > This is implemented by switching the migration layer to ULLS mode
> > upon opening an LR VM. In ULLS mode, migration jobs have a preamble
> > and postamble: the preamble clears the current semaphore value, and
> > the postamble waits for the next semaphore value. Each job
> > submission sets the current semaphore in memory, bypassing the GuC.
> > The net effect is that the migration execution queue never gets
> > switched off the hardware while an LR VM is open.
> > 
> > There may be concerns regarding power management, as the ring program
> > continuously runs on a copy engine, and a force wake reference to a
> > copy engine is held with an LR VM open.
> > 
> > The implementation has been lightly tested but seems to be working.
> > 
> > This approach will likely be put on hold until SVM is operational
> > with benchmarks, but it is being posted early for feedback and as a
> > public checkpoint.
> > 
> > Matt
> 
> The main concern I have with this is that, at least according to
> upstream discussions, pagefaults are so slow anyway that a performant
> stack needs to try extremely hard to avoid them using manual
> prefaults; if we do hit a GPU pagefault, we've already lost, and any
> migration latency optimization won't matter much.
> 

I agree that if pagefaults are being hit all the time we are in trouble
with respect to performance, but that doesn't mean we shouldn't try to
make servicing them as fast as possible when they do occur.
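
To make the handshake from the cover letter a bit more concrete, below
is a rough user-space model of the semaphore flow. It is not the xe
driver code: the engine thread, the payload array, and the
one-outstanding-job flow control are purely illustrative; only the
preamble-clear / postamble-wait / CPU-sets-semaphore shape matches what
the series describes.

/*
 * Model only: the copy engine never gets switched out, it just spins
 * on a semaphore in memory (the MI_SEMAPHORE_WAIT postamble) until the
 * KMD releases the next job by writing the semaphore directly, with no
 * GuC round trip.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_JOBS 4

static _Atomic unsigned int sem;	/* the semaphore dword in memory */
static int payload[NUM_JOBS + 1];	/* stand-in for ring contents */

static void *copy_engine(void *arg)
{
	(void)arg;

	for (unsigned int i = 1; i <= NUM_JOBS; i++) {
		/* Postamble of the previous job: wait for the next value */
		while (atomic_load_explicit(&sem, memory_order_acquire) != i)
			;	/* busy-wait, as MI_SEMAPHORE_WAIT would */

		/* Preamble of this job: clear the current semaphore value */
		atomic_store_explicit(&sem, 0, memory_order_release);

		/* "Execute" the migration job */
		printf("engine: job %u done (payload %d)\n", i, payload[i]);
	}
	return NULL;
}

int main(void)
{
	pthread_t engine;

	pthread_create(&engine, NULL, copy_engine, NULL);

	for (unsigned int i = 1; i <= NUM_JOBS; i++) {
		/* Model-only flow control: one outstanding release at a time */
		while (atomic_load_explicit(&sem, memory_order_acquire) != 0)
			;

		payload[i] = 100 + i;	/* write the job into the ring... */
		/* ...then release it by setting the semaphore in memory */
		atomic_store_explicit(&sem, i, memory_order_release);
	}

	pthread_join(engine, NULL);
	return 0;
}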

> Also, for power management, LR VM open is a very simple strategy, which
> is good, but shouldn't it be possible to hook that up to LR job
> running, similar to vm->preempt.rebind_deactivated?
>

That seems possible. In that scenario we'd hook the
xe_migrate_lr_vm_get / put calls [1] [2] and the runtime PM calls into
the LR VM activate / deactivate paths rather than the LR VM open /
close calls.
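
Roughly, such hooks might look like the sketch below. This is not
against the actual series: xe_vm_lr_activate() / xe_vm_lr_deactivate()
are hypothetical names for wherever the activate / deactivate
transitions end up (e.g. next to the paths that toggle
vm->preempt.rebind_deactivated), the xe_migrate_lr_vm_get() / put()
signatures are assumed from [1] / [2], and single-tile handling is
assumed for brevity.

static void xe_vm_lr_activate(struct xe_vm *vm)
{
	/* LR jobs (re)starting: hold PM and pin the copy engine for ULLS */
	xe_pm_runtime_get(vm->xe);
	xe_migrate_lr_vm_get(vm->xe->tiles[0].migrate);
}

static void xe_vm_lr_deactivate(struct xe_vm *vm)
{
	/* All LR jobs preempted / idle: let the copy engine power down */
	xe_migrate_lr_vm_put(vm->xe->tiles[0].migrate);
	xe_pm_runtime_put(vm->xe);
}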

Matt

[1] https://patchwork.freedesktop.org/patch/607842/?series=137128&rev=1
[2] https://patchwork.freedesktop.org/patch/607841/?series=137128&rev=1

> /Thomas
> 
> 
> > 
> > Matthew Brost (8):
> >   drm/xe: Add xe_hw_engine_write_ring_tail
> >   drm/xe: Add ULLS support to LRC
> >   drm/xe: Add ULLS flags for jobs
> >   drm/xe: Add ULLS migration job support to migration layer
> >   drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> >   drm/xe: Add ULLS migration job support to ring ops
> >   drm/xe: Add ULLS migration job support to GuC submission
> >   drm/xe: Enable ULLS migration jobs when opening LR VM
> > 
> >  .../gpu/drm/xe/instructions/xe_mi_commands.h  |   6 +
> >  drivers/gpu/drm/xe/xe_guc_submit.c            |  26 +++-
> >  drivers/gpu/drm/xe/xe_hw_engine.c             |  10 ++
> >  drivers/gpu/drm/xe/xe_hw_engine.h             |   1 +
> >  drivers/gpu/drm/xe/xe_lrc.c                   |  49 +++++++
> >  drivers/gpu/drm/xe/xe_lrc.h                   |   3 +
> >  drivers/gpu/drm/xe/xe_lrc_types.h             |   2 +
> >  drivers/gpu/drm/xe/xe_migrate.c               | 130 +++++++++++++++++-
> >  drivers/gpu/drm/xe/xe_migrate.h               |   4 +
> >  drivers/gpu/drm/xe/xe_ring_ops.c              |  32 +++++
> >  drivers/gpu/drm/xe/xe_sched_job_types.h       |   3 +
> >  drivers/gpu/drm/xe/xe_vm.c                    |  10 ++
> >  12 files changed, 268 insertions(+), 8 deletions(-)
> > 
> 

