[RFC PATCH 0/8] ULLS for kernel submission of migration jobs

Thomas Hellström thomas.hellstrom at linux.intel.com
Tue Sep 17 06:59:10 UTC 2024


Hi, Matt

On Mon, 2024-09-16 at 13:55 +0000, Matthew Brost wrote:
> On Mon, Aug 12, 2024 at 05:26:26PM +0000, Matthew Brost wrote:
> > On Mon, Aug 12, 2024 at 10:53:01AM +0200, Thomas Hellström wrote:
> > > Hi, Matt,
> > > 
> > > On Sun, 2024-08-11 at 19:47 -0700, Matthew Brost wrote:
> > > > Ultra low latency for kernel submission of migration jobs.
> > > > 
> > > > The basic idea is that faults (CPU or GPU) typically depend on
> > > > migration jobs. Faults should be addressed as quickly as possible,
> > > > but context switches via GuC on hardware are slow. To avoid context
> > > > switches, perform ULLS in the kernel for migration jobs on discrete
> > > > faulting devices with an LR VM open.
> > > > 
> > > > This is implemented by switching the migration layer to ULLS mode
> > > > upon opening an LR VM. In ULLS mode, migration jobs have a preamble
> > > > and postamble: the preamble clears the current semaphore value, and
> > > > the postamble waits for the next semaphore value. Each job
> > > > submission sets the current semaphore in memory, bypassing the GuC.
> > > > The net effect is that the migration execution queue never gets
> > > > switched off the hardware while an LR VM is open.
> > > > 
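
Just to check that I'm reading the mechanism right, below is a rough
sketch of the job layout as I picture it. The emit_* helpers, ulls_kick()
and the exact MI_SEMAPHORE_WAIT flag names are my own placeholders
(loosely following the i915-style definitions), not the code in this
series:

#include "instructions/xe_mi_commands.h" /* semaphore defs added in patch 5 */

/* Preamble: clear the semaphore so a stale value from the previous job
 * cannot satisfy this job's postamble wait.
 */
static u32 *emit_ulls_preamble(u32 *dw, u64 sem_addr)
{
	*dw++ = MI_STORE_DATA_IMM | MI_SDI_GGTT | MI_SDI_NUM_DW(1);
	*dw++ = lower_32_bits(sem_addr);
	*dw++ = upper_32_bits(sem_addr);
	*dw++ = 0;

	return dw;
}

/* Postamble: poll the semaphore until the CPU writes the next job's
 * sequence number, so the exec queue is never switched off the hardware
 * between migration jobs.
 */
static u32 *emit_ulls_postamble(u32 *dw, u64 sem_addr, u32 next_seqno)
{
	*dw++ = MI_SEMAPHORE_WAIT | MI_SEMAPHORE_GLOBAL_GTT |
		MI_SEMAPHORE_POLL | MI_SEMAPHORE_SAD_GTE_SDD;
	*dw++ = next_seqno;
	*dw++ = lower_32_bits(sem_addr);
	*dw++ = upper_32_bits(sem_addr);

	return dw;
}

/* Submission (CPU side): release the previous job's postamble by writing
 * the semaphore value directly in memory -- no GuC involvement, so no
 * context switch.
 */
static void ulls_kick(u32 *sem_cpu_addr, u32 next_seqno)
{
	WRITE_ONCE(*sem_cpu_addr, next_seqno);
}

If that matches what the series does, then the copy engine effectively
spins in the postamble between jobs, which is also where my power
management worry below comes from.
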
> > > > There may be concerns regarding power management, as the ring
> > > > program continuously runs on a copy engine, and a force wake
> > > > reference to a copy engine is held with an LR VM open.
> > > > 
> > > > The implementation has been lightly tested but seems to be working.
> > > > 
> > > > This approach will likely be put on hold until SVM is operational
> > > > with benchmarks, but it is being posted early for feedback and as a
> > > > public checkpoint.
> > > > 
> > > > Matt
> > > 
> > > The main concern I have with this is that, at least according to
> > > upstream discussions, pagefaults are so slow that a performant stack
> > > needs to try extremely hard to avoid them using manual prefaults. If
> > > we do hit a GPU pagefault we've already lost, and any migration
> > > latency optimization won't matter much.
> > > 
> > 
> > I agree that if pagefaults are being hit all the time we are in
> > trouble performance-wise, but that doesn't mean we shouldn't try to
> > make servicing them as fast as possible when they do occur.
> > 
> 
> Okay, there is definitely something to this. I have an SVM test that
> serially bounces a 2M allocation via fault about 1,000 times between
> the CPU and GPU (i.e., each bounce results in a GuC context switch). I
> applied this series and added a modparam to enable/disable ULLS to the
> tip of my latest working branch [3].
>
> Without ULLS enabled, the test on average took about 3 seconds on BMG.
> With ULLS enabled, the test on average took 2 seconds. I was expecting
> a small gain, but this is significant. In other similar sections I have
> seen speedups nearing 2x. This seems to indicate that the series is
> definitely worth pursuing.
> 
> Also note that any operation using the migrate engine (e.g., clearing
> BOs, prefetches, eviction) will see latency gains.

I'm still concerned about adding this, and if we do, we should only use
it as a last resort, and only if we see significant performance
improvements in real-world applications. Historically, all the latency
optimizations were what screwed up the maintainability of the i915
driver, and I think we should be extremely careful so we don't end up in
the same situation. I also have concerns around power management.

Regardless, ULLS should only really improve things if we fail to
pipeline migrations on the HW in the non-ULLS case, assuming that we're
using a single exec-queue in both cases. Is that because we wait for
each migration to complete before allowing the next one? If so, is that
something we could look at?
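
To make that concrete, the two submission patterns I have in mind look
roughly like this (struct migrate_ctx and migrate_chunk() are
hypothetical stand-ins for whatever the migration layer actually uses to
queue one copy job on the shared exec queue and return its fence; a
sketch only, not driver code):

#include <linux/dma-fence.h>
#include <linux/err.h>

struct migrate_ctx;		/* hypothetical migration-layer state */
struct dma_fence *migrate_chunk(struct migrate_ctx *ctx, int i);

/* Serialized: wait for each copy before issuing the next. The copy
 * engine idles between jobs, so every job pays a GuC round trip -- this
 * is the latency ULLS hides.
 */
static int migrate_serialized(struct migrate_ctx *ctx, int nchunks)
{
	int i, err;

	for (i = 0; i < nchunks; i++) {
		struct dma_fence *fence = migrate_chunk(ctx, i);

		if (IS_ERR(fence))
			return PTR_ERR(fence);

		err = dma_fence_wait(fence, false);
		dma_fence_put(fence);
		if (err)
			return err;
	}

	return 0;
}

/* Pipelined: queue all copies back to back on the same exec queue and
 * only wait on the last fence. Ring-level ordering keeps the copy
 * engine busy without any ULLS semaphore tricks.
 */
static int migrate_pipelined(struct migrate_ctx *ctx, int nchunks)
{
	struct dma_fence *fence = NULL;
	int i, err;

	for (i = 0; i < nchunks; i++) {
		dma_fence_put(fence);	/* dma_fence_put() is NULL-safe */
		fence = migrate_chunk(ctx, i);
		if (IS_ERR(fence))
			return PTR_ERR(fence);
	}

	if (!fence)
		return 0;

	err = dma_fence_wait(fence, false);
	dma_fence_put(fence);

	return err;
}

If it's the former, improving the pipelining in the non-ULLS path might
be worth looking at first.
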

Thanks,
Thomas



> 
> Matt
> 
> > > Also, for power management, LR VM open is a very simple strategy,
> > > which is good, but shouldn't it be possible to hook that up to LR
> > > job running, similar to vm->preempt.rebind_deactivated?
> > > 
> > 
> > That seems possible. In this scenario we'd hook the
> > xe_migrate_lr_vm_get / put calls [1] [2] and the runtime PM calls
> > into the LR VM activate / deactivate calls rather than the LR VM
> > open / close calls.
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/patch/607842/?series=137128&rev=1
> > [2] https://patchwork.freedesktop.org/patch/607841/?series=137128&rev=1
> > 
> > > /Thomas
> > > 
> > > 
> > > > 
> > > > Matthew Brost (8):
> > > >   drm/xe: Add xe_hw_engine_write_ring_tail
> > > >   drm/xe: Add ULLS support to LRC
> > > >   drm/xe: Add ULLS flags for jobs
> > > >   drm/xe: Add ULLS migration job support to migration layer
> > > >   drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> > > >   drm/xe: Add ULLS migration job support to ring ops
> > > >   drm/xe: Add ULLS migration job support to GuC submission
> > > >   drm/xe: Enable ULLS migration jobs when opening LR VM
> > > > 
> > > >  .../gpu/drm/xe/instructions/xe_mi_commands.h  |   6 +
> > > >  drivers/gpu/drm/xe/xe_guc_submit.c            |  26 +++-
> > > >  drivers/gpu/drm/xe/xe_hw_engine.c             |  10 ++
> > > >  drivers/gpu/drm/xe/xe_hw_engine.h             |   1 +
> > > >  drivers/gpu/drm/xe/xe_lrc.c                   |  49 +++++++
> > > >  drivers/gpu/drm/xe/xe_lrc.h                   |   3 +
> > > >  drivers/gpu/drm/xe/xe_lrc_types.h             |   2 +
> > > >  drivers/gpu/drm/xe/xe_migrate.c               | 130 +++++++++++++++++-
> > > >  drivers/gpu/drm/xe/xe_migrate.h               |   4 +
> > > >  drivers/gpu/drm/xe/xe_ring_ops.c              |  32 +++++
> > > >  drivers/gpu/drm/xe/xe_sched_job_types.h       |   3 +
> > > >  drivers/gpu/drm/xe/xe_vm.c                    |  10 ++
> > > >  12 files changed, 268 insertions(+), 8 deletions(-)
> > > > 
> > > 


