[RFC PATCH 0/8] ULLS for kernel submission of migration jobs
Matthew Brost
matthew.brost at intel.com
Tue Sep 17 14:39:01 UTC 2024
On Tue, Sep 17, 2024 at 08:59:10AM +0200, Thomas Hellström wrote:
> Hi, Matt
>
> On Mon, 2024-09-16 at 13:55 +0000, Matthew Brost wrote:
> > On Mon, Aug 12, 2024 at 05:26:26PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 12, 2024 at 10:53:01AM +0200, Thomas Hellström wrote:
> > > > Hi, Matt,
> > > >
> > > > On Sun, 2024-08-11 at 19:47 -0700, Matthew Brost wrote:
> > > > > Ultra low latency for kernel submission of migration jobs.
> > > > >
> > > > > The basic idea is that faults (CPU or GPU) typically depend on
> > > > > migration jobs. Faults should be addressed as quickly as
> > > > > possible, but context switches via GuC on hardware are slow. To
> > > > > avoid context switches, perform ULLS in the kernel for migration
> > > > > jobs on discrete faulting devices with an LR VM open.
> > > > >
> > > > > This is implemented by switching the migration layer to ULLS mode
> > > > > upon opening an LR VM. In ULLS mode, migration jobs have a
> > > > > preamble and postamble: the preamble clears the current semaphore
> > > > > value, and the postamble waits for the next semaphore value. Each
> > > > > job submission sets the current semaphore in memory, bypassing
> > > > > the GuC. The net effect is that the migration execution queue
> > > > > never gets switched off the hardware while an LR VM is open.
> > > > >
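To make that handshake concrete, here is a rough user-space model of the
flow (illustrative only, not the actual ring programming: the preamble
clear is omitted to keep the sketch race-free, and names like ulls_sem
and copy_engine are made up). The "engine" thread polls a semaphore in
memory between jobs, and the submitter releases the next job by writing
the value directly to memory, with no GuC round trip:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_JOBS 4

/* Semaphore dword the engine polls; written directly by the submitter. */
static _Atomic unsigned int ulls_sem;

static void *copy_engine(void *arg)
{
	(void)arg;

	for (unsigned int job = 0; job < NUM_JOBS; job++) {
		/*
		 * Postamble of the previous job: spin on the semaphore in
		 * memory (MI_SEMAPHORE_WAIT in the real series) until the
		 * next job has been submitted, instead of letting the
		 * context get switched off the hardware.
		 */
		while (atomic_load(&ulls_sem) <= job)
			;
		printf("engine: running migration job %u\n", job);
	}
	return NULL;
}

int main(void)
{
	pthread_t engine;

	pthread_create(&engine, NULL, copy_engine, NULL);

	/*
	 * Submission side: after writing the job's instructions into the
	 * ring (elided here), release it by bumping the semaphore value
	 * in memory -- no GuC round trip, no context switch.
	 */
	for (unsigned int seq = 1; seq <= NUM_JOBS; seq++)
		atomic_store(&ulls_sem, seq);

	pthread_join(engine, NULL);
	return 0;
}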
> > > > > There may be concerns regarding power management, as the ring
> > > > > program continuously runs on a copy engine, and a force wake
> > > > > reference to a copy engine is held with an LR VM open.
> > > > >
> > > > > The implementation has been lightly tested but seems to be
> > > > > working.
> > > > >
> > > > > This approach will likely be put on hold until SVM is operational
> > > > > with benchmarks, but it is being posted early for feedback and as
> > > > > a public checkpoint.
> > > > >
> > > > > Matt
> > > >
> > > > The main concern I have with this is that, at least according to
> > > > upstream discussions, pagefaults are so slow anyway that a
> > > > performant stack needs to try extremely hard to avoid them using
> > > > manual prefaults, and if we hit a gpu pagefault, we've lost anyway
> > > > and any migration latency optimization won't matter much.
> > > >
> > >
> > > I agree that if pagefaults are getting hit all the time we are in
> > > trouble wrt performance, but that doesn't mean that when they do
> > > occur we shouldn't try to make servicing them as fast as possible.
> > >
> >
> > Okay, there is definitely something to this. I have an SVM test that
> > serially bounces (i.e., each bounce results in a GuC context switch)
> > a 2M allocation via fault about 1,000 times between the CPU and GPU.
> > I applied this series and added a modparam to enable/disable ULLS to
> > the tip of my latest working branch [3].
> >
> > Without ULLS enabled, the test on average took about 3 seconds on BMG.
> > With ULLS enabled, it took about 2 seconds on average. I was expecting
> > a small gain, but this is significant. In other similar sections I
> > have seen speeds nearing twice as fast. This seems to indicate that
> > the series is definitely worth pursuing.
> >
> > Also note that any operation using the migrate engine (e.g., clearing
> > BOs, prefetches, eviction) will see latency gains.
>
> I'm still concerned about adding this, and if we do, we should only use
> it as a last resort if we see significant performance improvements in
> real-world applications. Historically, all the latency optimizations
> were what screwed up the maintainability of the i915 driver, and I think we
I agree real-world application performance is the acceptance criterion,
and maintainability is always a concern. FWIW, I think this series is a
clean implementation, so from a maintainability PoV it is not a huge
concern. If it created all sorts of new concepts I'd be concerned, but
it doesn't - it largely fits into the already defined layers rather
nicely.
> should be extremely careful so we don't end up in the same situation.
> There are also concerns around power management.
>
Agreed, we'd have to look at the PM implications too. I added a
forcewake, but I'm really unsure if that is required aside from keeping
our asserts happy. Obviously a spinning batch will use some amount of
power, but the UMDs are already doing this too and that was deemed
acceptable.
> Regardless, ULLS only really should improve things if we fail to
> pipeline migrations on the HW in the non-ULLS case, assuming that we're
> using a single exec-queue in both cases. Is that because we wait for
A single VM per device actually, due to locking. Yes, the single-VM
tests show a much larger perf improvement compared to multi-VM /
multi-process tests, since in the latter multiple fault workers feeding
the copy engine in parallel already avoid a context switch on each
copy. There is still an improvement in the latter case, though, albeit
not as dramatic (~80% vs. ~21%).
> each migration to complete before allowing the next one? If so, is that
> something we could look at?
For fault handling within a VM, that is not something we can easily
remove due to the serial nature of the migration API, as the copy is
expected to complete between the migrate_vma_setup() and
migrate_vma_finalize() calls. Then *after* this, page collection and
binding still need to be done in the GPU fault case.
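For reference, the serial shape of that API looks roughly like the
sketch below (a simplified kernel-context sketch of the core
migrate_vma flow; error handling is trimmed, and xe_alloc_vram_pages()
/ xe_copy_and_wait() are made-up placeholders for the driver-side
steps, not functions from this series):

#include <linux/migrate.h>

static int migrate_range_to_vram(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 unsigned long *src_pfns,
				 unsigned long *dst_pfns, void *owner)
{
	struct migrate_vma args = {
		.vma		= vma,
		.src		= src_pfns,
		.dst		= dst_pfns,
		.start		= start,
		.end		= end,
		.pgmap_owner	= owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int err;

	err = migrate_vma_setup(&args);		/* collect and isolate source pages */
	if (err || !args.cpages)
		return err;

	/*
	 * Fill args.dst[] with device pages and issue the copy. The copy
	 * must be complete before the two calls below -- this is the
	 * serial step ULLS makes cheaper but cannot remove. Both helpers
	 * here are hypothetical stand-ins for the driver-side work.
	 */
	err = xe_alloc_vram_pages(&args);
	if (!err)
		err = xe_copy_and_wait(&args);

	migrate_vma_pages(&args);		/* commit pages that migrated */
	migrate_vma_finalize(&args);		/* release refs, restore the rest */

	return err;
}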
For prefetches, I think we basically are going to have to pipeline
things for performance, but I think this is a bit easier as we have a
large working set upfront compared to a single fault address. This will
likely get quite complicated though, much more so than what is in this
series (e.g. a software pipeline with fence generation and async waits
/ CBs, possibly multiple workers, etc.).
Matt
>
> Thanks,
> Thomas
>
>
>
> >
> > Matt
> >
> > > > Also, for power management, LR VM open is a very simple strategy,
> > > > which is good, but shouldn't it be possible to hook that up to LR
> > > > job running, similar to vm->preempt.rebind_deactivated?
> > > >
> > >
> > > That seems possible. Then in this scenario we'd hook the
> > > xe_migrate_lr_vm_get / put calls [1] [2] and the runtime PM calls
> > > into the LR VM activate / deactivate calls rather than the LR VM
> > > open / close calls.
> > >
> > > Matt
> > >
> > > [1]
> > > https://patchwork.freedesktop.org/patch/607842/?series=137128&rev=1
> > > [2]
> > > https://patchwork.freedesktop.org/patch/607841/?series=137128&rev=1
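If we go that route, I'd imagine the shape would be something like the
sketch below (purely hypothetical, single-tile shown for brevity: the
lr_active_count field and the exact xe_migrate_lr_vm_get/put signatures
are assumptions, not what is posted), i.e. only hold the ULLS reference
and runtime PM while the VM actually has LR work active:

/* Hypothetical hooks; the field and signatures below are assumptions. */
static void xe_vm_lr_activate(struct xe_vm *vm)
{
	/* First LR exec queue becomes runnable: pin the migrate queue in
	 * ULLS mode and hold a runtime PM reference. */
	if (atomic_inc_return(&vm->lr_active_count) == 1) {
		xe_pm_runtime_get(vm->xe);
		xe_migrate_lr_vm_get(vm->xe->tiles[0].migrate);
	}
}

static void xe_vm_lr_deactivate(struct xe_vm *vm)
{
	/* Last LR exec queue goes idle / is preempted: allow the migrate
	 * queue to be switched off and drop the PM reference. */
	if (atomic_dec_and_test(&vm->lr_active_count)) {
		xe_migrate_lr_vm_put(vm->xe->tiles[0].migrate);
		xe_pm_runtime_put(vm->xe);
	}
}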
> > >
> > > > /Thomas
> > > >
> > > >
> > > > >
> > > > > Matthew Brost (8):
> > > > > drm/xe: Add xe_hw_engine_write_ring_tail
> > > > > drm/xe: Add ULLS support to LRC
> > > > > drm/xe: Add ULLS flags for jobs
> > > > > drm/xe: Add ULLS migration job support to migration layer
> > > > > drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> > > > > drm/xe: Add ULLS migration job support to ring ops
> > > > > drm/xe: Add ULLS migration job support to GuC submission
> > > > > drm/xe: Enable ULLS migration jobs when opening LR VM
> > > > >
> > > > > .../gpu/drm/xe/instructions/xe_mi_commands.h | 6 +
> > > > > drivers/gpu/drm/xe/xe_guc_submit.c | 26 +++-
> > > > > drivers/gpu/drm/xe/xe_hw_engine.c | 10 ++
> > > > > drivers/gpu/drm/xe/xe_hw_engine.h | 1 +
> > > > > drivers/gpu/drm/xe/xe_lrc.c | 49 +++++++
> > > > > drivers/gpu/drm/xe/xe_lrc.h | 3 +
> > > > > drivers/gpu/drm/xe/xe_lrc_types.h | 2 +
> > > > > drivers/gpu/drm/xe/xe_migrate.c | 130 +++++++++++++++++-
> > > > > drivers/gpu/drm/xe/xe_migrate.h | 4 +
> > > > > drivers/gpu/drm/xe/xe_ring_ops.c | 32 +++++
> > > > > drivers/gpu/drm/xe/xe_sched_job_types.h | 3 +
> > > > > drivers/gpu/drm/xe/xe_vm.c | 10 ++
> > > > > 12 files changed, 268 insertions(+), 8 deletions(-)
> > > > >
> > > >
>