[Intel-gfx] [RFC 4/4] drm/i915/gt: Pipelined page migration
Chris Wilson
chris at chris-wilson.co.uk
Mon Nov 30 16:44:44 UTC 2020
Quoting Tvrtko Ursulin (2020-11-30 16:26:44)
>
> On 30/11/2020 16:21, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-30 16:07:44)
> >>
> >> On 30/11/2020 13:39, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2020-11-30 13:12:55)
> >>>> On 28/11/2020 18:40, Chris Wilson wrote:
> >>>>> If we pipeline the PTE updates and then do the copy of those pages
> >>>>> within a single unpreemptible command packet, we can submit the copies
> >>>>> and leave them to be scheduled without having to synchronously wait
> >>>>> under a global lock.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>> ---
> >>>>> drivers/gpu/drm/i915/Makefile | 1 +
> >>>>> drivers/gpu/drm/i915/gt/intel_engine.h | 1 +
> >>>>> drivers/gpu/drm/i915/gt/intel_migrate.c | 370 ++++++++++++++++++
> >>>>> drivers/gpu/drm/i915/gt/intel_migrate.h | 33 ++
> >>>>> drivers/gpu/drm/i915/gt/selftest_migrate.c | 105 +++++
> >>>>> .../drm/i915/selftests/i915_live_selftests.h | 1 +
> >>>>> 6 files changed, 511 insertions(+)
> >>>>> create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>> create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
> >>>>> create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >>>>> index e5574e506a5c..0b2e12c87f9d 100644
> >>>>> --- a/drivers/gpu/drm/i915/Makefile
> >>>>> +++ b/drivers/gpu/drm/i915/Makefile
> >>>>> @@ -103,6 +103,7 @@ gt-y += \
> >>>>> gt/intel_gtt.o \
> >>>>> gt/intel_llc.o \
> >>>>> gt/intel_lrc.o \
> >>>>> + gt/intel_migrate.o \
> >>>>> gt/intel_mocs.o \
> >>>>> gt/intel_ppgtt.o \
> >>>>> gt/intel_rc6.o \
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> index ac58fcda4927..079d26b47a97 100644
> >>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
> >>>>> #define I915_GEM_HWS_PREEMPT_ADDR (I915_GEM_HWS_PREEMPT * sizeof(u32))
> >>>>> #define I915_GEM_HWS_SEQNO 0x40
> >>>>> #define I915_GEM_HWS_SEQNO_ADDR (I915_GEM_HWS_SEQNO * sizeof(u32))
> >>>>> +#define I915_GEM_HWS_MIGRATE (0x42 * sizeof(u32))
> >>>>> #define I915_GEM_HWS_SCRATCH 0x80
> >>>>>
> >>>>> #define I915_HWS_CSB_BUF0_INDEX 0x10
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>> new file mode 100644
> >>>>> index 000000000000..4d7bd32eb8d4
> >>>>> --- /dev/null
> >>>>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>> @@ -0,0 +1,370 @@
> >>>>> +// SPDX-License-Identifier: MIT
> >>>>> +/*
> >>>>> + * Copyright © 2020 Intel Corporation
> >>>>> + */
> >>>>> +
> >>>>> +#include "i915_drv.h"
> >>>>> +#include "intel_context.h"
> >>>>> +#include "intel_gt.h"
> >>>>> +#include "intel_gtt.h"
> >>>>> +#include "intel_lrc.h" /* virtual engine */
> >>>>> +#include "intel_migrate.h"
> >>>>> +#include "intel_ring.h"
> >>>>> +
> >>>>> +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
> >>>>> +
> >>>>> +static void insert_pte(struct i915_address_space *vm,
> >>>>> + struct i915_page_table *pt,
> >>>>> + void *data)
> >>>>> +{
> >>>>> + u64 *offset = data;
> >>>>> +
> >>>>> + vm->insert_page(vm, px_dma(pt), *offset, I915_CACHE_NONE, 0);
> >>>>> + *offset += PAGE_SIZE;
> >>>>> +}
> >>>>> +
> >>>>> +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> >>>>> +{
> >>>>> + struct i915_vm_pt_stash stash = {};
> >>>>> + struct i915_ppgtt *vm;
> >>>>> + u64 offset, sz;
> >>>>> + int err;
> >>>>> +
> >>>>> + vm = i915_ppgtt_create(gt);
> >>>>> + if (IS_ERR(vm))
> >>>>> + return ERR_CAST(vm);
> >>>>> +
> >>>>> + if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
> >>>>> + err = -ENODEV;
> >>>>> + goto err_vm;
> >>>>> + }
> >>>>> +
> >>>>> + /*
> >>>>> + * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
> >>>>> + * 4x2 page directories for source/destination.
> >>>>> + */
> >>>>> + sz = 2 * CHUNK_SZ;
> >>>>> + offset = sz;
> >>>>> +
> >>>>> + /*
> >>>>> + * We need another page directory setup so that we can write
> >>>>> + * the 8x512 PTE in each chunk.
> >>>>> + */
> >>>>> + sz += (sz >> 12) * sizeof(u64);
> >>>>> +
> >>>>> + err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> >>>>> + if (err)
> >>>>> + goto err_vm;
> >>>>> +
> >>>>> + err = i915_vm_pin_pt_stash(&vm->vm, &stash);
> >>>>> + if (err) {
> >>>>> + i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>> + goto err_vm;
> >>>>> + }
> >>>>> +
> >>>>> + vm->vm.allocate_va_range(&vm->vm, &stash, 0, sz);
> >>>>> + i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>> +
> >>>>> + /* Now allow the GPU to rewrite the PTE via its own ppGTT */
> >>>>> + vm->vm.foreach(&vm->vm, 0, sz, insert_pte, &offset);
> >>>>
> >>>> This is just making the [0 - sz) gva point to the allocated sz bytes of
> >>>> backing store?
> >>>
> >>> Not quite, we are pointing [offset, sz) to the page directories
> >>> themselves. When we write into that range, we are modifying the PTE of
> >>> this ppGTT. (Breaking the fourth wall, allowing the non-privileged
> >>> context to update its own page tables.)
> >>
> >> Confusing, so first step is here making [0, sz) gva ptes point to pte
> >> backing store.
> >>
> >> Which means emit_pte when called from intel_context_migrate_pages is
> >> emitting sdw which will write into [0, sz] gva, effectively placing the
> >> src and dest object content at those location.
> >>
> >> Then blit copies the data.
> >>
> >> What happens if context is re-used? Gva's no longer point to PTEs so
> >> how? Is a step to re-initialize after use missing? More sdw after the
> >> blit copy to make it point back to ptes?
> >
> > Every chunk starts with writing the PTEs used by the chunk. Within the
> > chunk, the GPU commands are not preemptible, so the PTEs cannot be
> > replaced before they are consumed.
> >
> > We do leave the PTE pointing to stale pages after the blit, but since
> > the vm is only used by first rewriting the PTE, those dangling PTE
> > should never be chased.
> >
> > So each chunk does a fixed copy between the same pair of addresses
> > within the vm; the magic is in rewriting the PTE to point at the pages
> > of interest for the chunk.
>
> Isn't the vm getting potentially re-used on the fall-back path, if
> virtual engine creation fails? In which case dangling PTEs would be
> written to. Or I am still missing something?
The same vm is used by all contexts (only one is created).
We require that the PTE write + copy occur uninterrupted, which is fine
as we control the arbitration points.
> >>>>> + i915_request_add(rq);
> >>>>> + if (it_s.sg)
> >>>>> + cond_resched();
> >>>>
> >>>> From what context does this run? No preemptible?
> >>>
> >>> Has to be process context; numerous allocations, implicit waits (that we
> >>> want to avoid in practice), and the timeline (per-context) mutex to
> >>> guard access to the ringbuffer.
> >>
> >> I guess on non-preemptible, or not fully preemptible kernel it is useful.
> >
> > Oh, the cond_resched() is almost certainly overkill. But potentially a
> > very large loop so worth being careful.
>
> Yes, could be a large loop so doesn't harm.
>
> Is the ringbuffer large enough to avoid stalls in that path?
Is any ringbuffer large enough to be able to copy between 2 128PiB
objects?
It works out that we can only handle ~500M copies before the buffer is
full and we must clear a chunk. Switching to a new context at that point
would keep on extending it until we run out of lmem and/or smem, but the
cost of fabricating a new context at that point makes it dubious.
I think that the greater impact would be from different clients using the
global context; fighting over the ringbuffer is still an improvement over
everyone doing a synchronous copy and blocking all, but the single
timeline still acts as a global serialisation and priority barrier.
-Chris
More information about the Intel-gfx
mailing list