[Intel-gfx] [RFC 4/4] drm/i915/gt: Pipelined page migration

Tue Dec 1 09:33:20 UTC 2020

Quoting Tvrtko Ursulin (2020-12-01 09:26:05)
> 
> On 30/11/2020 16:44, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-11-30 16:26:44)
> >>
> >> On 30/11/2020 16:21, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2020-11-30 16:07:44)
> >>>>
> >>>> On 30/11/2020 13:39, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2020-11-30 13:12:55)
> >>>>>> On 28/11/2020 18:40, Chris Wilson wrote:
> >>>>>>> If we pipeline the PTE updates and then do the copy of those pages
> >>>>>>> within a single unpreemptible command packet, we can submit the copies
> >>>>>>> and leave them to be scheduled without having to synchronously wait
> >>>>>>> under a global lock.
> >>>>>>>
> >>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>>>>>> ---
> >>>>>>>      drivers/gpu/drm/i915/Makefile                 |   1 +
> >>>>>>>      drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
> >>>>>>>      drivers/gpu/drm/i915/gt/intel_migrate.c       | 370 ++++++++++++++++++
> >>>>>>>      drivers/gpu/drm/i915/gt/intel_migrate.h       |  33 ++
> >>>>>>>      drivers/gpu/drm/i915/gt/selftest_migrate.c    | 105 +++++
> >>>>>>>      .../drm/i915/selftests/i915_live_selftests.h  |   1 +
> >>>>>>>      6 files changed, 511 insertions(+)
> >>>>>>>      create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>>>>      create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
> >>>>>>>      create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
> >>>>>>> index e5574e506a5c..0b2e12c87f9d 100644
> >>>>>>> --- a/drivers/gpu/drm/i915/Makefile
> >>>>>>> +++ b/drivers/gpu/drm/i915/Makefile
> >>>>>>> @@ -103,6 +103,7 @@ gt-y += \
> >>>>>>>          gt/intel_gtt.o \
> >>>>>>>          gt/intel_llc.o \
> >>>>>>>          gt/intel_lrc.o \
> >>>>>>> +     gt/intel_migrate.o \
> >>>>>>>          gt/intel_mocs.o \
> >>>>>>>          gt/intel_ppgtt.o \
> >>>>>>>          gt/intel_rc6.o \
> >>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>>>> index ac58fcda4927..079d26b47a97 100644
> >>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> >>>>>>> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
> >>>>>>>      #define I915_GEM_HWS_PREEMPT_ADDR   (I915_GEM_HWS_PREEMPT * sizeof(u32))
> >>>>>>>      #define I915_GEM_HWS_SEQNO          0x40
> >>>>>>>      #define I915_GEM_HWS_SEQNO_ADDR             (I915_GEM_HWS_SEQNO * sizeof(u32))
> >>>>>>> +#define I915_GEM_HWS_MIGRATE         (0x42 * sizeof(u32))
> >>>>>>>      #define I915_GEM_HWS_SCRATCH                0x80
> >>>>>>>      
> >>>>>>>      #define I915_HWS_CSB_BUF0_INDEX             0x10
> >>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>>>> new file mode 100644
> >>>>>>> index 000000000000..4d7bd32eb8d4
> >>>>>>> --- /dev/null
> >>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
> >>>>>>> @@ -0,0 +1,370 @@
> >>>>>>> +// SPDX-License-Identifier: MIT
> >>>>>>> +/*
> >>>>>>> + * Copyright © 2020 Intel Corporation
> >>>>>>> + */
> >>>>>>> +
> >>>>>>> +#include "i915_drv.h"
> >>>>>>> +#include "intel_context.h"
> >>>>>>> +#include "intel_gt.h"
> >>>>>>> +#include "intel_gtt.h"
> >>>>>>> +#include "intel_lrc.h" /* virtual engine */
> >>>>>>> +#include "intel_migrate.h"
> >>>>>>> +#include "intel_ring.h"
> >>>>>>> +
> >>>>>>> +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
> >>>>>>> +
> >>>>>>> +static void insert_pte(struct i915_address_space *vm,
> >>>>>>> +                    struct i915_page_table *pt,
> >>>>>>> +                    void *data)
> >>>>>>> +{
> >>>>>>> +     u64 *offset = data;
> >>>>>>> +
> >>>>>>> +     vm->insert_page(vm, px_dma(pt), *offset, I915_CACHE_NONE, 0);
> >>>>>>> +     *offset += PAGE_SIZE;
> >>>>>>> +}
> >>>>>>> +
> >>>>>>> +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
> >>>>>>> +{
> >>>>>>> +     struct i915_vm_pt_stash stash = {};
> >>>>>>> +     struct i915_ppgtt *vm;
> >>>>>>> +     u64 offset, sz;
> >>>>>>> +     int err;
> >>>>>>> +
> >>>>>>> +     vm = i915_ppgtt_create(gt);
> >>>>>>> +     if (IS_ERR(vm))
> >>>>>>> +             return ERR_CAST(vm);
> >>>>>>> +
> >>>>>>> +     if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
> >>>>>>> +             err = -ENODEV;
> >>>>>>> +             goto err_vm;
> >>>>>>> +     }
> >>>>>>> +
> >>>>>>> +     /*
> >>>>>>> +      * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
> >>>>>>> +      * 4x2 page directories for source/destination.
> >>>>>>> +      */
> >>>>>>> +     sz = 2 * CHUNK_SZ;
> >>>>>>> +     offset = sz;
> >>>>>>> +
> >>>>>>> +     /*
> >>>>>>> +      * We need another page directory setup so that we can write
> >>>>>>> +      * the 8x512 PTE in each chunk.
> >>>>>>> +      */
> >>>>>>> +     sz += (sz >> 12) * sizeof(u64);
> >>>>>>> +
> >>>>>>> +     err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
> >>>>>>> +     if (err)
> >>>>>>> +             goto err_vm;
> >>>>>>> +
> >>>>>>> +     err = i915_vm_pin_pt_stash(&vm->vm, &stash);
> >>>>>>> +     if (err) {
> >>>>>>> +             i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>>>> +             goto err_vm;
> >>>>>>> +     }
> >>>>>>> +
> >>>>>>> +     vm->vm.allocate_va_range(&vm->vm, &stash, 0, sz);
> >>>>>>> +     i915_vm_free_pt_stash(&vm->vm, &stash);
> >>>>>>> +
> >>>>>>> +     /* Now allow the GPU to rewrite the PTE via its own ppGTT */
> >>>>>>> +     vm->vm.foreach(&vm->vm, 0, sz, insert_pte, &offset);
> >>>>>>
> >>>>>> This is just making the [0 - sz) gva point to the allocated sz bytes of
> >>>>>> backing store?
> >>>>>
> >>>>> Not quite, we are pointing [offset, sz) to the page directories
> >>>>> themselves. When we write into that range, we are modifying the PTE of
> >>>>> this ppGTT. (Breaking the fourth wall, allowing the non-privileged
> >>>>> context to update its own page tables.)
> >>>>
> >>>> Confusing, so first step is here making [0, sz) gva ptes point to pte
> >>>> backing store.
> >>>>
> >>>> Which means emit_pte when called from intel_context_migrate_pages is
> >>>> emitting sdw which will write into [0, sz] gva, effectively placing the
> >>>> src and dest object content at those location.
> >>>>
> >>>> Then blit copies the data.
> >>>>
> >>>> What happens if context is re-used? Gva's no longer point to PTEs so
> >>>> how? Is a step to re-initialize after use missing? More sdw after the
> >>>> blit copy to make it point back to ptes?
> >>>
> >>> Every chunk starts with writing the PTEs used by the chunk. Within the
> >>> chunk, the GPU commands are not preemptible, so the PTEs cannot be
> >>> replaced before they are consumed.
> >>>
> >>> We do leave the PTE pointing to stale pages after the blit, but since
> >>> the vm is only used by first rewriting the PTE, those dangling PTE
> >>> should never be chased.
> >>>
> >>> So each chunk does a fixed copy between the same pair of addresses
> >>> within the vm; the magic is in rewriting the PTE to point at the pages
> >>> of interest for the chunk.
> >>
> >> Isn't the vm getting potentially re-used on the fall-back path, if
> >> virtual engine creation fails? In which case dangling PTEs would be
> >> written to. Or I am still missing something?
> > 
> > The same vm is used by all contexts (only one is created).
> 
> Right, true. I'll try again because the overall scheme is still 
> confusing me..
> 
> Migrate_vm makes [0, 2 * CHUNK_SZ) point to PTE backing store for the 
> same range.

[2 * CHUNK_SZ, 2 * CHUNK_SZ + 2 * CHUNK_SZ >> 9]

> Src emit_pte writes up to [0, CHUNK_SZ) effectively re-writing PTEs for 
> the same range

Apart from the PD backpointers start at 2 * CHUNK_SZ...

> Dest emit_pte same for the destination range

That starts at 2 * CHUNK_SZ + CHUNK_SZ >> 9

So I passed in the target address range to emit_pte

	len = emit_pte(rq, &it_src, encode, 0, CHUNK_SZ);
	err = emit_pte(rq, &it_dst, encode, CHUNK_SZ, len);

but the first thing emit_pte does is compute the PD offset for that
range:

        offset >>= 12;
        offset *= sizeof(u64);
        offset += 2 * CHUNK_SZ;

> Virtual address [0, 2 * CHUNK_SZ) ptes now point to src and dest objects 
> backing store.

Yes.

> Blit copy emits blitter commands to copy between src and dest.
> 
> Now second copy comes and calls emit_pte which again writes to [0, 
> CHUNK_SZ) virtual range. How does that end up in the PTE backing store 
> and not previous object backing store?

It all boils down to the PD being offset.
-Chris