[Intel-gfx] [RFC 4/4] drm/i915/gt: Pipelined page migration

Mon Nov 30 16:26:44 UTC 2020

On 30/11/2020 16:21, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-11-30 16:07:44)
>>
>> On 30/11/2020 13:39, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2020-11-30 13:12:55)
>>>> On 28/11/2020 18:40, Chris Wilson wrote:
>>>>> If we pipeline the PTE updates and then do the copy of those pages
>>>>> within a single unpreemptible command packet, we can submit the copies
>>>>> and leave them to be scheduled without having to synchronously wait
>>>>> under a global lock.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/Makefile                 |   1 +
>>>>>     drivers/gpu/drm/i915/gt/intel_engine.h        |   1 +
>>>>>     drivers/gpu/drm/i915/gt/intel_migrate.c       | 370 ++++++++++++++++++
>>>>>     drivers/gpu/drm/i915/gt/intel_migrate.h       |  33 ++
>>>>>     drivers/gpu/drm/i915/gt/selftest_migrate.c    | 105 +++++
>>>>>     .../drm/i915/selftests/i915_live_selftests.h  |   1 +
>>>>>     6 files changed, 511 insertions(+)
>>>>>     create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.c
>>>>>     create mode 100644 drivers/gpu/drm/i915/gt/intel_migrate.h
>>>>>     create mode 100644 drivers/gpu/drm/i915/gt/selftest_migrate.c
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
>>>>> index e5574e506a5c..0b2e12c87f9d 100644
>>>>> --- a/drivers/gpu/drm/i915/Makefile
>>>>> +++ b/drivers/gpu/drm/i915/Makefile
>>>>> @@ -103,6 +103,7 @@ gt-y += \
>>>>>         gt/intel_gtt.o \
>>>>>         gt/intel_llc.o \
>>>>>         gt/intel_lrc.o \
>>>>> +     gt/intel_migrate.o \
>>>>>         gt/intel_mocs.o \
>>>>>         gt/intel_ppgtt.o \
>>>>>         gt/intel_rc6.o \
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
>>>>> index ac58fcda4927..079d26b47a97 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
>>>>> @@ -188,6 +188,7 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
>>>>>     #define I915_GEM_HWS_PREEMPT_ADDR   (I915_GEM_HWS_PREEMPT * sizeof(u32))
>>>>>     #define I915_GEM_HWS_SEQNO          0x40
>>>>>     #define I915_GEM_HWS_SEQNO_ADDR             (I915_GEM_HWS_SEQNO * sizeof(u32))
>>>>> +#define I915_GEM_HWS_MIGRATE         (0x42 * sizeof(u32))
>>>>>     #define I915_GEM_HWS_SCRATCH                0x80
>>>>>     
>>>>>     #define I915_HWS_CSB_BUF0_INDEX             0x10
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
>>>>> new file mode 100644
>>>>> index 000000000000..4d7bd32eb8d4
>>>>> --- /dev/null
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
>>>>> @@ -0,0 +1,370 @@
>>>>> +// SPDX-License-Identifier: MIT
>>>>> +/*
>>>>> + * Copyright © 2020 Intel Corporation
>>>>> + */
>>>>> +
>>>>> +#include "i915_drv.h"
>>>>> +#include "intel_context.h"
>>>>> +#include "intel_gt.h"
>>>>> +#include "intel_gtt.h"
>>>>> +#include "intel_lrc.h" /* virtual engine */
>>>>> +#include "intel_migrate.h"
>>>>> +#include "intel_ring.h"
>>>>> +
>>>>> +#define CHUNK_SZ SZ_8M /* ~1ms at 8GiB/s preemption delay */
>>>>> +
>>>>> +static void insert_pte(struct i915_address_space *vm,
>>>>> +                    struct i915_page_table *pt,
>>>>> +                    void *data)
>>>>> +{
>>>>> +     u64 *offset = data;
>>>>> +
>>>>> +     vm->insert_page(vm, px_dma(pt), *offset, I915_CACHE_NONE, 0);
>>>>> +     *offset += PAGE_SIZE;
>>>>> +}
>>>>> +
>>>>> +static struct i915_address_space *migrate_vm(struct intel_gt *gt)
>>>>> +{
>>>>> +     struct i915_vm_pt_stash stash = {};
>>>>> +     struct i915_ppgtt *vm;
>>>>> +     u64 offset, sz;
>>>>> +     int err;
>>>>> +
>>>>> +     vm = i915_ppgtt_create(gt);
>>>>> +     if (IS_ERR(vm))
>>>>> +             return ERR_CAST(vm);
>>>>> +
>>>>> +     if (!vm->vm.allocate_va_range || !vm->vm.foreach) {
>>>>> +             err = -ENODEV;
>>>>> +             goto err_vm;
>>>>> +     }
>>>>> +
>>>>> +     /*
>>>>> +      * We copy in 8MiB chunks. Each PDE covers 2MiB, so we need
>>>>> +      * 4x2 page directories for source/destination.
>>>>> +      */
>>>>> +     sz = 2 * CHUNK_SZ;
>>>>> +     offset = sz;
>>>>> +
>>>>> +     /*
>>>>> +      * We need another page directory setup so that we can write
>>>>> +      * the 8x512 PTE in each chunk.
>>>>> +      */
>>>>> +     sz += (sz >> 12) * sizeof(u64);
>>>>> +
>>>>> +     err = i915_vm_alloc_pt_stash(&vm->vm, &stash, sz);
>>>>> +     if (err)
>>>>> +             goto err_vm;
>>>>> +
>>>>> +     err = i915_vm_pin_pt_stash(&vm->vm, &stash);
>>>>> +     if (err) {
>>>>> +             i915_vm_free_pt_stash(&vm->vm, &stash);
>>>>> +             goto err_vm;
>>>>> +     }
>>>>> +
>>>>> +     vm->vm.allocate_va_range(&vm->vm, &stash, 0, sz);
>>>>> +     i915_vm_free_pt_stash(&vm->vm, &stash);
>>>>> +
>>>>> +     /* Now allow the GPU to rewrite the PTE via its own ppGTT */
>>>>> +     vm->vm.foreach(&vm->vm, 0, sz, insert_pte, &offset);
>>>>
>>>> This is just making the [0 - sz) gva point to the allocated sz bytes of
>>>> backing store?
>>>
>>> Not quite, we are pointing [offset, sz) to the page directories
>>> themselves. When we write into that range, we are modifying the PTE of
>>> this ppGTT. (Breaking the fourth wall, allowing the non-privileged
>>> context to update its own page tables.)
>>
>> Confusing, so first step is here making [0, sz) gva ptes point to pte
>> backing store.
>>
>> Which means emit_pte when called from intel_context_migrate_pages is
>> emitting sdw which will write into [0, sz] gva, effectively placing the
>> src and dest object content at those location.
>>
>> Then blit copies the data.
>>
>> What happens if context is re-used? Gva's no longer point to PTEs so
>> how? Is a step to re-initialize after use missing? More sdw after the
>> blit copy to make it point back to ptes?
> 
> Every chunk starts with writing the PTEs used by the chunk. Within the
> chunk, the GPU commands are not preemptible, so the PTEs cannot be
> replaced before they are consumed.
> 
> We do leave the PTE pointing to stale pages after the blit, but since
> the vm is only used by first rewriting the PTE, those dangling PTE
> should never be chased.
> 
> So each chunk does a fixed copy between the same pair of addresses
> within the vm; the magic is in rewriting the PTE to point at the pages
> of interest for the chunk.

Isn't the vm getting potentially re-used on the fall-back path, if 
virtual engine creation fails? In which case dangling PTEs would be 
written to. Or I am still missing something?

>>>>> +             i915_request_add(rq);
>>>>> +             if (it_s.sg)
>>>>> +                     cond_resched();
>>>>
>>>>    From what context does this run? No preemptible?
>>>
>>> Has to be process context; numerous allocations, implicit waits (that we
>>> want to avoid in practice), and the timeline (per-context) mutex to
>>> guard access to the ringbuffer.
>>
>> I guess on non-preemptible, or not fully preemptible kernel it is useful.
> 
> Oh, the cond_resched() is almost certainly overkill. But potentially a
> very large loop so worth being careful.

Yes, could be a large loop so doesn't harm.

Is the ringbuffer large enough to avoid stalls in that path?

Regards,

Tvrtko