[PATCH 00/44] Add HMM-based SVM memory manager to KFD v2

Felix Kuehling felix.kuehling at amd.com
Mon Mar 22 16:06:51 UTC 2021


On 2021-03-22 at 10:15 a.m., Daniel Vetter wrote:
> On Mon, Mar 22, 2021 at 06:58:16AM -0400, Felix Kuehling wrote:
>> Since the last patch series I sent on Jan 6 a lot has changed. Patches 1-33
>> are the cleaned-up version from about a week ago, rebased on
>> amd-staging-drm-next 5.11. The remaining 11 patches are current
>> work-in-progress with further cleanup and fixes.
>>
>> MMU notifiers and CPU page faults can now split ranges and update our range
>> data structures without taking heavy locks by doing some of the critical
>> work in a deferred work handler. This includes updating MMU notifiers and
>> the SVM range interval tree. In the meantime, new ranges can live as
>> children of their parent ranges until the deferred work handler consolidates
>> them in the main interval tree.
> I'm totally swamped with Intel stuff unfortunately, so not really time to
> dig in. Can you give me the spoiler on how the (gfx10+ iirc) page fault
> inversion is planned to be handled now? Or is that still TBD?

Navi is still TBD. This patch series focuses on GFXv9 because that's the
IP our data center GPUs are on. The code here has two modes of
operation, one that relies on page faults and one that relies on
preemptions. The latter should work on Navi just fine. So that's our
minimal fallback option.
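
As a rough illustration of the two modes, here is a minimal userspace
sketch. All names in it (svm_mode, svm_pick_mode, svm_on_invalidate) are
hypothetical and do not come from the patch series, and choosing the mode
from whether the GPU can retry faults is an assumption made only for the
example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two modes; not the driver's actual types. */
enum svm_mode { SVM_MODE_FAULT, SVM_MODE_PREEMPT };

/* What a range invalidation is assumed to trigger in each mode. */
enum svm_action {
    SVM_ACT_UNMAP_THEN_FAULT,     /* unmap; a later GPU page fault restores */
    SVM_ACT_PREEMPT_THEN_RESTORE, /* preempt queues; restore before resuming */
};

/* Assumption: fault mode is only usable when the GPU can retry faults. */
static enum svm_mode svm_pick_mode(bool gpu_can_retry_faults)
{
    return gpu_can_retry_faults ? SVM_MODE_FAULT : SVM_MODE_PREEMPT;
}

/* Map a mode to the action taken when a range is invalidated. */
static enum svm_action svm_on_invalidate(enum svm_mode mode)
{
    assert(mode == SVM_MODE_FAULT || mode == SVM_MODE_PREEMPT);
    if (mode == SVM_MODE_FAULT)
        return SVM_ACT_UNMAP_THEN_FAULT;
    return SVM_ACT_PREEMPT_THEN_RESTORE;
}
```

In this model the preemption path never depends on fault handling, which is
why it can serve as the fallback.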


>
> Other thing I noticed is that amdkfd still uses the mmu_notifier directly,
> and not the mmu_interval_notifier. But you're talking a lot about managing
> intervals here, and so I'm wondering whether we shouldn't do this in core
> code? Everyone will have the same painful locking problems here (well atm
> everyone = you&nouveau only I think), sharing this imo would make a ton of
> sense.

We use mmu_interval_notifiers in all the range-based code, including
even our legacy userptr code. The only non-interval notifier that's
still in use in KFD is the one we use for cleanup on process termination.
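
The deferred range splitting described in the cover letter can be modeled
in a few lines of userspace C. This is only a sketch under stated
assumptions: a linked list stands in for the interval tree, all names
(svm_range, svm_tree, range_split_deferred, deferred_consolidate) are
hypothetical, locking is omitted, and children are never split again
before they are published.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model; a linked list stands in for the interval tree. */
struct svm_range {
    unsigned long start, last;   /* inclusive page range */
    struct svm_range *next;      /* link in the main list or child list */
    struct svm_range *children;  /* split-off ranges not yet published */
};

struct svm_tree {
    struct svm_range *head;
};

static struct svm_range *range_new(unsigned long start, unsigned long last)
{
    struct svm_range *r = calloc(1, sizeof(*r));

    assert(r && start <= last);
    r->start = start;
    r->last = last;
    return r;
}

static void tree_insert(struct svm_tree *t, struct svm_range *r)
{
    r->next = t->head;
    t->head = r;
}

/* Notifier/fault path: shrink the parent and park the tail on it as a
 * child, without modifying the main tree. */
static struct svm_range *range_split_deferred(struct svm_range *parent,
                                              unsigned long new_last)
{
    struct svm_range *child;

    assert(new_last >= parent->start && new_last < parent->last);
    child = range_new(new_last + 1, parent->last);
    parent->last = new_last;
    child->next = parent->children;
    parent->children = child;
    return child;
}

/* Lookups see published ranges and children still parked on a parent. */
static struct svm_range *tree_find(struct svm_tree *t, unsigned long addr)
{
    for (struct svm_range *r = t->head; r; r = r->next) {
        if (addr >= r->start && addr <= r->last)
            return r;
        for (struct svm_range *c = r->children; c; c = c->next)
            if (addr >= c->start && addr <= c->last)
                return c;
    }
    return NULL;
}

/* Deferred work handler: move every parked child into the main tree.
 * Returns the number of ranges published. */
static int deferred_consolidate(struct svm_tree *t)
{
    int moved = 0;

    for (struct svm_range *r = t->head; r; r = r->next) {
        while (r->children) {
            struct svm_range *c = r->children;

            r->children = c->next;
            tree_insert(t, c);
            moved++;
        }
    }
    return moved;
}
```

The point of the model is that a lookup returns the same range before and
after consolidation, so the tree update itself can wait for a context
where heavier locks are acceptable.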


>
> I think the other one is moving over more generic pasid code, but I think
> that's going to be less useful here and maybe more a long term project.

Yes, it's unrelated to this work.

Regards,
  Felix


>
> Cheers, Daniel
>
>> We also added proper DMA mapping of system memory pages.
>>
>> Current work in progress is cleaning up all the locking, simplifying our
>> code and data structures and resolving a few known bugs.
>>
>> This series and the corresponding ROCm Thunk and KFDTest changes are also
>> available on GitHub:
>>   https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/tree/fxkamd/hmm-wip
>>   https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/tree/fxkamd/hmm-wip
>>
>> An updated Thunk
>>
>> Alex Sierra (10):
>>   drm/amdgpu: replace per_device_list by array
>>   drm/amdkfd: helper to convert gpu id and idx
>>   drm/amdkfd: add xnack enabled flag to kfd_process
>>   drm/amdkfd: add ioctl to configure and query xnack retries
>>   drm/amdgpu: enable 48-bit IH timestamp counter
>>   drm/amdkfd: SVM API call to restore page tables
>>   drm/amdkfd: add svm_bo reference for eviction fence
>>   drm/amdgpu: add param bit flag to create SVM BOs
>>   drm/amdgpu: svm bo enable_signal call condition
>>   drm/amdgpu: add svm_bo eviction to enable_signal cb
>>
>> Felix Kuehling (22):
>>   drm/amdkfd: map svm range to GPUs
>>   drm/amdkfd: svm range eviction and restore
>>   drm/amdkfd: validate vram svm range from TTM
>>   drm/amdkfd: HMM migrate ram to vram
>>   drm/amdkfd: HMM migrate vram to ram
>>   drm/amdkfd: invalidate tables on page retry fault
>>   drm/amdkfd: page table restore through svm API
>>   drm/amdkfd: add svm_bo eviction mechanism support
>>   drm/amdkfd: refine migration policy with xnack on
>>   drm/amdkfd: add svm range validate timestamp
>>   drm/amdkfd: multiple gpu migrate vram to vram
>>   drm/amdkfd: Fix dma unmapping
>>   drm/amdkfd: Call mutex_destroy
>>   drm/amdkfd: Fix spurious restore failures
>>   drm/amdkfd: Fix svm_bo_list locking in eviction worker
>>   drm/amdkfd: Simplify split_by_granularity
>>   drm/amdkfd: Point out several race conditions
>>   drm/amdkfd: Return pdd from kfd_process_device_from_gduid
>>   drm/amdkfd: Remove broken deferred mapping
>>   drm/amdkfd: Allow invalid pages in migration.src
>>   drm/amdkfd: Correct locking during migration and mapping
>>   drm/amdkfd: Nested locking and invalidation of child ranges
>>
>> Philip Yang (12):
>>   drm/amdkfd: add svm ioctl API
>>   drm/amdkfd: register svm range
>>   drm/amdkfd: add svm ioctl GET_ATTR op
>>   drm/amdgpu: add common HMM get pages function
>>   drm/amdkfd: validate svm range system memory
>>   drm/amdkfd: deregister svm range
>>   drm/amdgpu: export vm update mapping interface
>>   drm/amdkfd: register HMM device private zone
>>   drm/amdkfd: support xgmi same hive mapping
>>   drm/amdkfd: copy memory through gart table
>>   drm/amdgpu: reserve fence slot to update page table
>>   drm/amdkfd: Add SVM API support capability bits
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    4 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
>>  .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    4 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   48 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   11 +
>>  drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |    1 +
>>  drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
>>  drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  173 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c  |    4 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  922 ++++++
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   54 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  191 +-
>>  .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2865 +++++++++++++++++
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  175 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    6 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
>>  include/uapi/linux/kfd_ioctl.h                |  171 +-
>>  26 files changed, 4681 insertions(+), 249 deletions(-)
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>  create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
>>
>> -- 
>> 2.31.0
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel

