[PATCH 00/35] Add HMM-based SVM memory manager to KFD
Felix Kuehling
felix.kuehling at amd.com
Thu Jan 14 00:06:54 UTC 2021
On 2021-01-13 at 11:47 a.m., Jerome Glisse wrote:
> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
>> This is the first version of our HMM based shared virtual memory manager
>> for KFD. There are still a number of known issues that we're working through
>> (see below). This will likely lead to some pretty significant changes in
>> MMU notifier handling and locking on the migration code paths. So don't
>> get hung up on those details yet.
> [...]
>
>> Known issues:
>> * won't work with IOMMU enabled, we need to dma_map all pages properly
>> * still working on some race conditions and random bugs
>> * performance is not great yet
> What would those changes look like? Seeing the issues below I do not
> see how they interplay with MMU notifiers. Can you elaborate?
We currently have some race conditions when multiple threads trigger
migrations concurrently (e.g. CPU page faults, GPU page faults, memory
evictions, and explicit prefetch by the application).
In the current patch series we set up one MMU range notifier for the
entire address space because we had trouble setting up MMU notifiers for
specific address ranges. There are situations where we want to free,
resize, or reallocate MMU range notifiers, but we can't due to the
locking context we're in:
* MMU release notifier when a virtual address range is unmapped
* CPU page fault handler
In both these situations we may need to split virtual address ranges
because we only want to free or migrate part of a range. With
per-address-range notifiers we would also need to free or create
notifiers, which is not possible in those contexts. On the other hand,
using a single range notifier for everything causes unnecessary
serialization.
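For illustration, a per-range notifier built on the kernel's
mmu_interval_notifier API could look roughly like the sketch below. The
struct svm_range layout and the svm_range_* names are hypothetical
placeholders, not the code from this patch series:

#include <linux/mmu_notifier.h>

struct svm_range {
	struct mmu_interval_notifier notifier;
	unsigned long start;	/* byte address of range start */
	unsigned long length;	/* range length in bytes */
	/* ... migration state, DMA mappings, etc. ... */
};

static bool svm_range_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	/* We cannot remove or insert notifiers here, only invalidate. */
	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Real code would hold the driver lock used by the retry loop. */
	mmu_interval_set_seq(mni, cur_seq);
	/* ... unmap GPU mappings covering [range->start, range->end) ... */
	return true;
}

static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
	.invalidate = svm_range_invalidate,
};

/* May sleep, so it must not be called from the release notifier or the
 * CPU page fault handler -- the restriction described above. */
static int svm_range_register(struct svm_range *prange, struct mm_struct *mm)
{
	return mmu_interval_notifier_insert(&prange->notifier, mm,
					    prange->start, prange->length,
					    &svm_range_mn_ops);
}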
We're reworking all of this to have per-address range notifiers that are
updated with a deferred mechanism in workers. I finally figured out how
to do that in a clean way, hopefully without races or deadlocks, which
should also address the other race conditions we had with concurrent
migration triggers. Philip is working on the implementation.
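Conceptually, the deferred mechanism could look something like the
following sketch: the restricted contexts only queue the affected range
on a list, and a worker later performs the sleeping notifier
remove/split/insert. Again, all names here (svm_range_list, the
deferred_* members) are made up for illustration; the actual
implementation may be structured differently:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/mmu_notifier.h>

struct svm_range {
	struct mmu_interval_notifier notifier;
	struct list_head deferred_link;	/* entry on svms->deferred_list */
	/* ... as in the sketch above ... */
};

struct svm_range_list {
	struct list_head deferred_list;
	spinlock_t deferred_lock;
	struct work_struct deferred_work;
};

/* Safe from the restricted contexts (release notifier, CPU page fault
 * handler): just queue the range, never touch the notifier here. */
static void svm_range_defer_update(struct svm_range_list *svms,
				   struct svm_range *prange)
{
	spin_lock(&svms->deferred_lock);
	list_add_tail(&prange->deferred_link, &svms->deferred_list);
	spin_unlock(&svms->deferred_lock);
	schedule_work(&svms->deferred_work);
}

/* Worker context: sleeping is allowed, so notifiers can be removed,
 * ranges split or resized, and new notifiers inserted. */
static void svm_range_deferred_work(struct work_struct *work)
{
	struct svm_range_list *svms =
		container_of(work, struct svm_range_list, deferred_work);
	struct svm_range *prange;

	spin_lock(&svms->deferred_lock);
	while (!list_empty(&svms->deferred_list)) {
		prange = list_first_entry(&svms->deferred_list,
					  struct svm_range, deferred_link);
		list_del(&prange->deferred_link);
		spin_unlock(&svms->deferred_lock);

		mmu_interval_notifier_remove(&prange->notifier);
		/* ... split/migrate/free, then insert new notifiers ... */

		spin_lock(&svms->deferred_lock);
	}
	spin_unlock(&svms->deferred_lock);
}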
Regards,
Felix
>
> Cheers,
> Jérôme
>