Support for amdgpu VM update via CPU on large-bar systems

Felix Kuehling felix.kuehling at
Fri May 12 19:25:49 UTC 2017

On 17-05-12 04:43 AM, Christian König wrote:
> Am 12.05.2017 um 10:37 schrieb zhoucm1:
>> If SDMA is faster, then even if they wait for it to finish, the
>> total time is shorter than with the CPU, isn't it? Of course, the
>> precondition is that SDMA is exclusive. They could reserve an SDMA
>> engine for PT updates.
> No, if I understood Felix's numbers correctly, the setup and wait time
> for SDMA is a bit (but not much) longer than doing it with the CPU.

I'm skeptical of claims that SDMA is faster. Even when you use SDMA to
write the page table, the CPU still has to do about the same amount of
work writing PTEs into the SDMA IBs. SDMA can only save CPU time in
certain cases:

  * Copying PTEs from GART table if they are on the same GPU (not
    possible on Vega10 due to different MTYPE bits)
  * Generating PTEs for contiguous VRAM BOs

At least for system memory BOs, writing the PTEs directly to
write-combining VRAM should be faster than writing them to cached system
memory IBs first and then kicking off an SDMA transfer and waiting for
it to finish.

> What would really help is to fix the KFD design and work with async
> page tables updates there as well.

That problem goes much higher up the stack than just KFD. It would
affect memory management interfaces in the HSA runtime and HCC.

The basic idea is to make the GPU behave very similar to a CPU and to
have multi-threaded code where some threads run on the CPU and others on
the GPU almost seamlessly. You allocate memory and then you use the same
pointer in your CPU and GPU threads. Exposing the messiness of
asynchronous page table updates all the way up to the application would
destroy that programming model.

In this model, latency matters most. The longer it takes to kick off a
parallel GPU processing job, the less efficient scaling you get from the
GPU's parallel processing capabilities. Exposing asynchronous memory
management up the stack would allow the application to hide the latency
in some cases (if it can do other useful things in the meantime), but
it doesn't make the latency disappear.

An application that wants to hide memory management latency can do this,
even with the existing programming model, by separating memory
management and processing into separate threads.

