[PATCH 00/20] Add KFD GPUVM support for dGPUs v4

Oded Gabbay oded.gabbay at gmail.com
Wed Mar 21 07:52:00 UTC 2018


On Mon, Mar 19, 2018 at 9:05 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
> On 2018-03-19 12:39 PM, Christian König wrote:
>> So coming back to this series once more.
>>
>> Patch #1, #3 are Reviewed-by: Christian König <christian.koenig at amd.com>.
>>
>> Patch #2, #4 - #13 and #18-#19 are Acked-by: Christian König
>> <christian.koenig at amd.com>.
>>
>> Patch #14: What's the difference to setting vramlimit=$size_of_bar ?
>
> The debug_largebar option only affects KFD. Graphics can still use all
> memory.
>
>>
>> Patch #15 & #20: Why is that actually still needed? I thought we have
>> fixed all dependencies and can now use the "standard" way of attaching
>> fences to reservation objects to do this.
>
> Patch 15 adds a KFD-specific MMU notifier, because the graphics MMU
> notifier deals with amdgpu_cs command submission. We need a completely
> different treatment of MMU notifiers for KFD. We need to stop user mode
> queues.
>
> Patch 20 implements the user mode queue preemption mechanism, and the
> corresponding restore function that re-validates userptr BOs before
> restarting user mode queues.
>
> I think you're implying that the graphics MMU notifier would wait for
> the eviction fence and trigger a regular eviction in KFD. I haven't
> tried that. The MMU notifier and userptr eviction mechanism was
> implemented in KFD before we had the TTM evictions. We didn't go back
> and revisit it after that. There are a few major differences that make
> me want to keep the two types of evictions separate, though:
>
>   * TTM evictions are the result of memory pressure (triggered by
>     another process)
>       o Most MMU notifiers are triggered by something in the same
>         process (fork, mprotect, etc.)
>       o Thus the restore delay after MMU notifiers can be much shorter
>   * TTM evictions are done asynchronously in a delayed worker
>       o MMU notifiers are synchronous, queues need to be stopped before
>         returning
>   * KFD's TTM eviction/restore code doesn't handle userptrs
>     (get_user_pages, etc)
>       o MMU notifier restore worker is specialized to handle just userptrs
>
>
>>
>> Saying so I still need to take a closer look at patch #20.
>>
>> Patch #16: Looks good to me in general, but I think it would be safer
>> if we grab a reference to the task structure. Otherwise grabbing pages
>> from a mm_struct sounds a bit scary to me.
>
> You're right. I've never seen it cause problems when testing process
> termination, probably because during process termination KFD cancels the
> delayed worker that calls this function. But I'll fix this and take a
> proper reference.
>
>>
>> Patch #17: I think it would be better to allocate the node when the
>> locks are not held and free it when we find that it isn't used, but no
>> big deal.
>
> OK. I'll change that.
>
> Thanks,
>   Felix
>
>>
>> Regards,
>> Christian.
>>
>> Am 15.03.2018 um 22:27 schrieb Felix Kuehling:
>>> Rebased and integrated review feedback from v3:
>>> * Removed vm->vm_context field
>>> * Use uninterruptible waiting in initial PD validation to avoid
>>> ERESTARTSYS
>>> * Return number of successful map/unmap operations in failure cases
>>> * Facilitate partial retry after failed map/unmap
>>> * Added comments with parameter descriptions to new APIs
>>> * Defined AMDKFD_IOC_FREE_MEMORY_OF_GPU write-only
>>>
>>> This patch series also adds Userptr support in patches 15-20.
>>>
>>> Felix Kuehling (19):
>>>    drm/amdgpu: Move KFD-specific fields into struct amdgpu_vm
>>>    drm/amdgpu: Fix initial validation of PD BO for KFD VMs
>>>    drm/amdgpu: Add helper to turn an existing VM into a compute VM
>>>    drm/amdgpu: Add kfd2kgd interface to acquire an existing VM
>>>    drm/amdkfd: Create KFD VMs on demand
>>>    drm/amdkfd: Remove limit on number of GPUs
>>>    drm/amdkfd: Aperture setup for dGPUs
>>>    drm/amdkfd: Add per-process IDR for buffer handles
>>>    drm/amdkfd: Allocate CWSR trap handler memory for dGPUs
>>>    drm/amdkfd: Add TC flush on VMID deallocation for Hawaii
>>>    drm/amdkfd: Add ioctls for GPUVM memory management
>>>    drm/amdkfd: Kmap event page for dGPUs
>>>    drm/amdkfd: Add module option for testing large-BAR functionality
>>>    drm/amdgpu: Add MMU notifier type for KFD userptr
>>>    drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads
>>>    drm/amdgpu: GFP_NOIO while holding locks taken in MMU notifier
>>>    drm/amdkfd: GFP_NOIO while holding locks taken in MMU notifier
>>>    drm/amdkfd: Add quiesce_mm and resume_mm to kgd2kfd_calls
>>>    drm/amdgpu: Add userptr support for KFD
>>>
>>> Oak Zeng (1):
>>>    drm/amdkfd: Populate DRM render device minor
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h         |  37 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v7.c  |   1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v8.c  |   1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c   | 818
>>> ++++++++++++++++++---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c             |   2 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c             |  96 ++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h             |  11 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c            |  30 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c             |  70 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h             |  10 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c           | 532
>>> ++++++++++++++
>>>   drivers/gpu/drm/amd/amdkfd/kfd_crat.c              |   3 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device.c            |  40 +-
>>>   .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  |  22 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_events.c            |  31 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_flat_memory.c       |  59 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_module.c            |   7 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_cik.c   |   2 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_mqd_manager_vi.c    |   2 +-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_packet_manager.c    |  37 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h              |  41 ++
>>>   drivers/gpu/drm/amd/amdkfd/kfd_process.c           | 314 +++++++-
>>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.c          |   4 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_topology.h          |   1 +
>>>   drivers/gpu/drm/amd/include/kgd_kfd_interface.h    |  10 +
>>>   include/uapi/linux/kfd_ioctl.h                     | 122 ++-
>>>   26 files changed, 2090 insertions(+), 213 deletions(-)
>>>
>>
>

Hi Felix,
I did a quick pass on the patch-set and didn't see anything scary.
Patches 1-14 are already applied to my -next tree. If I will send it
now to Dave I believe we would be ok from schedule POV.
I suggest we delay userptr support for the next kernel release because
you need to address what Christian said and I also want to take a
closer look at it.
What do you think ?

Thanks,
Oded


More information about the amd-gfx mailing list