[PATCH] drm/amdkfd: Fix dmabuf's redundant eviction when unmapping

Felix Kuehling felix.kuehling at amd.com
Mon Apr 10 19:09:01 UTC 2023


On 2023-04-10 15:03, Eric Huang wrote:
> Hi Felix,
>
> What do you think of my proposal in my previous email? That is, setting the 
> domain to CPU in kfd_mem_dmamap_dmabuf and setting the domain to GTT in 
> kfd_mem_dmaunmap_dmabuf, which would work the same way as userptr does.

No. This is the exact opposite of what it should be doing. Validating a 
BO in the CPU domain means you cannot map it into a GPU virtual address 
space. It would result in invalid PTEs in the GPU page table, and any 
access by an application would result in a page fault.

You say that we do this for userptrs. I think that's a misunderstanding. 
From what I see in the source code, kfd_mem_dmaUNMAP_userptr validates 
the userptr import in the CPU domain, and kfd_mem_dmaMAP_userptr validates 
the userptr import in the GTT domain, just as I would expect.

Regards,
   Felix


>
> Thanks,
> Eric
>
> On 2023-04-10 14:50, Felix Kuehling wrote:
>> Sorry, you're right, there is no AMDGPU_GEM_DOMAIN_PREEMPTIBLE. I 
>> remembered this wrong. There is a flag called 
>> AMDGPU_GEM_CREATE_PREEMPTIBLE, which changes what happens when a BO is 
>> placed in the AMDGPU_GEM_DOMAIN_GTT domain.
>>
>> So my proposal would need to be modified to set the flag 
>> AMDGPU_GEM_CREATE_PREEMPTIBLE in the imported DMABuf BO.
>>
>> On 2023-04-10 14:28, Eric Huang wrote:
>>> Hi Felix,
>>>
>>> Thanks for your review and suggestion, but unfortunately 
>>> AMDGPU_GEM_DOMAIN_PREEMPTIBLE is not defined in amdgpu_drm.h. I 
>>> understand we need the memory eviction in either 
>>> kfd_mem_dmamap_dmabuf() or kfd_mem_dmaunmap_dmabuf() to update the DMA 
>>> address, so I am thinking of doing it the same simple way userptr memory does.
>>>
>>> The purpose of this change: with the non-MES HW scheduler we are using 
>>> userptr/paged memory, but since GFX11 we will be using the MES scheduler, 
>>> which needs the memory to be allocated as GTT/non-paged memory. So 
>>> we want all GPUs to use GTT/non-paged memory, but there is a 
>>> performance drop because of the eviction in kfd_mem_dmaunmap_dmabuf.
>>>
>>> Currently userptr memory is evicted in kfd_mem_dmamap_userptr by 
>>> changing the domain to GTT before calling ttm_bo_validate, and it is 
>>> not evicted in kfd_mem_dmaunmap_userptr, so I think we can do the 
>>> similar thing for GTT/non-paged memory: set the domain to CPU in 
>>> kfd_mem_dmamap_dmabuf, which will evict memory to update the DMA 
>>> address, and set the domain to GTT in kfd_mem_dmaunmap_dmabuf, which 
>>> will not evict memory. The performance should then be the same as 
>>> userptr/paged memory.
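>>>
>>> In code, what I have in mind is roughly this (just a sketch of the 
>>> idea, not the actual patch):
>>>
>>>     /* kfd_mem_dmamap_dmabuf: CPU placement forces the move that
>>>      * refreshes the DMA addresses
>>>      */
>>>     amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU);
>>>     ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>
>>>     /* kfd_mem_dmaunmap_dmabuf: keep the GTT placement so no eviction
>>>      * is triggered
>>>      */
>>>     amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
>>>     ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);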
>>
>> This sounds backwards to me. dmaunmap should move objects to the CPU 
>> domain because the GPU mapping is potentially invalid. And dmamap 
>> must move it to the GTT domain because that updates the DMA addresses 
>> and allows the GPU virtual address mapping to be updated.
>>
>> The problem is the eviction in dmaunmap. Userptrs don't see these 
>> evictions because the SG BOs we use to map them on other GPUs do set 
>> the AMDGPU_GEM_CREATE_PREEMPTIBLE flag. My idea is to do the same 
>> thing for DMABufs that map GTT (and VRAM) BOs to other GPUs.
>>
>> Now that I look at it in more detail, I see we're already doing that 
>> in kfd_mem_attach_dmabuf:
>>
>>         *bo = gem_to_amdgpu_bo(gobj);
>>         (*bo)->flags |= AMDGPU_GEM_CREATE_PREEMPTIBLE;
>>
>> So then the question is, why is this not working? I think the answer is 
>> the second part of my proposal, which is still needed:
>>
>>> 2. Add a special case in the above if-block for old_mem->mem_type ==
>>>    AMDGPU_PL_PREEMPT: use amdgpu_bo_sync_wait with
>>>    owner=AMDGPU_FENCE_OWNER_KFD so that it doesn't wait for eviction 
>>> fences 
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>> Eric
>>>
>>> On 2023-04-04 16:40, Felix Kuehling wrote:
>>>> [+Christian]
>>>>
>>>> OK, this comes from the ttm_bo_wait_ctx call in this section of 
>>>> amdgpu_bo_move:
>>>>
>>>>         if ((old_mem->mem_type == TTM_PL_TT ||
>>>>              old_mem->mem_type == AMDGPU_PL_PREEMPT) &&
>>>>              new_mem->mem_type == TTM_PL_SYSTEM) {
>>>>                 r = ttm_bo_wait_ctx(bo, ctx);
>>>>                 if (r)
>>>>                         return r;
>>>>
>>>>                 amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
>>>>                 ttm_resource_free(bo, &bo->resource);
>>>>                 ttm_bo_assign_mem(bo, new_mem);
>>>>                 goto out;
>>>>         }
>>>>
>>>> We can't just remove this wait. It's not even specific to KFD or 
>>>> DMABuf imports. We also can't just change it to avoid waiting for 
>>>> eviction fences because it's also used for GTT BOs (e.g. before a 
>>>> BO gets swapped under extreme memory pressure). So we also need to 
trigger the eviction fence in the general case.
>>>>
>>>> In the specific case of DMABuf imports, they share the reservation 
>>>> object with the original BO. So waiting on the reservation triggers 
>>>> the eviction fence on the original BO. I think we want to avoid 
>>>> waiting on eviction fences for all BOs where the underlying memory 
>>>> is managed by some other BO, and at the same time also avoid ever 
>>>> evicting the DMABuf import BO. That's what AMDGPU_PL_PREEMPT is 
>>>> for. So I think a combination of two changes should do the trick:
>>>>
>>>> 1. Change kfd_mem_dmamap_dmabuf to use AMDGPU_GEM_DOMAIN_PREEMPTIBLE
>>>> 2. Add a special case in the above if-block for old_mem->mem_type ==
>>>>    AMDGPU_PL_PREEMPT: use amdgpu_bo_sync_wait with
>>>>    owner=AMDGPU_FENCE_OWNER_KFD so that it doesn't wait for 
>>>> eviction fences
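>>>>
>>>> Roughly, the special case could look something like this (an untested 
>>>> sketch of the idea, not a finished patch; assuming abo = 
>>>> ttm_to_amdgpu_bo(bo) and ctx are available at that point):
>>>>
>>>>         if ((old_mem->mem_type == TTM_PL_TT ||
>>>>              old_mem->mem_type == AMDGPU_PL_PREEMPT) &&
>>>>              new_mem->mem_type == TTM_PL_SYSTEM) {
>>>>                 if (old_mem->mem_type == AMDGPU_PL_PREEMPT) {
>>>>                         /* Preemptible BOs share the reservation with
>>>>                          * the original BO, so don't wait for KFD
>>>>                          * eviction fences here */
>>>>                         r = amdgpu_bo_sync_wait(abo,
>>>>                                                 AMDGPU_FENCE_OWNER_KFD,
>>>>                                                 ctx->interruptible);
>>>>                 } else {
>>>>                         r = ttm_bo_wait_ctx(bo, ctx);
>>>>                 }
>>>>                 if (r)
>>>>                         return r;
>>>>
>>>>                 amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
>>>>                 ttm_resource_free(bo, &bo->resource);
>>>>                 ttm_bo_assign_mem(bo, new_mem);
>>>>                 goto out;
>>>>         }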
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>>
>>>> Am 2023-04-04 um 10:36 schrieb Eric Huang:
>>>>> Here is the backtrace from Jira:
>>>>>
>>>>> [Thu Nov 10 13:10:23 2022] Scheduling eviction of pid 97784 in 0 
>>>>> jiffies
>>>>> [Thu Nov 10 13:10:23 2022] WARNING: CPU: 173 PID: 97784 at 
>>>>> /var/lib/dkms/amdgpu/5.16.9.22.20-1438746~20.04/build/amd/amdgpu/../amdkfd/kfd_device.c:878 
>>>>> kgd2kfd_schedule_evict_and_restore_process+0x104/0x120 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022] Modules linked in: veth amdgpu(OE) 
>>>>> amddrm_ttm_helper(OE) amdttm(OE) iommu_v2 amd_sched(OE) amdkcl(OE) 
>>>>> xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink 
>>>>> xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat 
>>>>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter 
>>>>> bridge stp llc aufs overlay binfmt_misc nls_iso8859_1 dm_multipath 
>>>>> scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr 
>>>>> intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm efi_pstore 
>>>>> rapl ipmi_ssif ccp acpi_ipmi k10temp ipmi_si ipmi_devintf 
>>>>> ipmi_msghandler mac_hid sch_fq_codel msr ip_tables x_tables 
>>>>> autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 
>>>>> async_raid6_recov async_memcpy async_pq async_xor async_tx xor 
>>>>> raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs 
>>>>> ib_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
>>>>> aesni_intel crypto_simd cryptd ast drm_vram_helper drm_ttm_helper 
>>>>> ttm mlx5_core drm_kms_helper syscopyarea sysfillrect sysimgblt 
>>>>> fb_sys_fops
>>>>> [Thu Nov 10 13:10:23 2022]  pci_hyperv_intf cec psample igb mlxfw 
>>>>> rc_core dca ahci xhci_pci tls drm i2c_algo_bit libahci 
>>>>> xhci_pci_renesas i2c_piix4
>>>>> [Thu Nov 10 13:10:23 2022] CPU: 173 PID: 97784 Comm: 
>>>>> onnxruntime_tes Tainted: G        W  OE 5.13.0-30-generic 
>>>>> #33~20.04.1-Ubuntu
>>>>> [Thu Nov 10 13:10:23 2022] Hardware name: GIGABYTE 
>>>>> G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
>>>>> [Thu Nov 10 13:10:23 2022] RIP: 
>>>>> 0010:kgd2kfd_schedule_evict_and_restore_process+0x104/0x120 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022] Code: 5e 5d c3 4c 89 e7 e8 cb c6 44 df 
>>>>> eb e7 49 8b 45 60 48 89 ca 48 c7 c7 38 8b d7 c1 48 89 4d e0 8b b0 
>>>>> 20 09 00 00 e8 87 ee 7e df <0f> 0b 48 8b 4d e0 eb 9f 41 be ea ff 
>>>>> ff ff eb ba 41 be ed ff ff ff
>>>>> [Thu Nov 10 13:10:23 2022] RSP: 0018:ffffb25f2a173978 EFLAGS: 
>>>>> 00010086
>>>>> [Thu Nov 10 13:10:23 2022] RAX: 0000000000000000 RBX: 
>>>>> 0000000000000001 RCX: 0000000000000027
>>>>> [Thu Nov 10 13:10:23 2022] RDX: 0000000000000027 RSI: 
>>>>> 00000000fffeffff RDI: ffff95d06e4a09c8
>>>>> [Thu Nov 10 13:10:23 2022] RBP: ffffb25f2a173998 R08: 
>>>>> ffff95d06e4a09c0 R09: ffffb25f2a173750
>>>>> [Thu Nov 10 13:10:23 2022] R10: 0000000000000001 R11: 
>>>>> 0000000000000001 R12: ffff95c371d74580
>>>>> [Thu Nov 10 13:10:23 2022] R13: ffff95b1cd3f2000 R14: 
>>>>> 0000000000000000 R15: ffff95c371d74580
>>>>> [Thu Nov 10 13:10:23 2022] FS:  00007fcaff268b00(0000) 
>>>>> GS:ffff95d06e480000(0000) knlGS:0000000000000000
>>>>> [Thu Nov 10 13:10:23 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 
>>>>> 0000000080050033
>>>>> [Thu Nov 10 13:10:23 2022] CR2: 00007fc643980000 CR3: 
>>>>> 00000003e9492000 CR4: 0000000000350ee0
>>>>> [Thu Nov 10 13:10:23 2022] Call Trace:
>>>>> [Thu Nov 10 13:10:23 2022]  <TASK>
>>>>> [Thu Nov 10 13:10:23 2022] 
>>>>>  amdkfd_fence_enable_signaling+0x46/0x50 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  __dma_fence_enable_signaling+0x52/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  dma_fence_default_wait+0xa9/0x200
>>>>> [Thu Nov 10 13:10:23 2022]  dma_fence_wait_timeout+0xbd/0xe0
>>>>> [Thu Nov 10 13:10:23 2022]  amddma_resv_wait_timeout+0x6f/0xd0 
>>>>> [amdkcl]
>>>>> [Thu Nov 10 13:10:23 2022]  amdttm_bo_wait+0x39/0x50 [amdttm]
>>>>> [Thu Nov 10 13:10:23 2022]  amdgpu_bo_move+0x41e/0x7b0 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  ? down_write+0x13/0x50
>>>>> [Thu Nov 10 13:10:23 2022]  ? unmap_mapping_pages+0x68/0x130
>>>>> [Thu Nov 10 13:10:23 2022]  ttm_bo_handle_move_mem+0x7f/0x120 
>>>>> [amdttm]
>>>>> [Thu Nov 10 13:10:23 2022]  amdttm_bo_validate+0xbf/0x100 [amdttm]
>>>>> [Thu Nov 10 13:10:23 2022] 
>>>>>  kfd_mem_dmaunmap_attachment+0x131/0x140 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  unmap_bo_from_gpuvm+0x67/0x80 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022] 
>>>>>  amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x114/0x220 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  ? __mod_memcg_lruvec_state+0x22/0xe0
>>>>> [Thu Nov 10 13:10:23 2022] 
>>>>>  kfd_ioctl_unmap_memory_from_gpu+0xe8/0x270 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  kfd_ioctl+0x23c/0x590 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  ? 
>>>>> kfd_ioctl_get_process_apertures_new+0x330/0x330 [amdgpu]
>>>>> [Thu Nov 10 13:10:23 2022]  ? exit_to_user_mode_prepare+0x3d/0x1c0
>>>>> [Thu Nov 10 13:10:23 2022]  ? __fget_files+0xa7/0xd0
>>>>> [Thu Nov 10 13:10:23 2022]  __x64_sys_ioctl+0x91/0xc0
>>>>> [Thu Nov 10 13:10:23 2022]  do_syscall_64+0x61/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  ? do_syscall_64+0x6e/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  ? do_syscall_64+0x6e/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  ? do_syscall_64+0x6e/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  ? do_syscall_64+0x6e/0xb0
>>>>> [Thu Nov 10 13:10:23 2022]  ? 
>>>>> asm_sysvec_apic_timer_interrupt+0xa/0x20
>>>>> [Thu Nov 10 13:10:23 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [Thu Nov 10 13:10:23 2022] RIP: 0033:0x7fcaff57b3ab
>>>>> [Thu Nov 10 13:10:23 2022] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 
>>>>> c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 
>>>>> 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 
>>>>> b5 7a 0d 00 f7 d8 64 89 01 48
>>>>> [Thu Nov 10 13:10:23 2022] RSP: 002b:00007fffe41e0098 EFLAGS: 
>>>>> 00000206 ORIG_RAX: 0000000000000010
>>>>> [Thu Nov 10 13:10:23 2022] RAX: ffffffffffffffda RBX: 
>>>>> 00007fcacc7f7f80 RCX: 00007fcaff57b3ab
>>>>> [Thu Nov 10 13:10:23 2022] RDX: 00007fffe41e0120 RSI: 
>>>>> 00000000c0184b19 RDI: 0000000000000003
>>>>> [Thu Nov 10 13:10:23 2022] RBP: 00007fffe41e00d0 R08: 
>>>>> 0000562e2d5730d0 R09: 0000000000000000
>>>>> [Thu Nov 10 13:10:23 2022] R10: 0000562e2c928ec0 R11: 
>>>>> 0000000000000206 R12: 0000000000000001
>>>>> [Thu Nov 10 13:10:23 2022] R13: 00007fffe41e04b0 R14: 
>>>>> 0000000000000000 R15: 0000562e2d3f5b20
>>>>> [Thu Nov 10 13:10:23 2022]  </TASK>
>>>>> [Thu Nov 10 13:10:23 2022] ---[ end trace 1464f08f6be60b30 ]---
>>>>>
>>>>> Regards,
>>>>> Eric
>>>>>
>>>>> On 2023-04-04 10:11, Felix Kuehling wrote:
>>>>>> If we keep the BO in the GTT domain, it means it will not be 
>>>>>> updated if we validate it again later in kfd_mem_dmamap_dmabuf. 
>>>>>> This means we'll use stale DMA addresses when we update the page 
>>>>>> tables after evictions.
>>>>>>
>>>>>> I think we'll need to find a different way to avoid triggering 
>>>>>> the eviction fence on the original BO when changing the placement 
>>>>>> of the DMABuf import here. If you need help brainstorming here, 
>>>>>> please share a backtrace from the eviction generated with the 
>>>>>> debug_evictions module param.
>>>>>>
>>>>>> Regards,
>>>>>>   Felix
>>>>>>
>>>>>>
>>>>>> Am 2023-04-03 um 13:59 schrieb Eric Huang:
>>>>>>> dmabuf is allocated/mapped in the GTT domain. When dma-unmapping the
>>>>>>> dmabuf, changing the placement to CPU will trigger a memory eviction
>>>>>>> after calling ttm_bo_validate, and the eviction will cause a
>>>>>>> performance drop. Keeping the correct domain will solve the issue.
>>>>>>>
>>>>>>> Signed-off-by: Eric Huang <jinhuieric.huang at amd.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
>>>>>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>>> index a3b09edfd1bf..17b708acb447 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>>>> @@ -642,7 +642,7 @@ kfd_mem_dmaunmap_dmabuf(struct 
>>>>>>> kfd_mem_attachment *attachment)
>>>>>>>       struct ttm_operation_ctx ctx = {.interruptible = true};
>>>>>>>       struct amdgpu_bo *bo = attachment->bo_va->base.bo;
>>>>>>>   -    amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU);
>>>>>>> +    amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
>>>>>>>       ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>>>>>>>   }
>>>>>
>>>
>

