[PATCH 5/8] drm/amdgpu: use the new cursor in amdgpu_ttm_access_memory

Pan, Xinhui Xinhui.Pan at amd.com
Mon Mar 22 00:47:53 UTC 2021


[AMD Official Use Only - Internal Distribution Only]

No, the patch from Nirmoy did not fully fix this issue. I will send another fix patch later.


-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Christian König
Sent: Saturday, March 20, 2021 5:08 PM
To: Kuehling, Felix <Felix.Kuehling at amd.com>; Paneer Selvam, Arunpravin <Arunpravin.PaneerSelvam at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 5/8] drm/amdgpu: use the new cursor in amdgpu_ttm_access_memory

Yeah, Nirmoy already stumbled over the correct fix.

The size check is off by one. A patch fixing it should be pushed on Monday.

Regards,
Christian.

Am 19.03.21 um 22:23 schrieb Felix Kuehling:
> This is causing a deadlock in amdgpu_ttm_access_memory during the
> PtraceAccess test in kfdtest. Unfortunately it doesn't get flagged by
> LOCKDEP. See the kernel log snippet below. I don't have a good
> explanation of what's going on, other than maybe some data structure
> corruption.
>
> With this patch reverted the PtraceAccess test still fails, but it
> doesn't hang any more. If I also revert "use new cursor in
> amdgpu_ttm_io_mem_pfn" (which is used via amdgpu_find_mm_node in
> amdgpu_ttm_access_memory), Ptrace access starts working correctly.
> That tells me that there is some fundamental bug in the resource
> cursor implementation that's breaking several users.
>
> Regards,
>   Felix
>
>
> [  129.446085] watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [kfdtest:3588]
> [  129.455379] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> [  129.455428] irq event stamp: 75294000
> [  129.455432] hardirqs last  enabled at (75293999): [<ffffffffa9dcfd7d>] _raw_spin_unlock_irqrestore+0x2d/0x40
> [  129.455447] hardirqs last disabled at (75294000): [<ffffffffa9dbef8a>] sysvec_apic_timer_interrupt+0xa/0xa0
> [  129.455457] softirqs last  enabled at (75184000): [<ffffffffaa000306>] __do_softirq+0x306/0x429
> [  129.455467] softirqs last disabled at (75183995): [<ffffffffa9e00f8f>] asm_call_irq_on_stack+0xf/0x20
> [  129.455477] CPU: 8 PID: 3588 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #194
> [  129.455485] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> [  129.455490] RIP: 0010:_raw_spin_lock_irqsave+0xb/0x50
> [  129.455498] Code: d2 31 f6 e8 e7 e9 31 ff 48 89 df 58 5b e9 7d 32 32 ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 fd 53 9c <5b> fa f6 c7 02 74 05 e8 59 06 3e ff 65 ff 05 92 73 24 56 48 8d 7d
> [  129.455505] RSP: 0018:ffffa3eb407f3c58 EFLAGS: 00000246
> [  129.455513] RAX: ffff96466e2010a0 RBX: ffff96466e200000 RCX: 0000000000000000
> [  129.455519] RDX: ffffa3eb407f3e70 RSI: 00000000019ffff0 RDI: ffff96466e2010a0
> [  129.455524] RBP: ffff96466e2010a0 R08: 0000000000000000 R09: 0000000000000001
> [  129.455528] R10: ffffa3eb407f3c60 R11: ffff96466e2010b8 R12: 00000000019ffff0
> [  129.455533] R13: 00000000019ffff0 R14: ffffa3eb407f3e70 R15: 00000000019ffff0
> [  129.455538] FS:  00007f5aad0f0740(0000) GS:ffff96467fc00000(0000) knlGS:0000000000000000
> [  129.455544] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  129.455549] CR2: 0000563ea76ad0f0 CR3: 00000007c6e92005 CR4: 00000000001706e0
> [  129.455554] Call Trace:
> [  129.455563]  amdgpu_device_vram_access+0xc1/0x200 [amdgpu]
> [  129.455820]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
> [  129.455834]  amdgpu_ttm_access_memory+0x29e/0x320 [amdgpu]
> [  129.456063]  ttm_bo_vm_access+0x1c8/0x3a0 [ttm]
> [  129.456089]  __access_remote_vm+0x289/0x390
> [  129.456112]  ptrace_access_vm+0x98/0xc0
> [  129.456127]  generic_ptrace_peekdata+0x31/0x80
> [  129.456138]  ptrace_request+0x13b/0x5d0
> [  129.456155]  arch_ptrace+0x24f/0x2f0
> [  129.456165]  __x64_sys_ptrace+0xc9/0x140
> [  129.456177]  do_syscall_64+0x2d/0x40
> [  129.456185]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  129.456194] RIP: 0033:0x7f5aab861a3f
> [  129.456199] Code: 48 89 44 24 18 48 8d 44 24 30 c7 44 24 10 18 00 00 00 8b 70 08 48 8b 50 10 48 89 44 24 20 4c 0f 43 50 18 b8 65 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 48 85 c0 78 06 41 83 f8 02 76 1e 48 8b 4c
> [  129.456205] RSP: 002b:00007ffd27b68750 EFLAGS: 00000293 ORIG_RAX: 0000000000000065
> [  129.456214] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5aab861a3f
> [  129.456219] RDX: 00007f5aab3ffff0 RSI: 0000000000000dfa RDI: 0000000000000002
> [  129.456224] RBP: 00007ffd27b68870 R08: 0000000000000001 R09: 0000000000000000
> [  129.456228] R10: 00007ffd27b68758 R11: 0000000000000293 R12: 0000563ea764e2aa
> [  129.456233] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>
> On 2021-03-08 8:40 a.m., Christian König wrote:
>> Separate the drm_mm_node walking from the actual handling.
>>
>> Signed-off-by: Christian König <christian.koenig at amd.com>
>> Acked-by: Oak Zeng <Oak.Zeng at amd.com>
>> Tested-by: Nirmoy Das <nirmoy.das at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 67 +++++++------------------
>>   1 file changed, 18 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 517611b709fa..2cbe4ace591f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -178,26 +178,6 @@ static int amdgpu_verify_access(struct ttm_buffer_object *bo, struct file *filp)
>>                          filp->private_data);
>>   }
>>
>> -/**
>> - * amdgpu_find_mm_node - Helper function finds the drm_mm_node corresponding to
>> - * @offset. It also modifies the offset to be within the drm_mm_node returned
>> - *
>> - * @mem: The region where the bo resides.
>> - * @offset: The offset that drm_mm_node is used for finding.
>> - *
>> - */
>> -static struct drm_mm_node *amdgpu_find_mm_node(struct ttm_resource *mem,
>> -                           uint64_t *offset)
>> -{
>> -    struct drm_mm_node *mm_node = mem->mm_node;
>> -
>> -    while (*offset >= (mm_node->size << PAGE_SHIFT)) {
>> -        *offset -= (mm_node->size << PAGE_SHIFT);
>> -        ++mm_node;
>> -    }
>> -    return mm_node;
>> -}
>> -
>>   /**
>>    * amdgpu_ttm_map_buffer - Map memory into the GART windows
>>    * @bo: buffer object to map
>> @@ -1478,41 +1458,36 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo,
>>    * access for debugging purposes.
>>    */
>>   static int amdgpu_ttm_access_memory(struct ttm_buffer_object *bo,
>> -                    unsigned long offset,
>> -                    void *buf, int len, int write)
>> +                    unsigned long offset, void *buf, int len,
>> +                    int write)
>>   {
>>       struct amdgpu_bo *abo = ttm_to_amdgpu_bo(bo);
>>       struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
>> -    struct drm_mm_node *nodes;
>> +    struct amdgpu_res_cursor cursor;
>> +    unsigned long flags;
>>       uint32_t value = 0;
>>       int ret = 0;
>> -    uint64_t pos;
>> -    unsigned long flags;
>>
>>       if (bo->mem.mem_type != TTM_PL_VRAM)
>>           return -EIO;
>>
>> -    pos = offset;
>> -    nodes = amdgpu_find_mm_node(&abo->tbo.mem, &pos);
>> -    pos += (nodes->start << PAGE_SHIFT);
>> -
>> -    while (len && pos < adev->gmc.mc_vram_size) {
>> -        uint64_t aligned_pos = pos & ~(uint64_t)3;
>> -        uint64_t bytes = 4 - (pos & 3);
>> -        uint32_t shift = (pos & 3) * 8;
>> +    amdgpu_res_first(&bo->mem, offset, len, &cursor);
>> +    while (cursor.remaining) {
>> +        uint64_t aligned_pos = cursor.start & ~(uint64_t)3;
>> +        uint64_t bytes = 4 - (cursor.start & 3);
>> +        uint32_t shift = (cursor.start & 3) * 8;
>>           uint32_t mask = 0xffffffff << shift;
>>
>> -        if (len < bytes) {
>> -            mask &= 0xffffffff >> (bytes - len) * 8;
>> -            bytes = len;
>> +        if (cursor.size < bytes) {
>> +            mask &= 0xffffffff >> (bytes - cursor.size) * 8;
>> +            bytes = cursor.size;
>>           }
>>
>>           if (mask != 0xffffffff) {
>>               spin_lock_irqsave(&adev->mmio_idx_lock, flags);
>>               WREG32_NO_KIQ(mmMM_INDEX, ((uint32_t)aligned_pos) | 0x80000000);
>>               WREG32_NO_KIQ(mmMM_INDEX_HI, aligned_pos >> 31);
>> -            if (!write || mask != 0xffffffff)
>> -                value = RREG32_NO_KIQ(mmMM_DATA);
>> +            value = RREG32_NO_KIQ(mmMM_DATA);
>>               if (write) {
>>                   value &= ~mask;
>>                   value |= (*(uint32_t *)buf << shift) & mask;
>> @@ -1524,21 +1499,15 @@ static int amdgpu_ttm_access_memory(struct ttm_buffer_object *bo,
>>                   memcpy(buf, &value, bytes);
>>               }
>>           } else {
>> -            bytes = (nodes->start + nodes->size) << PAGE_SHIFT;
>> -            bytes = min(bytes - pos, (uint64_t)len & ~0x3ull);
>> -
>> -            amdgpu_device_vram_access(adev, pos, (uint32_t *)buf,
>> -                          bytes, write);
>> +            bytes = cursor.size & 0x3ull;
>> +            amdgpu_device_vram_access(adev, cursor.start,
>> +                          (uint32_t *)buf, bytes,
>> +                          write);
>>           }
>>
>>           ret += bytes;
>>           buf = (uint8_t *)buf + bytes;
>> -        pos += bytes;
>> -        len -= bytes;
>> -        if (pos >= (nodes->start + nodes->size) << PAGE_SHIFT) {
>> -            ++nodes;
>> -            pos = (nodes->start << PAGE_SHIFT);
>> -        }
>> +        amdgpu_res_next(&cursor, bytes);
>>       }
>>
>>       return ret;

_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

