[PATCH 2/4] drm/amdgpu: allow higher level PD invalidations

Christian König ckoenig.leichtzumerken at gmail.com
Mon Jan 27 19:14:53 UTC 2020


Oh, thanks for that hint! I've been staring at the code for a day now 
without a clue what's going wrong.

I hadn't realized that something could have been sign extended!
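
For reference, here is a minimal sketch of the failure mode that seems 
likely: a negative 32-bit intermediate gets sign extended when it is 
widened to 64 bits, smearing ones into the bits gmc_v8 treats as 
reserved. The variable names and values below are illustrative only, 
not the actual driver code; the mask matches the movabs visible in the 
Code: line of the oops below.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* A negative 32-bit intermediate is sign extended when it
             * is widened to 64 bits, setting all of the upper bits. */
            int32_t offset = (int32_t)0xff600000;       /* bit 31 set */
            uint64_t addr = (uint64_t)(int64_t)offset;

            printf("addr = %016llx\n", (unsigned long long)addr);

            /* gmc_v8_0_get_vm_pde() BUG()s on any address with bits
             * 40-63 or 0-11 set: */
            if (addr & 0xFFFFFF0000000FFFULL)
                    printf("this address would trigger the BUG()\n");
            return 0;
    }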

I'll probably figure it out by tomorrow now,
Christian.

On 27.01.20 at 18:37, Tom St Denis wrote:
> The offending PDE address is "3ffffeff600000", which looks like it was
> sign extended into the "reserved" section between bits 40:58 (0x3fff),
> hence triggering the BUG() in the gmc_v8 driver.
>
> Tom
>
> On 2020-01-27 9:57 a.m., Tom St Denis wrote:
>> Reverting this patch avoids the gmc_v8 errors (I previously sent 
>> kernel logs, here's one for convenience):
>>
>> [  358.554335] ------------[ cut here ]------------
>> [  358.554338] kernel BUG at drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c:725!
>> [  358.554351] invalid opcode: 0000 [#1] SMP NOPTI
>> [  358.554354] CPU: 0 PID: 4551 Comm: Xwayland Not tainted 5.4.0-rc7+ #14
>> [  358.554355] Hardware name: System manufacturer System Product Name/TUF B350M-PLUS GAMING, BIOS 5220 09/12/2019
>> [  358.554452] RIP: 0010:gmc_v8_0_get_vm_pde+0x10/0x20 [amdgpu]
>> [  358.554455] Code: 31 f6 48 89 df e8 30 e9 ff ff e9 28 ff ff ff e8 16 d6 21 f9 66 0f 1f 44 00 00 48 b8 ff 0f 00 00 00 ff ff ff 48 85 02 75 01 c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 fd e8 c7
>> [  358.554456] RSP: 0018:ffffa28142287a00 EFLAGS: 00010206
>> [  358.554458] RAX: ffffff0000000fff RBX: 0000000000000000 RCX: ffffa28142287a78
>> [  358.554459] RDX: ffffa28142287a50 RSI: 0000000000000002 RDI: ffff8b9be15e0000
>> [  358.554460] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
>> [  358.554461] R10: 000000000000000f R11: 0000000000000406 R12: 0000000000002030
>> [  358.554462] R13: 003ffffefea00000 R14: 0000000000101c00 R15: ffffa28142287af0
>> [  358.554464] FS:  00007f180a48ba80(0000) GS:ffff8b9be6c00000(0000) knlGS:0000000000000000
>> [  358.554465] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  358.554466] CR2: 00007f3de8f5dcc0 CR3: 00000002170c8000 CR4: 00000000001406f0
>> [  358.554467] Call Trace:
>> [  358.554502]  amdgpu_vm_update_ptes+0x28a/0x7f0 [amdgpu]
>> [  358.554534]  ? amdgpu_sync_resv+0x34/0x190 [amdgpu]
>> [  358.554565]  amdgpu_vm_bo_update_mapping+0x12b/0x160 [amdgpu]
>> [  358.554596]  amdgpu_vm_bo_update+0x333/0x6a0 [amdgpu]
>> [  358.554626]  amdgpu_gem_va_ioctl+0x3c1/0x3e0 [amdgpu]
>> [  358.554658]  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
>> [  358.554663]  drm_ioctl_kernel+0xa5/0xf0
>> [  358.554665]  drm_ioctl+0x1df/0x366
>> [  358.554695]  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
>> [  358.554698]  ? __switch_to_asm+0x34/0x70
>> [  358.554699]  ? __switch_to_asm+0x40/0x70
>> [  358.554701]  ? __switch_to_asm+0x34/0x70
>> [  358.554702]  ? __switch_to_asm+0x40/0x70
>> [  358.554703]  ? __switch_to_asm+0x34/0x70
>> [  358.554704]  ? __switch_to_asm+0x40/0x70
>> [  358.554731]  amdgpu_drm_ioctl+0x44/0x80 [amdgpu]
>> [  358.554735]  do_vfs_ioctl+0x3f0/0x650
>> [  358.554737]  ? __schedule+0x28c/0x5a0
>> [  358.554738]  ksys_ioctl+0x59/0x90
>> [  358.554740]  __x64_sys_ioctl+0x11/0x20
>> [  358.554743]  do_syscall_64+0x43/0x110
>> [  358.554745]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  358.554747] RIP: 0033:0x7f1809e6638b
>> [  358.554749] Code: 0f 1e fa 48 8b 05 fd 9a 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 9a 0c 00 f7 d8 64 89 01 48
>> [  358.554750] RSP: 002b:00007fffac20a638 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>> [  358.554751] RAX: ffffffffffffffda RBX: 00007fffac20a690 RCX: 00007f1809e6638b
>> [  358.554752] RDX: 00007fffac20a690 RSI: 00000000c0286448 RDI: 000000000000000e
>> [  358.554753] RBP: 00000000c0286448 R08: 0000000101600000 R09: 000000000000000e
>> [  358.554754] R10: 00000000000000e0 R11: 0000000000000246 R12: 0000000000000000
>> [  358.554754] R13: 000000000000000e R14: 0000000000000001 R15: 0000563d48bfd1f0
>> [  358.554756] Modules linked in: amdgpu gpu_sched ttm r8152 efivarfs
>> [  358.554790] ---[ end trace e0d54f6c49902356 ]---
>> [  358.554824] RIP: 0010:gmc_v8_0_get_vm_pde+0x10/0x20 [amdgpu]
>>
>> (the gmc_v8 bug triggers regardless of whether I'm running piglit on 
>> my headless vega20 or directly on the carrizo).
>>
>> However, with patch 2 of 4 reverted I then get:
>>
>> [ 1471.338089] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.338647] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.339807] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.341699] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342348] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342474] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342532] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342583] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342636] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342694] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342745] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.342796] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.343555] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.350270] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.350351] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.350395] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.351895] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>> [ 1471.353995] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>> [ 1471.354179] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.354190] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>> [ 1471.354252] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.354259] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>> [ 1471.354302] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.354308] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>> [ 1471.354351] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't 
>> update BO_VA (-2)
>> [ 1471.354356] amdgpu 0000:00:01.0: failed to clear page tables on 
>> GEM object close (-2)
>>
>> So clearly that's not the fix either :-/
>>
>> (all on top of the latest drm-next from this morning).
>>
>> Tom
>>
>>
>> On 2020-01-23 9:21 a.m., Christian König wrote:
>>> Allow partial invalidation on unallocated PDs. This is useful when we
>>> need to silence faults to stop interrupt floods on Vega.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 23 ++++++++++++++++++-----
>>>   1 file changed, 18 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 8119f32ca94d..0f79c17118bf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1467,9 +1467,8 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>                * smaller than the address shift. Go to the next
>>>                * child entry and try again.
>>>                */
>>> -            if (!amdgpu_vm_pt_descendant(adev, &cursor))
>>> -                return -ENOENT;
>>> -            continue;
>>> +            if (amdgpu_vm_pt_descendant(adev, &cursor))
>>> +                continue;
>>>           } else if (frag >= parent_shift) {
>>>               /* If the fragment size is even larger than the parent
>>>                * shift we should go up one level and check it again.
>>> @@ -1480,8 +1479,19 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>           }
>>>             pt = cursor.entry->base.bo;
>>> -        if (!pt)
>>> -            return -ENOENT;
>>> +        if (!pt) {
>>> +            /* We need all PDs and PTs for mapping something, */
>>> +            if (flags & AMDGPU_PTE_VALID)
>>> +                return -ENOENT;
>>> +
>>> +            /* but unmapping something can happen at a higher
>>> +             * level. */
>>> +            if (!amdgpu_vm_pt_ancestor(&cursor))
>>> +                return -EINVAL;
>>> +
>>> +            pt = cursor.entry->base.bo;
>>> +            shift = parent_shift;
>>> +        }
>>>             /* Looks good so far, calculate parameters for the update */
>>>           incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
>>> @@ -1495,6 +1505,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>               uint64_t upd_end = min(entry_end, frag_end);
>>>               unsigned nptes = (upd_end - frag_start) >> shift;
>>>   +            /* This can happen when we set higher level PDs to
>>> +             * silent to stop fault floods. */
>>> +            nptes = max(nptes, 1u);
>>>               amdgpu_vm_update_flags(params, pt, cursor.level,
>>>                              pe_start, dst, nptes, incr,
>>>                              flags | AMDGPU_PTE_FRAG(frag));
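
A note on the nptes clamp in the last hunk above: once the update is 
retried one level up, a single PD entry can cover far more address 
space than the range being invalidated, so the right shift makes the 
integer division come out as zero; clamping to one ensures the 
covering entry is still written. A minimal sketch with made-up numbers 
(the shift and range sizes are illustrative, not taken from the 
driver):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* Say the update is retried one level up, where each
             * entry covers 2^18 GPU pages, but the range being
             * invalidated is only 512 pages long. */
            unsigned int shift = 18;
            uint64_t frag_start = 0, upd_end = 512;

            unsigned int nptes = (upd_end - frag_start) >> shift; /* == 0 */
            if (nptes < 1)
                    nptes = 1; /* still write the one covering PD entry */

            printf("nptes = %u\n", nptes);
            return 0;
    }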


