[PATCH 2/4] drm/amdgpu: allow higher level PD invalidations

Tom St Denis tom.stdenis at amd.com
Mon Jan 27 19:18:54 UTC 2020


No problemo, maybe we should split that BUG() statement to catch the 
two halves of the PDE address separately so next time it's more 
obvious which half tripped.
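
Something like this, maybe (just a sketch of gmc_v8_0_get_vm_pde(); the 
combined mask is taken straight from the RAX value in the oops below, 
but treat the exact function body here as an assumption):

static void gmc_v8_0_get_vm_pde(struct amdgpu_device *adev, int level,
				uint64_t *addr, uint64_t *flags)
{
	/* low half: the PDE address must be GPU-page aligned */
	BUG_ON(*addr & 0x0000000000000FFFULL);
	/* high half: bits 40:63 are reserved; this is the check a
	 * sign-extended address would trip */
	BUG_ON(*addr & 0xFFFFFF0000000000ULL);
}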

For reference though, your 4 patches have applied cleanly on top of 
drm-next every day and work fine on my navi10/raven1 system. So it 
seems like just gmc8 is affected (I don't have any gmc 6/7 hardware 
to test).

Tom


On 2020-01-27 2:14 p.m., Christian König wrote:
> Oh, thanks for that hint! I've been staring at the code for a day now 
> without a clue what's going wrong.
>
> I hadn't realized that something could have been sign extended!
>
> I'll probably figure it out by tomorrow now,
> Christian.
>
> On 27.01.20 at 18:37, Tom St Denis wrote:
>> The offending PDE address is: "3ffffeff600000"
>>
>> Which looks like it was sign extended into the "reserved" section 
>> above bit 39 (0x3fff landing in bits 40:53), hence triggering the 
>> BUG() in the gmc_v8 driver.
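>>
>> A quick standalone check of that observation (user-space code, not 
>> anything from the kernel; the mask is copied from the RAX value in 
>> the oops below):
>>
>> #include <stdint.h>
>> #include <stdio.h>
>>
>> int main(void)
>> {
>> 	uint64_t pde  = 0x3ffffeff600000ULL;   /* offending PDE */
>> 	uint64_t mask = 0xFFFFFF0000000FFFULL; /* gmc_v8 BUG() mask */
>>
>> 	/* prints 0x3fff0000000000 -- only bits 40:53 are set */
>> 	printf("%#llx\n", (unsigned long long)(pde & mask));
>> 	return 0;
>> }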
>>
>> Tom
>>
>> On 2020-01-27 9:57 a.m., Tom St Denis wrote:
>>> Reverting this patch avoids the gmc_v8 errors (I previously sent 
>>> kernel logs, here's one for convenience):
>>>
>>> [  358.554335] ------------[ cut here ]------------
>>> [  358.554338] kernel BUG at drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c:725!
>>> [  358.554351] invalid opcode: 0000 [#1] SMP NOPTI
>>> [  358.554354] CPU: 0 PID: 4551 Comm: Xwayland Not tainted 5.4.0-rc7+ #14
>>> [  358.554355] Hardware name: System manufacturer System Product Name/TUF B350M-PLUS GAMING, BIOS 5220 09/12/2019
>>> [  358.554452] RIP: 0010:gmc_v8_0_get_vm_pde+0x10/0x20 [amdgpu]
>>> [  358.554455] Code: 31 f6 48 89 df e8 30 e9 ff ff e9 28 ff ff ff e8 16 d6 21 f9 66 0f 1f 44 00 00 48 b8 ff 0f 00 00 00 ff ff ff 48 85 02 75 01 c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 fd e8 c7
>>> [  358.554456] RSP: 0018:ffffa28142287a00 EFLAGS: 00010206
>>> [  358.554458] RAX: ffffff0000000fff RBX: 0000000000000000 RCX: ffffa28142287a78
>>> [  358.554459] RDX: ffffa28142287a50 RSI: 0000000000000002 RDI: ffff8b9be15e0000
>>> [  358.554460] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
>>> [  358.554461] R10: 000000000000000f R11: 0000000000000406 R12: 0000000000002030
>>> [  358.554462] R13: 003ffffefea00000 R14: 0000000000101c00 R15: ffffa28142287af0
>>> [  358.554464] FS:  00007f180a48ba80(0000) GS:ffff8b9be6c00000(0000) knlGS:0000000000000000
>>> [  358.554465] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  358.554466] CR2: 00007f3de8f5dcc0 CR3: 00000002170c8000 CR4: 00000000001406f0
>>> [  358.554467] Call Trace:
>>> [  358.554502]  amdgpu_vm_update_ptes+0x28a/0x7f0 [amdgpu]
>>> [  358.554534]  ? amdgpu_sync_resv+0x34/0x190 [amdgpu]
>>> [  358.554565]  amdgpu_vm_bo_update_mapping+0x12b/0x160 [amdgpu]
>>> [  358.554596]  amdgpu_vm_bo_update+0x333/0x6a0 [amdgpu]
>>> [  358.554626]  amdgpu_gem_va_ioctl+0x3c1/0x3e0 [amdgpu]
>>> [  358.554658]  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
>>> [  358.554663]  drm_ioctl_kernel+0xa5/0xf0
>>> [  358.554665]  drm_ioctl+0x1df/0x366
>>> [  358.554695]  ? amdgpu_gem_va_map_flags+0x60/0x60 [amdgpu]
>>> [  358.554698]  ? __switch_to_asm+0x34/0x70
>>> [  358.554699]  ? __switch_to_asm+0x40/0x70
>>> [  358.554701]  ? __switch_to_asm+0x34/0x70
>>> [  358.554702]  ? __switch_to_asm+0x40/0x70
>>> [  358.554703]  ? __switch_to_asm+0x34/0x70
>>> [  358.554704]  ? __switch_to_asm+0x40/0x70
>>> [  358.554731]  amdgpu_drm_ioctl+0x44/0x80 [amdgpu]
>>> [  358.554735]  do_vfs_ioctl+0x3f0/0x650
>>> [  358.554737]  ? __schedule+0x28c/0x5a0
>>> [  358.554738]  ksys_ioctl+0x59/0x90
>>> [  358.554740]  __x64_sys_ioctl+0x11/0x20
>>> [  358.554743]  do_syscall_64+0x43/0x110
>>> [  358.554745]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  358.554747] RIP: 0033:0x7f1809e6638b
>>> [  358.554749] Code: 0f 1e fa 48 8b 05 fd 9a 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 9a 0c 00 f7 d8 64 89 01 48
>>> [  358.554750] RSP: 002b:00007fffac20a638 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [  358.554751] RAX: ffffffffffffffda RBX: 00007fffac20a690 RCX: 00007f1809e6638b
>>> [  358.554752] RDX: 00007fffac20a690 RSI: 00000000c0286448 RDI: 000000000000000e
>>> [  358.554753] RBP: 00000000c0286448 R08: 0000000101600000 R09: 000000000000000e
>>> [  358.554754] R10: 00000000000000e0 R11: 0000000000000246 R12: 0000000000000000
>>> [  358.554754] R13: 000000000000000e R14: 0000000000000001 R15: 0000563d48bfd1f0
>>> [  358.554756] Modules linked in: amdgpu gpu_sched ttm r8152 efivarfs
>>> [  358.554790] ---[ end trace e0d54f6c49902356 ]---
>>> [  358.554824] RIP: 0010:gmc_v8_0_get_vm_pde+0x10/0x20 [amdgpu]
>>>
>>> (the gmc_v8 bug triggers regardless of whether I'm running piglit on 
>>> my headless vega20 or directly on the carrizo).
>>>
>>> However, with patch 2 of 4 reverted I then get:
>>>
>>> [ 1471.338089] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.338647] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.339807] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.341699] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342348] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342474] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342532] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342583] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342636] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342694] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342745] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.342796] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.343555] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.350270] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.350351] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.350395] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.351895] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>> [ 1471.353995] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>> [ 1471.354179] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.354190] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>> [ 1471.354252] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.354259] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>> [ 1471.354302] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.354308] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>> [ 1471.354351] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
>>> [ 1471.354356] amdgpu 0000:00:01.0: failed to clear page tables on GEM object close (-2)
>>>
>>> So clearly that's not the fix either :-/
>>>
>>> (all on top of the latest drm-next from this morning).
>>>
>>> Tom
>>>
>>>
>>> On 2020-01-23 9:21 a.m., Christian König wrote:
>>>> Allow partial invalidation on unallocated PDs. This is useful when we
>>>> need to silence faults to stop interrupt floods on Vega.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 23 ++++++++++++++++++-----
>>>>   1 file changed, 18 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 8119f32ca94d..0f79c17118bf 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -1467,9 +1467,8 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>>  			 * smaller than the address shift. Go to the next
>>>>  			 * child entry and try again.
>>>>  			 */
>>>> -			if (!amdgpu_vm_pt_descendant(adev, &cursor))
>>>> -				return -ENOENT;
>>>> -			continue;
>>>> +			if (amdgpu_vm_pt_descendant(adev, &cursor))
>>>> +				continue;
>>>>  		} else if (frag >= parent_shift) {
>>>>  			/* If the fragment size is even larger than the parent
>>>>  			 * shift we should go up one level and check it again.
>>>> @@ -1480,8 +1479,19 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>>  		}
>>>>  
>>>>  		pt = cursor.entry->base.bo;
>>>> -		if (!pt)
>>>> -			return -ENOENT;
>>>> +		if (!pt) {
>>>> +			/* We need all PDs and PTs for mapping something, */
>>>> +			if (flags & AMDGPU_PTE_VALID)
>>>> +				return -ENOENT;
>>>> +
>>>> +			/* but unmapping something can happen at a higher
>>>> +			 * level. */
>>>> +			if (!amdgpu_vm_pt_ancestor(&cursor))
>>>> +				return -EINVAL;
>>>> +
>>>> +			pt = cursor.entry->base.bo;
>>>> +			shift = parent_shift;
>>>> +		}
>>>>  
>>>>  		/* Looks good so far, calculate parameters for the update */
>>>>  		incr = (uint64_t)AMDGPU_GPU_PAGE_SIZE << shift;
>>>> @@ -1495,6 +1505,9 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params,
>>>>  			uint64_t upd_end = min(entry_end, frag_end);
>>>>  			unsigned nptes = (upd_end - frag_start) >> shift;
>>>>  
>>>> +			/* This can happen when we set higher level PDs to
>>>> +			 * silent to stop fault floods. */
>>>> +			nptes = max(nptes, 1u);
>>>>  			amdgpu_vm_update_flags(params, pt, cursor.level,
>>>>  					       pe_start, dst, nptes, incr,
>>>>  					       flags | AMDGPU_PTE_FRAG(frag));
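>>>
>>> FWIW, my reading of why that last clamp is needed (standalone 
>>> arithmetic with made-up values, so treat the shift and page numbers 
>>> as assumptions): once the walk falls back to a parent level, shift 
>>> is the parent shift, so a range smaller than one parent entry yields 
>>> nptes == 0 even though the covering PDE still has to be written:
>>>
>>> #include <stdint.h>
>>> #include <stdio.h>
>>>
>>> int main(void)
>>> {
>>> 	unsigned shift = 18;            /* assumed parent-level shift */
>>> 	uint64_t frag_start = 0x101c00; /* GPU page numbers, assumed */
>>> 	uint64_t upd_end    = 0x101e00; /* < (1 << shift) pages apart */
>>>
>>> 	unsigned nptes = (upd_end - frag_start) >> shift; /* 0 */
>>> 	nptes = nptes ? nptes : 1;      /* the max(nptes, 1u) clamp */
>>> 	printf("nptes = %u\n", nptes);  /* prints: nptes = 1 */
>>> 	return 0;
>>> }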
>

