[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Jul 2 22:55:24 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #13 from Andrey Grodzovsky <andrey.grodzovsky at amd.com> ---
(In reply to dwagner from comment #12)
> (In reply to Andrey Grodzovsky from comment #10)
> > Created attachment 140418 [details] [review] [review]
> > drm/amdgpu: Verify root PD is mapped into kernel address space.
> > 
> > dwagner, please try this patch. Fixes the issue for me and I observed no
> > suspend/resume issues.
> 
> While I can start X11 with this patch applied to current
> amd-staging-drm-next, attempts to resume from S3 fail consistently.
> 
> The following related output is emitted right before the suspend:
> 
> Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ...
> (elapsed 0.000 seconds) done.
> Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend
> to debug)
> Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
> Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
> Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
> Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
> Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...
> 
> (I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have
> seen it some other times in conjunction with heavy uses of the amdgpu
> driver.)
> 
> 
> Then, upon resume, the following messages are emitted:
> 
> Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
> Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at
> 0x000000F400300000).
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 148 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 189 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 306 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 5e ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 18a ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 148 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 145 ret is 0 
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
>                                failed to send message 146 ret is 0 
> Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR*
> amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]]
> *ERROR* resume of IP block <gfx_v8_0> failed -22
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR*
> amdgpu_device_ip_resume failed (-22).
> Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0
> returns -22
> Jul 02 21:31:33 ryzen kernel: PM: Device 0000:0a:00.0 failed to resume
> async: error -22
> Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
> Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
> Jul 02 21:31:33 ryzen kernel: PM: suspend exit
> Jul 02 21:31:33 ryzen kernel: BUG: unable to handle kernel paging request at
> 0000000000001000
> Jul 02 21:31:33 ryzen kernel: PGD 0 P4D 0 
> Jul 02 21:31:33 ryzen kernel: Oops: 0002 [#1] SMP
> Jul 02 21:31:33 ryzen kernel: CPU: 14 PID: 791 Comm: amdgpu_cs:0 Tainted: G 
> W  O      4.18.0-rc1-amd+ #45
> Jul 02 21:31:33 ryzen kernel: Hardware name: System manufacturer System
> Product Name/PRIME X370-PRO, BIOS 4011 04/19/2018
> Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44
> 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
> Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001
> RCX: 000000000fe004f1
> Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000
> RDI: ffff8807e2f70000
> Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1
> R09: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000
> R12: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18
> R15: 000000000fe01000
> Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000)
> GS:ffff88081ef80000(0000) knlGS:0000000000000000
> Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000
> CR4: 00000000003406e0
> Jul 02 21:31:33 ryzen kernel: Call Trace:
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_cpu_set_ptes+0x76/0xe0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_update_ptes+0x1d3/0x2e0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_frag_ptes+0xae/0x130 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update_mapping+0xed/0x410
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_vm_do_copy_ptes+0xa0/0xa0 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_vm_bo_update+0x310/0x680 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_cs_ioctl+0x1092/0x1a50 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  drm_ioctl_kernel+0xa7/0xf0 [drm]
> Jul 02 21:31:33 ryzen kernel:  drm_ioctl+0x2f1/0x3c0 [drm]
> Jul 02 21:31:33 ryzen kernel:  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> Jul 02 21:31:33 ryzen kernel:  do_vfs_ioctl+0xa4/0x620
> Jul 02 21:31:33 ryzen kernel:  ? __se_sys_futex+0x138/0x180
> Jul 02 21:31:33 ryzen kernel:  ksys_ioctl+0x60/0x90
> Jul 02 21:31:33 ryzen kernel:  __x64_sys_ioctl+0x16/0x20
> Jul 02 21:31:33 ryzen kernel:  do_syscall_64+0x48/0xf0
> Jul 02 21:31:33 ryzen kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Jul 02 21:31:33 ryzen kernel: RIP: 0033:0x7f8b66c92667
> Jul 02 21:31:33 ryzen kernel: Code: 00 00 90 48 8b 05 e9 67 2c 00 64 c7 00
> 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 8>
> Jul 02 21:31:33 ryzen kernel: RSP: 002b:00007f8b57265a98 EFLAGS: 00000246
> ORIG_RAX: 0000000000000010
> Jul 02 21:31:33 ryzen kernel: RAX: ffffffffffffffda RBX: 00007f8b57265b88
> RCX: 00007f8b66c92667
> Jul 02 21:31:33 ryzen kernel: RDX: 00007f8b57265b00 RSI: 00000000c0186444
> RDI: 000000000000000b
> Jul 02 21:31:33 ryzen kernel: RBP: 00007f8b57265b00 R08: 00007f8b57265bb0
> R09: 0000000000000010
> Jul 02 21:31:33 ryzen kernel: R10: 00007f8b57265bb0 R11: 0000000000000246
> R12: 00000000c0186444
> Jul 02 21:31:33 ryzen kernel: R13: 000000000000000b R14: 0000000000000002
> R15: 0000000000000000
> Jul 02 21:31:33 ryzen kernel: Modules linked in: it87(O) joydev mousedev
> hid_generic hidp hid ipt_REJECT nf_reject_ipv4 nf_l>
> Jul 02 21:31:33 ryzen kernel:  serio_raw crc32_pclmul atkbd
> ghash_clmulni_intel libps2 pcbc ahci libahci xhci_pci libata aes>
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: ---[ end trace 517a8a72887251f0 ]---
> Jul 02 21:31:33 ryzen kernel: RIP: 0010:gmc_v8_0_set_pte_pde+0x1b/0x30
> [amdgpu]
> Jul 02 21:31:33 ryzen kernel: Code: 80 d8 00 00 00 e9 25 78 60 e1 0f 1f 44
> 00 00 0f 1f 44 00 00 48 b8 00 f0 ff ff ff 00 00 0>
> Jul 02 21:31:33 ryzen kernel: RSP: 0018:ffffc90003e73898 EFLAGS: 00010202
> Jul 02 21:31:33 ryzen kernel: RAX: 000000fffffff000 RBX: 0000000000000001
> RCX: 000000000fe004f1
> Jul 02 21:31:33 ryzen kernel: RDX: 0000000000001000 RSI: 0000000000001000
> RDI: ffff8807e2f70000
> Jul 02 21:31:33 ryzen kernel: RBP: 0000000000001000 R08: 00000000000004f1
> R09: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R10: ffffffffa03ac7e0 R11: ffff8807daf78000
> R12: 0000000000001000
> Jul 02 21:31:33 ryzen kernel: R13: 0000000000000200 R14: ffffc90003e73a18
> R15: 000000000fe01000
> Jul 02 21:31:33 ryzen kernel: FS:  00007f8b57266700(0000)
> GS:ffff88081ef80000(0000) knlGS:0000000000000000
> Jul 02 21:31:33 ryzen kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Jul 02 21:31:33 ryzen kernel: CR2: 0000000000001000 CR3: 00000007dbbda000
> CR4: 00000000003406e0
> 
> (At this point, the machine is just dead, and reacts upon nothing.)
> 
> So something is still wrong at amdgpu_vm_cpu_set_ptes+0x76


My guess is that on resume from S3 root PD needs to be again mapped to CPU
address space. Maybe changing the patch according  to Christian's advise will
be enough. I will take a look tomorrow. Or it has to do with the resume failure
you are experiencing. What ASIC are you using ? I also tested with gfx8 ASIC
and haven't observed any issues with resume. Did you update the firmware for
this ASIC to latest #

-- 
You are receiving this mail because:
You are the assignee for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20180702/9c274f40/attachment-0001.html>


More information about the dri-devel mailing list