TTM allocation failure under memory pressure on suspend
Lorenz Brun
lorenz at brun.one
Wed May 29 19:18:13 UTC 2019
Hi,
I have an RX 570 which fails to suspend properly under memory pressure; the screens stay black after waking up.
It looks like an allocation failure in TTM's VRAM eviction path is to blame:
[635471.240411] kworker/u24:26: page allocation failure: order:0, mode:0x620402(GFP_NOIO|__GFP_HIGHMEM|__GFP_RETRY_MAYFAIL|__GFP_HARDWALL), nodemask=(null),cpuset=/,mems_allowed=0
[635471.240416] CPU: 9 PID: 20884 Comm: kworker/u24:26 Tainted: P OE 5.0.0-13-generic #14-Ubuntu
[635471.240417] Hardware name: MSI MS-7885/X99A SLI PLUS(MS-7885), BIOS 1.80 03/20/2015
[635471.240421] Workqueue: events_unbound async_run_entry_fn
[635471.240421] Call Trace:
[635471.240426] dump_stack+0x63/0x8a
[635471.240428] warn_alloc.cold.119+0x7b/0xfb
[635471.240429] __alloc_pages_slowpath+0xe63/0xea0
[635471.240432] ? flush_tlb_all+0x1c/0x20
[635471.240433] ? change_page_attr_set_clr+0x164/0x1f0
[635471.240434] __alloc_pages_nodemask+0x2c4/0x2e0
[635471.240437] alloc_pages_current+0x81/0xe0
[635471.240442] ttm_alloc_new_pages.isra.16+0x95/0x1e0 [ttm]
[635471.240444] ttm_page_pool_get_pages+0x16b/0x380 [ttm]
[635471.240446] ttm_pool_populate+0x1a3/0x4a0 [ttm]
[635471.240448] ttm_populate_and_map_pages+0x28/0x250 [ttm]
[635471.240450] ? ttm_dma_tt_alloc_page_directory+0x2d/0x60 [ttm]
[635471.240490] amdgpu_ttm_tt_populate+0x56/0xe0 [amdgpu]
[635471.240493] ttm_tt_populate.part.9+0x22/0x60 [ttm]
[635471.240495] ttm_tt_bind+0x4f/0x60 [ttm]
[635471.240497] ttm_bo_handle_move_mem+0x26c/0x500 [ttm]
[635471.240499] ttm_bo_evict+0x142/0x1c0 [ttm]
[635471.240501] ttm_mem_evict_first+0x19a/0x220 [ttm]
[635471.240504] ttm_bo_force_list_clean+0xa1/0x170 [ttm]
[635471.240506] ttm_bo_evict_mm+0x2e/0x30 [ttm]
[635471.240531] amdgpu_bo_evict_vram+0x1a/0x20 [amdgpu]
[635471.240554] amdgpu_device_suspend+0x1dd/0x3d0 [amdgpu]
[635471.240578] amdgpu_pmops_suspend+0x1f/0x30 [amdgpu]
[635471.240579] pci_pm_suspend+0x76/0x130
[635471.240580] ? pci_pm_freeze+0xf0/0xf0
[635471.240582] dpm_run_callback+0x66/0x150
[635471.240582] __device_suspend+0x110/0x490
[635471.240583] async_suspend+0x1f/0x90
[635471.240584] async_run_entry_fn+0x3c/0x150
[635471.240586] process_one_work+0x20f/0x410
[635471.240587] worker_thread+0x34/0x400
[635471.240589] kthread+0x120/0x140
[635471.240589] ? process_one_work+0x410/0x410
[635471.240591] ? __kthread_parkme+0x70/0x70
[635471.240592] ret_from_fork+0x35/0x40
…
[635471.241994] [TTM] Buffer eviction failed
[635471.627554] [TTM] Buffer eviction failed
Subsequently it fails to wake up (all 3 screens black) because of an initialization failure:
[635472.216323] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[635472.216354] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[635472.216384] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[635472.216387] dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -110
[635472.216390] PM: Device 0000:04:00.0 failed to resume async: error -110
I'm pretty sure the problem is GFP_NOIO being set, which makes it impossible for the kernel to swap anything out, so it eventually gives up trying to satisfy the allocation. I usually run under quite a lot of memory pressure with plenty of swap (32 GiB RAM + 48 GiB swap; >48 GiB memory usage is common). I have looked at the code in question, but I'm not sure where the flag is coming from: neither ttm nor amdgpu appears to set GFP_NOIO directly. TTM does have per-pool allocation flags, and somehow GFP_NOIO is getting enabled there for the amdgpu pool.
Thanks,
Lorenz
More information about the amd-gfx mailing list