TTM allocation failure under memory pressure on suspend

Lorenz Brun lorenz at brun.one
Wed May 29 19:18:13 UTC 2019


Hi,

I have an RX 570 which fails to suspend properly under memory pressure, and the screens stay black after waking up.
It looks like an allocation failure during TTM VRAM eviction is to blame:

[635471.240411] kworker/u24:26: page allocation failure: order:0, mode:0x620402(GFP_NOIO|__GFP_HIGHMEM|__GFP_RETRY_MAYFAIL|__GFP_HARDWALL), nodemask=(null),cpuset=/,mems_allowed=0
[635471.240416] CPU: 9 PID: 20884 Comm: kworker/u24:26 Tainted: P           OE     5.0.0-13-generic #14-Ubuntu
[635471.240417] Hardware name: MSI MS-7885/X99A SLI PLUS(MS-7885), BIOS 1.80 03/20/2015
[635471.240421] Workqueue: events_unbound async_run_entry_fn
[635471.240421] Call Trace:
[635471.240426]  dump_stack+0x63/0x8a
[635471.240428]  warn_alloc.cold.119+0x7b/0xfb
[635471.240429]  __alloc_pages_slowpath+0xe63/0xea0
[635471.240432]  ? flush_tlb_all+0x1c/0x20
[635471.240433]  ? change_page_attr_set_clr+0x164/0x1f0
[635471.240434]  __alloc_pages_nodemask+0x2c4/0x2e0
[635471.240437]  alloc_pages_current+0x81/0xe0
[635471.240442]  ttm_alloc_new_pages.isra.16+0x95/0x1e0 [ttm]
[635471.240444]  ttm_page_pool_get_pages+0x16b/0x380 [ttm]
[635471.240446]  ttm_pool_populate+0x1a3/0x4a0 [ttm]
[635471.240448]  ttm_populate_and_map_pages+0x28/0x250 [ttm]
[635471.240450]  ? ttm_dma_tt_alloc_page_directory+0x2d/0x60 [ttm]
[635471.240490]  amdgpu_ttm_tt_populate+0x56/0xe0 [amdgpu]
[635471.240493]  ttm_tt_populate.part.9+0x22/0x60 [ttm]
[635471.240495]  ttm_tt_bind+0x4f/0x60 [ttm]
[635471.240497]  ttm_bo_handle_move_mem+0x26c/0x500 [ttm]
[635471.240499]  ttm_bo_evict+0x142/0x1c0 [ttm]
[635471.240501]  ttm_mem_evict_first+0x19a/0x220 [ttm]
[635471.240504]  ttm_bo_force_list_clean+0xa1/0x170 [ttm]
[635471.240506]  ttm_bo_evict_mm+0x2e/0x30 [ttm]
[635471.240531]  amdgpu_bo_evict_vram+0x1a/0x20 [amdgpu]
[635471.240554]  amdgpu_device_suspend+0x1dd/0x3d0 [amdgpu]
[635471.240578]  amdgpu_pmops_suspend+0x1f/0x30 [amdgpu]
[635471.240579]  pci_pm_suspend+0x76/0x130
[635471.240580]  ? pci_pm_freeze+0xf0/0xf0
[635471.240582]  dpm_run_callback+0x66/0x150
[635471.240582]  __device_suspend+0x110/0x490
[635471.240583]  async_suspend+0x1f/0x90
[635471.240584]  async_run_entry_fn+0x3c/0x150
[635471.240586]  process_one_work+0x20f/0x410
[635471.240587]  worker_thread+0x34/0x400
[635471.240589]  kthread+0x120/0x140
[635471.240589]  ? process_one_work+0x410/0x410
[635471.240591]  ? __kthread_parkme+0x70/0x70
[635471.240592]  ret_from_fork+0x35/0x40
…
[635471.241994] [TTM] Buffer eviction failed
[635471.627554] [TTM] Buffer eviction failed

Subsequently, it fails to wake up (all three screens stay black) because of an initialization failure (-110 is -ETIMEDOUT):

[635472.216323] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[635472.216354] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v8_0> failed -110
[635472.216384] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).
[635472.216387] dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -110
[635472.216390] PM: Device 0000:04:00.0 failed to resume async: error -110

I’m pretty sure the problem is the GFP_NOIO flag: it makes it impossible for the kernel to swap anything out, so it eventually gives up trying to satisfy the allocation. I usually run under quite a lot of memory pressure with a lot of swap (32GiB RAM + 48GiB swap; >48GiB memory usage is normal for me). I have looked at the code in question, but I’m not sure where the flag comes from; neither ttm nor amdgpu appears to set GFP_NOIO itself. TTM does seem to have per-pool allocation flags, and somehow GFP_NOIO ends up enabled for the amdgpu pool.
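
For what it’s worth, here is a minimal sketch of how I understand the flag to matter (illustrative only; evict_helper_alloc is a made-up helper, not the actual TTM pool code). GFP_KERNEL carries __GFP_IO and __GFP_FS, so direct reclaim may write dirty pages out to swap, while GFP_NOIO is just __GFP_RECLAIM, so reclaim can only drop clean pages. Combined with __GFP_RETRY_MAYFAIL, which forbids the endless-retry/OOM-kill path, the allocation simply fails once clean memory runs out:

#include <linux/gfp.h>
#include <linux/mm_types.h>

/* Hypothetical helper, not the real TTM pool code: it only shows how the
 * gfp mask from the warning above is built up and why it cannot swap. */
static struct page *evict_helper_alloc(bool allow_io)
{
	/* GFP_KERNEL = __GFP_RECLAIM | __GFP_IO | __GFP_FS,
	 * GFP_NOIO   = __GFP_RECLAIM (no swap writeback allowed) */
	gfp_t gfp = allow_io ? GFP_KERNEL : GFP_NOIO;

	/* Mirror the remaining flags from the failure above. */
	gfp |= __GFP_HIGHMEM | __GFP_RETRY_MAYFAIL | __GFP_HARDWALL;

	/* Without __GFP_IO reclaim cannot start swap I/O, and with
	 * __GFP_RETRY_MAYFAIL the allocator gives up instead of looping
	 * or invoking the OOM killer, so this returns NULL under pressure. */
	return alloc_pages(gfp, 0);
}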

Thanks,
Lorenz

