printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
Tetsuo Handa
penguin-kernel at I-love.SAKURA.ne.jp
Thu Jul 6 10:28:17 UTC 2017
Hello.
I noticed that printing to consoles can stop forever, because console drivers
can end up waiting for a memory allocation that cannot make progress.
I was testing a near-OOM situation and reproduced a case where allocation
requests fall into the infinite too_many_isolated() trap in shrink_inactive_list().
Since nothing is printed to consoles when we hit that situation, I tried to
confirm that I had actually reproduced it using SysRq-t. However, SysRq-t
printed nothing for some reason, so I used SysRq-c at uptime = 910.
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170706.txt.xz :
------------------------------------------------------------
[ 779.494100] Mem-Info:
[ 779.495493] active_anon:846948 inactive_anon:5416 isolated_anon:0
[ 779.495493] active_file:33 inactive_file:13 isolated_file:47
[ 779.495493] unevictable:0 dirty:11 writeback:0 unstable:0
[ 779.495493] slab_reclaimable:0 slab_unreclaimable:11
[ 779.495493] mapped:3580 shmem:5581 pagetables:7491 bounce:0
[ 779.495493] free:21354 free_pcp:87 free_cma:0
[ 779.510664] Node 0 active_anon:3387792kB inactive_anon:21664kB active_file:132kB inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):188kB mapped:14320kB dirty:44kB writeback:0kB shmem:22324kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3067904kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 779.522778] Node 0 DMA free:14836kB min:284kB low:352kB high:420kB active_anon:1024kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 779.533972] lowmem_reserve[]: 0 2700 3642 3642
[ 779.536321] Node 0 DMA32 free:53656kB min:49888kB low:62360kB high:74832kB active_anon:2710360kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2765492kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 779.548127] lowmem_reserve[]: 0 0 942 942
[ 779.550326] Node 0 Normal free:16924kB min:17404kB low:21752kB high:26100kB active_anon:676260kB inactive_anon:21664kB active_file:340kB inactive_file:244kB unevictable:0kB writepending:44kB present:1048576kB managed:964816kB mlocked:0kB kernel_stack:13600kB pagetables:29944kB bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
[ 779.563405] lowmem_reserve[]: 0 0 0 0
[ 779.565537] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 14836kB
[ 779.571291] Node 0 DMA32: 6*4kB (UM) 8*8kB (UM) 8*16kB (UM) 20*32kB (U) 29*64kB (UM) 32*128kB (UM) 13*256kB (UM) 9*512kB (UM) 10*1024kB (UM) 0*2048kB 7*4096kB (UME) = 53656kB
[ 779.578672] Node 0 Normal: 734*4kB (UE) 478*8kB (UME) 266*16kB (UE) 163*32kB (UE) 7*64kB (U) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17064kB
[ 779.585618] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 779.589721] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 779.593765] 5660 total pagecache pages
[ 779.596059] 0 pages in swap cache
[ 779.598144] Swap cache stats: add 0, delete 0, find 0/0
[ 779.600947] Free swap = 0kB
[ 779.602837] Total swap = 0kB
[ 779.604716] 1048445 pages RAM
[ 779.606691] 0 pages HighMem/MovableOnly
[ 779.608974] 111892 pages reserved
[ 779.611075] 0 pages cma reserved
[ 779.613118] 0 pages hwpoisoned
[ 779.615117] Out of memory: Kill process 3655 (c.out) score 999 or sacrifice child
[ 779.618790] Killed process 3655 (c.out) total-vm:4168kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[ 779.625401] oom_reaper: reaped process 3655 (c.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
** 1838 printk messages dropped ** [ 912.886103] __do_fault+0x19/0xf0
[ 912.886108] __handle_mm_fault+0xb0b/0x1030
[ 912.886114] handle_mm_fault+0xf4/0x220
[ 912.886117] __do_page_fault+0x25b/0x4a0
[ 912.886120] do_page_fault+0x30/0x80
[ 912.886124] ? do_syscall_64+0xfd/0x140
[ 912.886127] page_fault+0x28/0x30
(...snipped...)
[ 912.892027] kworker/3:3 D12120 3217 2 0x00000080
[ 912.892041] Workqueue: events console_callback
[ 912.892042] Call Trace:
[ 912.892047] __schedule+0x23f/0x5d0
[ 912.892051] schedule+0x31/0x80
[ 912.892056] schedule_preempt_disabled+0x9/0x10
[ 912.892061] __mutex_lock.isra.2+0x2ac/0x4d0
[ 912.892068] __mutex_lock_slowpath+0xe/0x10
[ 912.892072] ? __mutex_lock_slowpath+0xe/0x10
[ 912.892077] mutex_lock+0x2a/0x30
[ 912.892105] vmw_fb_pan_display+0x35/0x90 [vmwgfx]
[ 912.892114] fb_pan_display+0xca/0x160
[ 912.892118] bit_update_start+0x1b/0x40
[ 912.892123] fbcon_switch+0x4a6/0x630
[ 912.892128] redraw_screen+0x15a/0x240
[ 912.892132] ? update_attr.isra.3+0x90/0x90
[ 912.892139] complete_change_console+0x3d/0xd0
[ 912.892143] change_console+0x57/0x90
[ 912.892147] console_callback+0x116/0x190
[ 912.892153] process_one_work+0x1f5/0x390
[ 912.892156] worker_thread+0x46/0x410
[ 912.892161] ? __schedule+0x247/0x5d0
[ 912.892165] kthread+0xff/0x140
[ 912.892170] ? process_one_work+0x390/0x390
[ 912.892174] ? kthread_create_on_node+0x60/0x60
[ 912.892178] ? do_syscall_64+0x13a/0x140
[ 912.892181] ret_from_fork+0x25/0x30
(...snipped...)
[ 912.934633] kworker/0:0 D12824 4263 2 0x00000080
[ 912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
[ 912.934643] Call Trace:
[ 912.934645] __schedule+0x23f/0x5d0
[ 912.934646] schedule+0x31/0x80
[ 912.934647] schedule_timeout+0x189/0x290
[ 912.934649] ? del_timer_sync+0x40/0x40
[ 912.934650] io_schedule_timeout+0x19/0x40
[ 912.934651] ? io_schedule_timeout+0x19/0x40
[ 912.934653] congestion_wait+0x7d/0xd0
[ 912.934654] ? wait_woken+0x80/0x80
[ 912.934654] shrink_inactive_list+0x3e3/0x4d0
[ 912.934656] shrink_node_memcg+0x360/0x780
[ 912.934657] ? list_lru_count_one+0x65/0x70
[ 912.934658] shrink_node+0xdc/0x310
[ 912.934658] ? shrink_node+0xdc/0x310
[ 912.934659] do_try_to_free_pages+0xea/0x370
[ 912.934660] try_to_free_pages+0xc3/0x100
[ 912.934661] __alloc_pages_slowpath+0x441/0xd50
[ 912.934663] ? ___slab_alloc+0x1b6/0x590
[ 912.934664] __alloc_pages_nodemask+0x20c/0x250
[ 912.934665] alloc_pages_current+0x65/0xd0
[ 912.934666] new_slab+0x472/0x600
[ 912.934668] ___slab_alloc+0x41b/0x590
[ 912.934685] ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934691] ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934692] __slab_alloc+0x1b/0x30
[ 912.934693] ? __slab_alloc+0x1b/0x30
[ 912.934694] kmem_cache_alloc+0x16b/0x1c0
[ 912.934699] drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934702] vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
[ 912.934706] vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
[ 912.934708] process_one_work+0x1f5/0x390
[ 912.934709] worker_thread+0x46/0x410
[ 912.934710] ? __schedule+0x247/0x5d0
[ 912.934711] kthread+0xff/0x140
[ 912.934712] ? process_one_work+0x390/0x390
[ 912.934713] ? kthread_create_on_node+0x60/0x60
[ 912.934714] ret_from_fork+0x25/0x30
(...snipped...)
[ 912.934765] Showing busy workqueues and worker pools:
[ 912.934785] workqueue events: flags=0x0
[ 912.934798] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 912.934801] in-flight: 3217:console_callback{79553} console_callback{79553}
[ 912.934806] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 912.934807] in-flight: 4263:vmw_fb_dirty_flush [vmwgfx]{130989} vmw_fb_dirty_flush [vmwgfx]{130989}
[ 912.934849] workqueue events_freezable_power_: flags=0x84
[ 912.934865] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 912.934866] in-flight: 4264:disk_events_workfn{131975}
[ 912.934906] workqueue writeback: flags=0x4e
[ 912.934907] pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 912.934909] in-flight: 357:wb_workfn{130154} wb_workfn{130154}
[ 912.935347] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 33 3214 237
[ 912.935351] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 29 95 3215
[ 912.935364] pool 128: cpus=0-63 flags=0x4 nice=0 hung=130s workers=3 idle: 358 356
[ 925.093875] sysrq: SysRq : Trigger a crash
[ 925.093903] BUG: unable to handle kernel NULL pointer dereference at (null)
------------------------------------------------------------
Pressing SysRq-c caused all locks to be released (didn't it?), and the console
resumed printing pending output. The {value} after each work item is the delay
in jiffies since that work was queued (as suggested at
http://lkml.kernel.org/r/201706291957.JGH39511.tQMOFSLOFJVHOF@I-love.SAKURA.ne.jp ).
Since disk_events_workfn had been stalling for 131975 jiffies (CONFIG_HZ=1000)
as of uptime = 912, disk_events_workfn started stalling at uptime = 781
(which was 2 seconds after the OOM killer/reaper last reclaimed some memory).
Since vmw_fb_dirty_flush had been stalling for 130989 jiffies,
vmw_fb_dirty_flush started stalling at uptime = 782. And
drm_modeset_lock_all() from the vmw_fb_dirty_flush work started a
GFP_KERNEL memory allocation
----------
void drm_modeset_lock_all(struct drm_device *dev)
{
	struct drm_mode_config *config = &dev->mode_config;
	struct drm_modeset_acquire_ctx *ctx;
	int ret;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (WARN_ON(!ctx))
		return;
(...snipped...)
}
----------
with par->bo_mutex mutex held at vmw_fb_dirty_flush().
----------
static void vmw_fb_dirty_flush(struct work_struct *work)
{
	struct vmw_fb_par *par = container_of(work, struct vmw_fb_par,
					      local_work.work);
	struct vmw_private *vmw_priv = par->vmw_priv;
	struct fb_info *info = vmw_priv->fb_info;
	unsigned long irq_flags;
	s32 dst_x1, dst_x2, dst_y1, dst_y2, w, h;
	u32 cpp, max_x, max_y;
	struct drm_clip_rect clip;
	struct drm_framebuffer *cur_fb;
	u8 *src_ptr, *dst_ptr;

	if (vmw_priv->suspended)
		return;

	mutex_lock(&par->bo_mutex);
(...snipped...)
	WARN_ON_ONCE(par->set_fb->funcs->dirty(cur_fb, NULL, 0, 0,
					       &clip, 1));
(...snipped...)
	mutex_unlock(&par->bo_mutex);
}
----------
Since console_callback had been stalling for 79553 jiffies, console_callback
started stalling at uptime = 833. (Thus, maybe there were OOM killer
activities a few more times, but if so they are hidden behind the
"** 1838 printk messages dropped **" message.)
vmw_fb_pan_display() from fb_pan_display() from the console_callback work was
waiting for the par->bo_mutex mutex held at vmw_fb_dirty_flush(), which cannot
make progress due to the infinite too_many_isolated() trap.
----------
static int vmw_fb_pan_display(struct fb_var_screeninfo *var,
			      struct fb_info *info)
{
	struct vmw_fb_par *par = info->par;

	if ((var->xoffset + var->xres) > var->xres_virtual ||
	    (var->yoffset + var->yres) > var->yres_virtual) {
		DRM_ERROR("Requested panning can not fit in framebuffer\n");
		return -EINVAL;
	}

	mutex_lock(&par->bo_mutex);
	par->fb_x = var->xoffset;
	par->fb_y = var->yoffset;
	if (par->set_fb)
		vmw_fb_dirty_mark(par, par->fb_x, par->fb_y, par->set_fb->width,
				  par->set_fb->height);
	mutex_unlock(&par->bo_mutex);

	return 0;
}
----------
As a result, the console was not able to print the SysRq-t output.
So, how should we avoid this problem? Of course, this problem would not
have occurred in the first place if we did not fall into the infinite
too_many_isolated() trap in shrink_inactive_list() (as proposed at
http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org ).
But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
memory allocations, because consoles should be as responsive as printk()?