printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
Tetsuo Handa
penguin-kernel at I-love.SAKURA.ne.jp
Thu Jul 6 10:28:17 UTC 2017
Hello.
I noticed that printing to consoles can stop forever, because console drivers
can end up waiting for a memory allocation that cannot make progress.
I was testing a near-OOM situation and reproduced a case where allocation
requests fall into the infinite too_many_isolated() trap in shrink_inactive_list().
Since nothing is printed to consoles when we hit that situation, I tried to
confirm that I had actually reproduced it using SysRq-t. However, SysRq-t
printed nothing for some reason, so I used SysRq-c at uptime = 910.
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170706.txt.xz :
------------------------------------------------------------
[ 779.494100] Mem-Info:
[ 779.495493] active_anon:846948 inactive_anon:5416 isolated_anon:0
[ 779.495493] active_file:33 inactive_file:13 isolated_file:47
[ 779.495493] unevictable:0 dirty:11 writeback:0 unstable:0
[ 779.495493] slab_reclaimable:0 slab_unreclaimable:11
[ 779.495493] mapped:3580 shmem:5581 pagetables:7491 bounce:0
[ 779.495493] free:21354 free_pcp:87 free_cma:0
[ 779.510664] Node 0 active_anon:3387792kB inactive_anon:21664kB active_file:132kB inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):188kB mapped:14320kB dirty:44kB writeback:0kB shmem:22324kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3067904kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 779.522778] Node 0 DMA free:14836kB min:284kB low:352kB high:420kB active_anon:1024kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 779.533972] lowmem_reserve[]: 0 2700 3642 3642
[ 779.536321] Node 0 DMA32 free:53656kB min:49888kB low:62360kB high:74832kB active_anon:2710360kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2765492kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 779.548127] lowmem_reserve[]: 0 0 942 942
[ 779.550326] Node 0 Normal free:16924kB min:17404kB low:21752kB high:26100kB active_anon:676260kB inactive_anon:21664kB active_file:340kB inactive_file:244kB unevictable:0kB writepending:44kB present:1048576kB managed:964816kB mlocked:0kB kernel_stack:13600kB pagetables:29944kB bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
[ 779.563405] lowmem_reserve[]: 0 0 0 0
[ 779.565537] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 14836kB
[ 779.571291] Node 0 DMA32: 6*4kB (UM) 8*8kB (UM) 8*16kB (UM) 20*32kB (U) 29*64kB (UM) 32*128kB (UM) 13*256kB (UM) 9*512kB (UM) 10*1024kB (UM) 0*2048kB 7*4096kB (UME) = 53656kB
[ 779.578672] Node 0 Normal: 734*4kB (UE) 478*8kB (UME) 266*16kB (UE) 163*32kB (UE) 7*64kB (U) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17064kB
[ 779.585618] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 779.589721] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 779.593765] 5660 total pagecache pages
[ 779.596059] 0 pages in swap cache
[ 779.598144] Swap cache stats: add 0, delete 0, find 0/0
[ 779.600947] Free swap = 0kB
[ 779.602837] Total swap = 0kB
[ 779.604716] 1048445 pages RAM
[ 779.606691] 0 pages HighMem/MovableOnly
[ 779.608974] 111892 pages reserved
[ 779.611075] 0 pages cma reserved
[ 779.613118] 0 pages hwpoisoned
[ 779.615117] Out of memory: Kill process 3655 (c.out) score 999 or sacrifice child
[ 779.618790] Killed process 3655 (c.out) total-vm:4168kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[ 779.625401] oom_reaper: reaped process 3655 (c.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
** 1838 printk messages dropped ** [ 912.886103] __do_fault+0x19/0xf0
[ 912.886108] __handle_mm_fault+0xb0b/0x1030
[ 912.886114] handle_mm_fault+0xf4/0x220
[ 912.886117] __do_page_fault+0x25b/0x4a0
[ 912.886120] do_page_fault+0x30/0x80
[ 912.886124] ? do_syscall_64+0xfd/0x140
[ 912.886127] page_fault+0x28/0x30
(...snipped...)
[ 912.892027] kworker/3:3 D12120 3217 2 0x00000080
[ 912.892041] Workqueue: events console_callback
[ 912.892042] Call Trace:
[ 912.892047] __schedule+0x23f/0x5d0
[ 912.892051] schedule+0x31/0x80
[ 912.892056] schedule_preempt_disabled+0x9/0x10
[ 912.892061] __mutex_lock.isra.2+0x2ac/0x4d0
[ 912.892068] __mutex_lock_slowpath+0xe/0x10
[ 912.892072] ? __mutex_lock_slowpath+0xe/0x10
[ 912.892077] mutex_lock+0x2a/0x30
[ 912.892105] vmw_fb_pan_display+0x35/0x90 [vmwgfx]
[ 912.892114] fb_pan_display+0xca/0x160
[ 912.892118] bit_update_start+0x1b/0x40
[ 912.892123] fbcon_switch+0x4a6/0x630
[ 912.892128] redraw_screen+0x15a/0x240
[ 912.892132] ? update_attr.isra.3+0x90/0x90
[ 912.892139] complete_change_console+0x3d/0xd0
[ 912.892143] change_console+0x57/0x90
[ 912.892147] console_callback+0x116/0x190
[ 912.892153] process_one_work+0x1f5/0x390
[ 912.892156] worker_thread+0x46/0x410
[ 912.892161] ? __schedule+0x247/0x5d0
[ 912.892165] kthread+0xff/0x140
[ 912.892170] ? process_one_work+0x390/0x390
[ 912.892174] ? kthread_create_on_node+0x60/0x60
[ 912.892178] ? do_syscall_64+0x13a/0x140
[ 912.892181] ret_from_fork+0x25/0x30
(...snipped...)
[ 912.934633] kworker/0:0 D12824 4263 2 0x00000080
[ 912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
[ 912.934643] Call Trace:
[ 912.934645] __schedule+0x23f/0x5d0
[ 912.934646] schedule+0x31/0x80
[ 912.934647] schedule_timeout+0x189/0x290
[ 912.934649] ? del_timer_sync+0x40/0x40
[ 912.934650] io_schedule_timeout+0x19/0x40
[ 912.934651] ? io_schedule_timeout+0x19/0x40
[ 912.934653] congestion_wait+0x7d/0xd0
[ 912.934654] ? wait_woken+0x80/0x80
[ 912.934654] shrink_inactive_list+0x3e3/0x4d0
[ 912.934656] shrink_node_memcg+0x360/0x780
[ 912.934657] ? list_lru_count_one+0x65/0x70
[ 912.934658] shrink_node+0xdc/0x310
[ 912.934658] ? shrink_node+0xdc/0x310
[ 912.934659] do_try_to_free_pages+0xea/0x370
[ 912.934660] try_to_free_pages+0xc3/0x100
[ 912.934661] __alloc_pages_slowpath+0x441/0xd50
[ 912.934663] ? ___slab_alloc+0x1b6/0x590
[ 912.934664] __alloc_pages_nodemask+0x20c/0x250
[ 912.934665] alloc_pages_current+0x65/0xd0
[ 912.934666] new_slab+0x472/0x600
[ 912.934668] ___slab_alloc+0x41b/0x590
[ 912.934685] ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934691] ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934692] __slab_alloc+0x1b/0x30
[ 912.934693] ? __slab_alloc+0x1b/0x30
[ 912.934694] kmem_cache_alloc+0x16b/0x1c0
[ 912.934699] drm_modeset_lock_all+0x1b/0xa0 [drm]
[ 912.934702] vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
[ 912.934706] vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
[ 912.934708] process_one_work+0x1f5/0x390
[ 912.934709] worker_thread+0x46/0x410
[ 912.934710] ? __schedule+0x247/0x5d0
[ 912.934711] kthread+0xff/0x140
[ 912.934712] ? process_one_work+0x390/0x390
[ 912.934713] ? kthread_create_on_node+0x60/0x60
[ 912.934714] ret_from_fork+0x25/0x30
(...snipped...)
[ 912.934765] Showing busy workqueues and worker pools:
[ 912.934785] workqueue events: flags=0x0
[ 912.934798] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 912.934801] in-flight: 3217:console_callback{79553} console_callback{79553}
[ 912.934806] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 912.934807] in-flight: 4263:vmw_fb_dirty_flush [vmwgfx]{130989} vmw_fb_dirty_flush [vmwgfx]{130989}
[ 912.934849] workqueue events_freezable_power_: flags=0x84
[ 912.934865] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 912.934866] in-flight: 4264:disk_events_workfn{131975}
[ 912.934906] workqueue writeback: flags=0x4e
[ 912.934907] pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 912.934909] in-flight: 357:wb_workfn{130154} wb_workfn{130154}
[ 912.935347] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 33 3214 237
[ 912.935351] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 29 95 3215
[ 912.935364] pool 128: cpus=0-63 flags=0x4 nice=0 hung=130s workers=3 idle: 358 356
[ 925.093875] sysrq: SysRq : Trigger a crash
[ 925.093903] BUG: unable to handle kernel NULL pointer dereference at (null)
------------------------------------------------------------
Pressing SysRq-c caused all locks to be released (didn't it?), and the console
resumed printing pending output. The {value} after each work item is the delay
in jiffies since that work was queued (as suggested at
http://lkml.kernel.org/r/201706291957.JGH39511.tQMOFSLOFJVHOF@I-love.SAKURA.ne.jp ).
Since disk_events_workfn had been stalling for 131975 jiffies (CONFIG_HZ=1000)
as of uptime = 912, disk_events_workfn started stalling at uptime = 781
(which was 2 seconds after the OOM killer/reaper last reclaimed some memory).
Since vmw_fb_dirty_flush had been stalling for 130989 jiffies,
vmw_fb_dirty_flush started stalling at uptime = 782. And
drm_modeset_lock_all() from the vmw_fb_dirty_flush work started a
GFP_KERNEL memory allocation
----------
void drm_modeset_lock_all(struct drm_device *dev)
{
	struct drm_mode_config *config = &dev->mode_config;
	struct drm_modeset_acquire_ctx *ctx;
	int ret;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (WARN_ON(!ctx))
		return;
(...snipped...)
}
----------
with par->bo_mutex mutex held at vmw_fb_dirty_flush().
----------
static void vmw_fb_dirty_flush(struct work_struct *work)
{
	struct vmw_fb_par *par = container_of(work, struct vmw_fb_par,
					      local_work.work);
	struct vmw_private *vmw_priv = par->vmw_priv;
	struct fb_info *info = vmw_priv->fb_info;
	unsigned long irq_flags;
	s32 dst_x1, dst_x2, dst_y1, dst_y2, w, h;
	u32 cpp, max_x, max_y;
	struct drm_clip_rect clip;
	struct drm_framebuffer *cur_fb;
	u8 *src_ptr, *dst_ptr;

	if (vmw_priv->suspended)
		return;

	mutex_lock(&par->bo_mutex);
(...snipped...)
	WARN_ON_ONCE(par->set_fb->funcs->dirty(cur_fb, NULL, 0, 0,
					       &clip, 1));
(...snipped...)
	mutex_unlock(&par->bo_mutex);
}
----------
Since console_callback had been stalling for 79553 jiffies, console_callback
started stalling at uptime = 833. (Thus, maybe there were OOM killer
activities a few more times, but if so they are hidden behind the
"** 1838 printk messages dropped **" message.)
vmw_fb_pan_display() from fb_pan_display() from the console_callback work was
waiting for the par->bo_mutex mutex held at vmw_fb_dirty_flush(), which cannot
make progress due to the infinite too_many_isolated() trap.
----------
static int vmw_fb_pan_display(struct fb_var_screeninfo *var,
			      struct fb_info *info)
{
	struct vmw_fb_par *par = info->par;

	if ((var->xoffset + var->xres) > var->xres_virtual ||
	    (var->yoffset + var->yres) > var->yres_virtual) {
		DRM_ERROR("Requested panning can not fit in framebuffer\n");
		return -EINVAL;
	}

	mutex_lock(&par->bo_mutex);
	par->fb_x = var->xoffset;
	par->fb_y = var->yoffset;
	if (par->set_fb)
		vmw_fb_dirty_mark(par, par->fb_x, par->fb_y, par->set_fb->width,
				  par->set_fb->height);
	mutex_unlock(&par->bo_mutex);

	return 0;
}
----------
As a result, the console was not able to print the SysRq-t output.
So, how should we avoid this problem? Of course, this problem would not
have occurred in the first place if we did not fall into the infinite
too_many_isolated() trap in shrink_inactive_list() (as proposed at
http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org ).
But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
memory allocations, because consoles should be as responsive as printk()?