[Intel-gfx] Fwd: __i915_gem_shrink / mm_find_pmd hogging CPU, then out of memory
Sam Jansen
sam.jansen at starleaf.com
Mon Jun 2 15:18:14 CEST 2014
Hello intel-gfx,
I'm working on an application that uses VA-API for H.264 encode+decode and JPEG
decode on an Atom E3815. Unfortunately we've hit what I believe is a kernel
bug, and the "perf top" output points at i915 DRM code.
After the application has been running for a while, the system becomes
unresponsive (userspace applications get very little CPU, system CPU climbs
to 80+%), and at times the system appears to be out of memory (the OOM
killer is occasionally invoked), even though there is plenty of free memory
on the system. I first noticed this on kernel 3.14.5, so I moved to
"drm-intel-nightly", built on Friday (2014-05-30), from
http://cgit.freedesktop.org/drm-intel. The results are the same.
Using "perf top", I have gathered the following traces showing high system
CPU at the time when the system was encountering this problem:
Samples: 15M of event 'cycles', Event count (approx.): 26497228423
+ 38.67% [kernel] [k] __i915_gem_shrink
+ 14.89% [kernel] [k] list_move_tail
+ 5.97% [kernel] [k] i915_gem_shrinker_count
+ 3.62% [kernel] [k] i915_gem_obj_is_pinned
+ 3.56% [kernel] [k] clflushopt
+ 3.02% perf [.] rb_next
+ 1.16% perf [.] __sort_chain_graph_rel
+ 1.10% [kernel] [k] clear_page_c_e
+ 0.94% [kernel] [k] drm_clflush_page
+ 0.83% perf [.] rb_first
+ 0.79% perf [.] dso__find_symbol
+ 0.71% [kernel] [k] delay_tsc
+ 0.44% perf [.] append_chain_children
+ 0.34% perf [.] rb_insert_color
+ 0.31% i965_drv_video.so [.] i965_post_processing_internal
+ 0.28% [kernel] [k] mm_find_pmd
+ 0.27% [kernel] [k] xhci_irq
+ 0.26% perf [.] hists__output_resort
+ 0.26% i965_drv_video.so [.] gen7_pp_avs_set_block_parameter
+ 0.24% [kernel] [k] __vlv_force_wake_get
+ 0.23% libc-2.18.so [.] _int_malloc
+ 0.23% [kernel] [k] __sg_page_iter_next
+ 0.22% [kernel] [k] __wake_up_bit
+ 0.22% [kernel] [k] do_raw_spin_lock
+ 0.22% [kernel] [k] get_page_from_freelist
+ 0.21% [kernel] [k] __page_check_address
+ 0.20% perf [.] sort__dso_cmp
+ 0.19% [kernel] [k] __acct_update_integrals
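In this first trace almost 40% of cycles are in __i915_gem_shrink and ~15%
in list_move_tail. My (possibly naive) reading of the shrinker is that every
invocation walks the whole object list, and when nearly everything is pinned
or unpurgeable the walk frees nothing, so direct reclaim just keeps
re-running it. To illustrate the pattern I mean - plain userspace C with
made-up names, not the driver code:

/* Toy model of a shrinker scanning a long list of objects, almost all
 * of which are pinned: each reclaim attempt is a full O(n) walk that
 * frees nothing, so repeated reclaim attempts burn CPU.  Illustration
 * only; this is not the i915 code. */
#include <stdio.h>
#include <stdlib.h>

struct obj {
    struct obj *next;
    int pinned;          /* cannot be reclaimed while set */
    size_t pages;        /* backing pages still allocated */
};

static size_t shrink_pass(struct obj *head, size_t target)
{
    struct obj *o;
    size_t freed = 0;

    for (o = head; o && freed < target; o = o->next) {
        if (o->pinned || o->pages == 0)
            continue;    /* skipped, but the walk still costs cycles */
        freed += o->pages;
        o->pages = 0;
    }
    return freed;
}

int main(void)
{
    const size_t n = 1 << 20;        /* one million object stand-ins */
    size_t i, freed = 0;
    int attempt;
    struct obj *objs = calloc(n, sizeof(*objs));

    if (!objs)
        return 1;
    for (i = 0; i < n; i++) {
        objs[i].next = (i + 1 < n) ? &objs[i + 1] : NULL;
        objs[i].pinned = 1;          /* everything pinned, as in dmesg below */
        objs[i].pages = 1;
    }
    for (attempt = 0; attempt < 100; attempt++)
        freed += shrink_pass(objs, 1000);   /* 100 futile full walks */
    printf("freed %zu pages after 100 passes\n", freed);
    free(objs);
    return 0;
}

Compiled with gcc -O2 this spends essentially all its time in shrink_pass()
and prints "freed 0 pages after 100 passes", which looks a lot like what the
profile above shows the kernel doing.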
Maybe 5-10 seconds later, the "perf top" output had shifted to the following:
Samples: 15M of event 'cycles', Event count (approx.): 11020159263
+ 7.31% perf [.] rb_next
+ 4.72% [kernel] [k] mm_find_pmd
+ 3.12% perf [.] __sort_chain_graph_rel
+ 2.86% [kernel] [k] __page_check_address
+ 2.67% [kernel] [k] super_cache_count
+ 2.15% perf [.] rb_first
+ 2.12% [kernel] [k] vma_interval_tree_iter_next
+ 1.88% [kernel] [k] do_raw_spin_lock
+ 1.56% [kernel] [k] clflushopt
+ 1.48% perf [.] dso__find_symbol
+ 1.28% perf [.] append_chain_children
+ 1.28% [kernel] [k] vma_interval_tree_subtree_search
+ 1.15% [kernel] [k] rmap_walk
+ 1.10% [kernel] [k] shrink_slab_node
+ 1.05% [kernel] [k] page_referenced_one
+ 0.88% perf [.] rb_insert_color
+ 0.87% [kernel] [k] shrink_slab
+ 0.86% [kernel] [k] clear_page_c_e
+ 0.85% [kernel] [k] grab_super_passive
+ 0.74% [kernel] [k] __vma_address
+ 0.69% [kernel] [k] __wake_up_bit
+ 0.69% [kernel] [k] mem_cgroup_page_lruvec
+ 0.68% [kernel] [k] get_page_from_freelist
+ 0.65% [kernel] [k] page_referenced
+ 0.65% [kernel] [k] isolate_lru_pages.isra.49
+ 0.62% [kernel] [k] shrink_page_list
+ 0.61% perf [.] hists__output_resort
+ 0.61% [kernel] [k] list_lru_count_node
+ 0.59% [kernel] [k] xhci_irq
+ 0.52% [kernel] [k] vma_last_pgoff
+ 0.49% [kernel] [k] __mod_zone_page_state
+ 0.48% libm-2.18.so [.] __ceil_sse41
+ 0.47% [kernel] [k] _raw_spin_lock
The system was still largely unresponsive at this time - e.g. "perf top",
"top", or even the shell might take seconds to respond to user input.
Eventually, "perf" was killed by the OOM killer. dmesg has some
interesting lines in it ("Purging GPU memory"). Please see the dmesg and
vmstat output at the end of this email.
After running this a couple of times, I noticed the following trends in
"free" and "vmstat" output: free memory starts at 430M, and cached at 828M.
Eventually free falls to 64M and cached rises to >1000M (total RAM is 2GB;
as you will see in the OOM output in dmesg, only a small amount of it is
in use by userspace apps). Around this point, vmstat shows "bi" (blocks
received from block devices) climbing from 0 to a few thousand per second,
as does system CPU%, from 16-17% to 70-80%.
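Since GEM objects are shmem-backed, I suspect the growing "cached" figure is
mostly shmem that reclaim cannot actually drop - the OOM report below shows
shmem:256101 pages (~1GB), which matches. Next time I reproduce this I can
confirm it by sampling /proc/meminfo; a quick sketch, assuming the standard
Cached:/Shmem: fields:

/* Sample Cached and Shmem from /proc/meminfo once a second.  If the
 * "cached" growth tracks Shmem, it is shmem-backed memory (e.g. GEM
 * objects) rather than droppable page cache.  Sketch only. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char line[128];
    long cached, shmem;
    FILE *f;

    for (;;) {
        f = fopen("/proc/meminfo", "r");
        if (!f)
            return 1;
        cached = shmem = -1;
        while (fgets(line, sizeof(line), f)) {
            /* only the matching line updates each value */
            sscanf(line, "Cached: %ld kB", &cached);
            sscanf(line, "Shmem: %ld kB", &shmem);
        }
        fclose(f);
        printf("cached=%ld kB  shmem=%ld kB\n", cached, shmem);
        sleep(1);
    }
}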
I would be happy to retest with a different kernel and/or config, or enable
additional debugging if that helps.
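For example, something like the following could log the driver's own object
accounting alongside vmstat while I reproduce (a sketch; it assumes debugfs
is mounted at /sys/kernel/debug and that this kernel exposes
dri/0/i915_gem_objects):

/* Dump the i915 object accounting from debugfs every 5 seconds so the
 * driver's view of pinned/purgeable memory can be lined up with vmstat.
 * The debugfs path is an assumption about this particular setup. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    size_t n;
    FILE *f;

    for (;;) {
        f = fopen("/sys/kernel/debug/dri/0/i915_gem_objects", "r");
        if (!f) {
            perror("i915_gem_objects");
            return 1;
        }
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            fwrite(buf, 1, n, stdout);   /* copy the report to stdout */
        fclose(f);
        puts("----");                    /* separator between samples */
        sleep(5);
    }
}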
Thanks,
Sam
vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so     bi   bo    in    cs us sy id wa st
 0  0      0 376696  12580  860280    0    0      0    0  4013   765 21 16 26 36  0
 1  1      0 375864  12580  860760    0    0      0    0  4050   772 21 16 28 34  0
 0  1      0 374908  12580  861304    0    0      0    0  3991   762 21 16 27 36  0
 0  0      0 374232  12584  861784    0    0      0    0  4039   776 21 15 28 36  0
 0  1      0 373292  12584  862368    0    0      0    0  4016   775 22 17 27 35  0
... snip ...
 0  0      0  65264    304 1030056    0    0      0    0  4134   794 24 16 26 34  0
 0  1      0  64440    312 1030536    0    0      0    0  4258   778 26 16 26 32  0
 4  0      0  68180    312 1025992    0    0      0    0  6185   639 24 42 13 21  0
 4  1      0  71172     64 1024016    0    0   1928    0  8010   590 11 78  0 11  0
 2  0      0  68768     60 1024792    0    0   3440    0 11010   787 18 80  0  2  0
 4  0      0  73456     56 1023640    0    0   3108    0  9140   807 18 78  0  5  0
 2  0      0  71704     64 1025916    0    0   2748    0  8752   674 20 79  0  1  0
 3  0      0  74720     64 1022316    0    0   3136    0  8530   756 16 74  0 10  0
 2  2      0  81592     64 1015584    0    0   4096    0 12065   945  7 84  0  9  0
 2  0      0  81896     56 1015332    0    0   4860    0 11592  1008 13 83  0  4  0
 0  1      0  72244     56 1025908    0    0   3360    0  9384   912 15 80  0  5  0
 0  1      0  73580     52 1025928    0    0   2400    0  8276   916  8 80  0 13  0
 0  2      0  72696     60 1026316    0    0   4992    0 10721  1172  6 84  0 10  0
 2  3      0  78844     64 1021024    0    0   5988    0  8707   920  6 84  0 10  0
 1  2      0  81328     60 1019996    0    0   5020    0  7967  1194  8 75  0 17  0
 3  0      0  73288     64 1028360    0    0   5980    0  7824  1560 17 59  0 24  0
 3  2      0  80476     60 1021044    0    0   4992    0  8309  1548 22 61  2 16  0
 0  1      0  78924     68 1026952    0    0   2792    0  7319  1472 30 45  2 23  0
 0  0      0  68908    632 1035644    0    0   1144    0  4421   853 27 17 22 34  0
 0  0      0  67596    640 1036700    0    0     72   28  4374   816 26 17 25 32  0
 1  1      0  66912    640 1037252    0    0      0    0  4293   761 27 16 25 33  0
 0  1      0  65956    640 1037760    0    0      0    0  4195   767 26 16 26 32  0
 1  1      0  65248    648 1038320    0    0      0   12  3590   799 18 15 27 39  0
 0  0      0  64564    648 1038784    0    0      0    0  2743   787  7 15 33 45  0
 0  0      0  63856    648 1039312    0    0      0    8  2841   819  9 16 33 43  0
 0  1      0  68588     60 1034984    0    0   2260    0  6159  1508 12 48  5 36  0
 0  2      0  78016     64 1024456    0    0   7468    0  7740  1715 24 50  0 26  0
 1  1      0  65312     56 1036944    0    0  11452   12  6577  1876 11 48  0 42  0
 0  3      0  73876     52 1027820    0    0  17108    0  6994  2136  8 49  0 43  0
 4  0      0  66688     52 1033736    0    0  15624   12  7419  2047 23 47  0 31  0
 0  2      0  73032     56 1026724    0    0  27480    0  6056  2449 15 32  0 53  0
 0  1      0  75000     52 1024268    0    0  22752    0  5869  2336  6 39  1 54  0
 0  2      0  71720     52 1025556    0    0  35008    0  4508  2233  2 25  0 73  0
 0  3      0  73704     52 1022796    0    0  34648    4  4107  1911  6 21  0 74  0
 1  2      0  70968     52 1023524    0    0  36548    4  5040  2103 15 21  0 64  0
 0  3      0  66632     68 1026176    0    0  41928   44  5139  2641  2 24  0 74  0
 0  2      0  67932     52 1022424    0    0  38676    4  4604  2292  3 26  0 71  0
 2  2      0  64516     52 1024708    0    0  44780    0  5345  2139 17 24  0 60  0
 1  4      0  64008     52 1023216    0    0  50920    0  5242  2535  8 26  0 65  0
 0  2      0  63256     52 1021840    0    0  35436    4  4448  2250  4 26  0 70  0
 0  1      0  60516     52 1022780    0    0  33648    0  4557  2347  4 26  0 70  0
 4  2      0  56156     52 1022756    0    0  46604    0  5921  2769 11 27  0 62  0
 2  1      0 345644    472 1045888    0    0  13064   16  1973  1627  6 21  8 64  0
 0  1      0 341956   1428 1048312    0    0      0    0   602   765  6 14 35 45  0
 0  1      0 341388   1428 1048856    0    0      0    0   600   766  6 14 36 44  0
dmesg:
[244227.556881] Purging GPU memory, 2 bytes freed, 8429568 bytes still pinned.
[244227.557018] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.557129] perf invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[244227.557135] perf cpuset=/ mems_allowed=0
[244227.557146] CPU: 0 PID: 9481 Comm: perf Tainted: G W E 3.15.0-rc7-sl-01023-g085391259 #4
[244227.557151] Hardware name: (unprintable DMI strings)/DE3815TYKH, BIOS TYBYT10H.86A.0019.2014.0327.1516 03/27/2014
[244227.557155] 0000000000000000 ffff88007014d968 ffffffff81669271 ffff88005adae480
[244227.557163] ffff88007014d9f0 ffffffff816665fd 0000000000015d40 0001456a0b81c281
[244227.557169] ffff88007014d9d0 ffffffff813a4180 000000000080a000 01ffffff8114d2f8
[244227.557176] Call Trace:
[244227.557190] [<ffffffff81669271>] dump_stack+0x45/0x56
[244227.557197] [<ffffffff816665fd>] dump_header+0x7a/0x1dc
[244227.557206] [<ffffffff813a4180>] ? i915_gem_shrinker_oom+0x15e/0x18d
[244227.557215] [<ffffffff81141ad9>] oom_kill_process+0x6d/0x2f7
[244227.557221] [<ffffffff811422da>] out_of_memory+0x418/0x431
[244227.557229] [<ffffffff81146a25>] __alloc_pages_nodemask+0x646/0x7ad
[244227.557239] [<ffffffff81172124>] alloc_pages_current+0xc6/0xe1
[244227.557247] [<ffffffff8113ed0d>] __page_cache_alloc+0x85/0x90
[244227.557254] [<ffffffff81140d13>] filemap_fault+0x239/0x3a8
[244227.557262] [<ffffffff8115cfda>] __do_fault+0x3b/0x89
[244227.557269] [<ffffffff8115eb43>] do_read_fault.isra.73+0x171/0x1f5
[244227.557276] [<ffffffff8115ff93>] handle_mm_fault+0x3eb/0xd58
[244227.557284] [<ffffffff81132493>] ? perf_event_aux_ctx+0x42/0x59
[244227.557293] [<ffffffff8167241d>] __do_page_fault+0x2aa/0x39b
[244227.557302] [<ffffffff8110e295>] ? acct_account_cputime+0x1c/0x1e
[244227.557309] [<ffffffff810b9b17>] ? account_system_time+0x102/0x10b
[244227.557318] [<ffffffff8103bf53>] ? native_sched_clock+0x35/0x37
[244227.557325] [<ffffffff8110e295>] ? acct_account_cputime+0x1c/0x1e
[244227.557342] [<ffffffff810b9a10>] ? account_user_time+0x46/0x4b
[244227.557349] [<ffffffff810b9d9f>] ? vtime_account_user+0x46/0x4b
[244227.557356] [<ffffffff81672549>] do_page_fault+0x3b/0x56
[244227.557364] [<ffffffff8166f7a2>] page_fault+0x22/0x30
[244227.557368] Mem-Info:
[244227.557372] Node 0 DMA per-cpu:
[244227.557377] CPU 0: hi: 0, btch: 1 usd: 0
[244227.557381] Node 0 DMA32 per-cpu:
[244227.557385] CPU 0: hi: 186, btch: 31 usd: 0
[244227.557395] active_anon:130938 inactive_anon:256046 isolated_anon:0
[244227.557395] active_file:13 inactive_file:72 isolated_file:0
[244227.557395] unevictable:0 dirty:0 writeback:1 unstable:0
[244227.557395] free:13132 slab_reclaimable:25378 slab_unreclaimable:48246
[244227.557395] mapped:3895 shmem:256101 pagetables:993 bounce:0
[244227.557395] free_cma:0
[244227.557406] Node 0 DMA free:7876kB min:368kB low:460kB high:552kB
active_anon:6736kB inactive_anon:632kB active_file:4kB inactive_file:84kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15924kB
managed:15840kB mlocked:0kB dirty:0kB writeback:0kB mapped:200kB
shmem:632kB slab_reclaimable:96kB slab_unreclaimable:132kB kernel_stack:8kB
pagetables:36kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:148 all_unreclaimable? yes
[244227.557420] lowmem_reserve[]: 0 1877 1877 1877
[244227.557427] Node 0 DMA32 free:44652kB min:44684kB low:55852kB
high:67024kB active_anon:517016kB inactive_anon:1023552kB active_file:48kB
inactive_file:204kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1972064kB managed:1925928kB mlocked:0kB dirty:0kB writeback:4kB
mapped:15380kB shmem:1023772kB slab_reclaimable:101416kB
slab_unreclaimable:192852kB kernel_stack:704kB pagetables:3936kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:436
all_unreclaimable? yes
[244227.557440] lowmem_reserve[]: 0 0 0 0
[244227.557447] Node 0 DMA: 3*4kB (UE) 20*8kB (EM) 16*16kB (EM) 9*32kB (EM)
4*64kB (UEM) 6*128kB (UEM) 4*256kB (UEM) 4*512kB (UE) 1*1024kB (E) 1*2048kB
(R) 0*4096kB = 7884kB
[244227.557476] Node 0 DMA32: 5553*4kB (UEM) 687*8kB (UEM) 349*16kB (UEM)
48*32kB (UEM) 106*64kB (UEM) 8*128kB (EM) 0*256kB 0*512kB 0*1024kB 1*2048kB
(R) 0*4096kB = 44684kB
[244227.557504] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
[244227.557508] 256203 total pagecache pages
[244227.557511] 496997 pages RAM
[244227.557514] 0 pages HighMem/MovableOnly
[244227.557517] 11534 pages reserved
[244227.557520] 0 pages hwpoisoned
[244227.557523] [ pid ]   uid  tgid total_vm   rss nr_ptes swapents oom_score_adj name
[244227.557538] [  429]     0   429    10231   202      23        0         -1000 udevd
[244227.557545] [ 1264]     0  1264     3705  1728      10        0             0 dhclient
[244227.557552] [ 1432]     0  1432    65700   197      30        0             0 rsyslogd
[244227.557558] [ 1477]     0  1477     5911    53      17        0             0 cron
[244227.557575] [ 1498]   102  1498     8329   147      21        0             0 ntpd
[244227.557582] [ 1513]     0  1513    13743   167      29        0         -1000 sshd
[244227.557589] [ 1544]     0  1544     5942   134      15        0             0 nginx
[244227.557595] [ 1547] 65534  1547     6042   236      15        0             0 nginx
[244227.557601] [ 1548] 65534  1548     6042   240      15        0             0 nginx
[244227.557607] [ 1550]     0  1550     3721    39      13        0             0 getty
[244227.557614] [ 1551]     0  1551    15263   111      35        0             0 login
[244227.557620] [ 1552]     0  1552     3721    40      12        0             0 getty
[244227.557627] [ 1553]     0  1553     3721    41      12        0             0 getty
[244227.557633] [ 1554]     0  1554     3721    39      12        0             0 getty
[244227.557639] [ 1555]     0  1555     3721    41      13        0             0 getty
[244227.557646] [ 1564]     0  1564     4459    53      14        0             0 bash
[244227.557652] [ 1567]     0  1567     1078    29       7        0             0 startx
[244227.557658] [ 1584]     0  1584     3972    37      13        0             0 xinit
[244227.557664] [ 1585]     0  1585    36171  1275      69        0             0 Xorg
[244227.557670] [ 1590]     0  1590     1078    25       7        0             0 sh
[244227.557677] [ 1591]     0  1591     5541    67      17        0             0 dwm
[244227.557683] [ 1642]     0  1642    18136   245      41        0             0 xterm
[244227.557689] [ 1644]     0  1644     4527   131      15        0             0 bash
[244227.557696] [ 2756]     0  2756    20626   219      44        0             0 sshd
[244227.557714] [ 3812]     0  3812     4528   141      13        0             0 bash
[244227.557721] [ 9472]     0  9472   102030 78463     180        0             0 perf
[244227.557727] [24527]     0 24527    81704 49573     152        0             0 Main_(called_sl
[244227.557733] [24556]     0 24556    20626   233      43        0             0 sshd
[244227.557739] [24558]     0 24558     2375    76      10        0             0 bash
[244227.557746] [ 2351]     0  2351     2375    77      10        0             0 bash
[244227.557753] [ 2352]     0  2352     2120    33      10        0             0 vmstat
[244227.557759] [ 2353]     0  2353     1063    19       8        0             0 tail
[244227.557764] Out of memory: Kill process 9472 (perf) score 157 or sacrifice child
[244227.557769] Killed process 9472 (perf) total-vm:408120kB, anon-rss:313852kB, file-rss:0kB
[244227.558083] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560227] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560471] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560666] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.560859] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561051] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561227] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561419] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561612] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561806] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.561983] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562223] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562420] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562611] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562804] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.562979] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.
[244227.563170] Purging GPU memory, 0 bytes freed, 8429568 bytes still pinned.