[Intel-gfx] Fwd: __i915_gem_shrink / mm_find_pmd hogging CPU, then out of memory

Sam Jansen sam.jansen at starleaf.com
Wed Jun 4 16:28:07 CEST 2014


Hi Chris,

On 3 June 2014 16:12, Chris Wilson <chris at chris-wilson.co.uk> wrote:

> On Mon, Jun 02, 2014 at 02:18:14PM +0100, Sam Jansen wrote:
> >    Hello intel-gfx,
> >    I'm working on an application using VA-API for H264 encode+decode, and
> >    JPEG decode on an Atom E3815. Unfortunately we've hit what I believe is a
> >    kernel bug, and the "perf top" output is pointing at i915 DRM code.
> >    After some amount of time running my application, the system will become
> >    unresponsive (userspace applications get very little CPU, system CPU will
> >    go up to 80+%), and sometimes the system will appear out of memory for a
> >    period (the OOM killer is sometimes invoked), even though there is a lot
> >    of free memory on the system. I noticed this first on kernel 3.14.5, so I
> >    moved to "drm-intel-nightly", built on Friday (2014-05-30), from
> >    http://cgit.freedesktop.org/drm-intel. The results are the same.
> >    Using "perf top", I have gathered the following traces showing high system
> >    CPU at the time when the system was encountering this problem:
>
> It's a buffer leak in the userspace va-api application. The buffers
> appear as cached memory, they are not yet accounted against the
> applications that have a reference to them. Look at
> /sys/kernel/debug/dri/0/i915_gem_objects for a breakdown of users.
>

Thanks for taking the time to respond. I had previously ruled out buffer
leaks by using valgrind and similar to track down any user-space leaks --
VA-API buffers have user-space metadata allocated with malloc/calloc, so if
you leak these it is fairly easy to track down.

However, given the new knowledge that the memory really is associated with
my app, I used divide-and-conquer to eventually track the issue down to my
JPEG decoder. I found that due to not updating one bit of state, I was
accidentally creating/destroying the surfaces and context every frame. I've
fixed that, and my application no longer leaks "cached" kernel memory.
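
To illustrate the kind of mistake described above, here is a hypothetical
sketch in Python. It is not the actual VA-API code: FakeBackend, Decoder and
the remember_size flag are made-up names, and the real bug may have involved
a different piece of state. It only shows how failing to record one value can
turn a "create once" path into "destroy and re-create every frame".

```python
class FakeBackend:
    """Stand-in for a hardware decode API; it only counts calls."""

    def __init__(self):
        self.creates = 0
        self.destroys = 0

    def create(self, size):
        self.creates += 1
        return ("ctx", size)

    def destroy(self, ctx):
        self.destroys += 1


class Decoder:
    """Keeps one context alive and re-creates it only when the size changes."""

    def __init__(self, backend, remember_size=True):
        self.backend = backend
        self.remember_size = remember_size  # False models the forgotten update
        self.size = None
        self.ctx = None

    def decode(self, size):
        if self.ctx is None or size != self.size:
            if self.ctx is not None:
                self.backend.destroy(self.ctx)
            self.ctx = self.backend.create(size)
            if self.remember_size:
                self.size = size  # the buggy variant never records this
        return self.ctx


def creations_for(frames, remember_size):
    """Decode `frames` identically sized frames; return how many contexts were created."""
    backend = FakeBackend()
    decoder = Decoder(backend, remember_size)
    for _ in range(frames):
        decoder.decode((1280, 720))
    return backend.creates


if __name__ == "__main__":
    print("correct:", creations_for(100, True))
    print("buggy:  ", creations_for(100, False))
```

In the buggy variant every create is paired with a destroy, so the code looks
leak-free from user space even though it churns through resources each frame.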

I thought perhaps this is still a real bug, as it looks to me like my
application was cleaning up resources correctly. So I've managed to
reproduce my results using the "loadjpeg" test application distributed with
libva, with only minimal changes: looping to decode the JPEG image many
times a second, and cleaning up buffers each iteration. I've no idea if
this problem is limited to just the JPEG decoder, but it seemed the
simplest test app to hack. When I run this modified version of loadjpeg
with a ~720p image, I leak ~40M cached memory/sec, ~100 objects/sec (as
shown by i915_gem_objects).
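
For anyone wanting to reproduce the rate measurement, a minimal sketch of one
way to do it: sample the "Cached:" line of /proc/meminfo twice and divide the
difference by the interval. The function names and the 5 second interval are
assumptions for illustration, and this measures system-wide cached memory, not
the per-client breakdown that i915_gem_objects gives.

```python
import re
import time


def cached_kib(meminfo_text):
    """Return the 'Cached:' value (in kB as reported) from /proc/meminfo text."""
    # MULTILINE anchors ^ at each line start, so 'SwapCached:' is not matched.
    match = re.search(r"^Cached:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Cached:' line found")
    return int(match.group(1))


def growth_mib_per_sec(before_kib, after_kib, seconds):
    """Convert two samples taken `seconds` apart into a growth rate in MiB/s."""
    return (after_kib - before_kib) / 1024.0 / seconds


def sample_growth(interval=5.0, path="/proc/meminfo"):
    """Read the file twice, `interval` seconds apart, and return MiB/s growth."""
    with open(path) as f:
        before = cached_kib(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = cached_kib(f.read())
    return growth_mib_per_sec(before, after, interval)


if __name__ == "__main__":
    print("cached memory growth: %.1f MiB/s" % sample_growth())
```

Run it while the looping decoder is active; a steadily positive rate that
stops when the test app exits points at the app's buffers rather than
ordinary page cache activity.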

I've attached the patch in case you are interested.

As an aside, while debugging this, I hit the attached OOPS a couple of
times, while running "watch cat /sys/kernel/debug/dri/0/i915_gem_objects".

Cheers,
Sam


>
> --
> Chris Wilson, Intel Open Source Technology Centre
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
-------------- next part --------------
[12969.134772] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[12969.134935] IP: [<ffffffff81394951>] per_file_stats+0xc9/0x12d
[12969.135051] PGD 6b9fb067 PUD 5b5e1067 PMD 0 
[12969.135141] Oops: 0000 [#1] SMP 
[12969.135208] Modules linked in: lpc_ich(E) mfd_core(E) rtc_cmos(E) i2c_hid(E)
[12969.135358] CPU: 0 PID: 9578 Comm: cat Tainted: G            E 3.15.0-rc7-sl-01023-g085391259 #4
[12969.135514] Hardware name: [unprintable DMI strings]/DE3815TYKH, BIOS TYBYT10H.86A.0019.2014.0327.1516 03/27/201
[12969.135755] task: ffff8800750ccb60 ti: ffff880059db0000 task.ti: ffff880059db0000
[12969.135887] RIP: 0010:[<ffffffff81394951>]  [<ffffffff81394951>] per_file_stats+0xc9/0x12d
[12969.136039] RSP: 0018:ffff880059db1d58  EFLAGS: 00010246
[12969.136134] RAX: 0000000000000000 RBX: ffff880059db1e40 RCX: 0000000000000000
[12969.136259] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88006852c000
[12969.136385] RBP: ffff880059db1d68 R08: 000000000000000a R09: 00000000fffffff7
[12969.136511] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88006852c000
[12969.136636] R13: 00000000ffffffff R14: ffff88006852c000 R15: ffffffff81394888
[12969.136763] FS:  00007f38d16a7700(0000) GS:ffff880079200000(0000) knlGS:0000000000000000
[12969.136905] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[12969.137007] CR2: 0000000000000028 CR3: 0000000067208000 CR4: 00000000001007f0
[12969.137132] Stack:
[12969.137170]  ffff880059db1d98 0000000000000167 ffff880059db1dd8 ffffffff812b0797
[12969.137316]  ffff880059db1e40 0000ffff00000000 ffff88006713a940 ffff88006713b180
[12969.137461]  ffff880059db1e08 ffff880059db1da8 ffff880059db1f58 ffff88005b737400
[12969.137606] Call Trace:
[12969.137659]  [<ffffffff812b0797>] idr_for_each+0xac/0xd7
[12969.137758]  [<ffffffff813947fc>] i915_gem_object_info+0x405/0x491
[12969.137874]  [<ffffffff811a849c>] seq_read+0x161/0x317
[12969.137970]  [<ffffffff8118d147>] vfs_read+0x95/0xf0
[12969.138063]  [<ffffffff8118d8e0>] SyS_read+0x46/0x79
[12969.138156]  [<ffffffff81676253>] tracesys+0xe1/0xe6
[12969.138245] Code: 53 20 eb 15 48 8b 92 e0 01 00 00 48 85 d2 74 3b 48 8b 3b 48 39 7a 10 74 32 48 8b 40 68 48 83 e8 68 eb a5 49 8b 44 24 08 4c 89 e7 <48> 8b 70 28 48 81 c6 90 79 00 00 e8 94 e8 00 00 84 c0 74 2b 49 
[12969.138849] RIP  [<ffffffff81394951>] per_file_stats+0xc9/0x12d
[12969.138959]  RSP <ffff880059db1d58>
[12969.139022] CR2: 0000000000000028
[12969.178528] ---[ end trace 8403dc25eeb2b354 ]---
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tinyjpeg.patch
Type: text/x-patch
Size: 1832 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/intel-gfx/attachments/20140604/45934f0f/attachment.bin>