[Bug 110848] Everything using GPU gets stuck after running+killing parallel Media loads (after running 3D benchmarks)

Fri Jun 7 08:07:12 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=110848

Eero Tamminen <eero.t.tamminen at intel.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|[BXT/APL] Everything using  |Everything using GPU gets
                   |GPU gets stuck after        |stuck after running+killing
                   |running parallel Media      |parallel Media loads (after
                   |loads after 3D benchmarks   |running 3D benchmarks)

--- Comment #3 from Eero Tamminen <eero.t.tamminen at intel.com> ---
(In reply to Eero Tamminen from comment #0)
> Test-case:
> 1. Run 3D benchmarks
> 2. Do several runs of 50 parallel instances of following H264 transcode
> operations:
>   ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i
> 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf
> scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264

And on each round, kill the whole process group with the 50 parallel
transcoders after few minutes.

> 3. Do a single non-parallel run of above
> 
> Expected outcome:
> * Everything works fine, like with month older 5.1 drm-tip kernel, or with
> GEN9 Core devices

I'm pretty sure the reason this happens only on BXT, is it being much slower
than core devices, and this test often hitting specified timeout before it
finishes.  At timeout, the whole transcoding process group is killed.

=> I think the trigger for this issue is killing of large number of media
processes, and there being some issue / race / deadlock in the kernel media
resource de-allocation path.

=> I'll try to get sysrq task info / bt for them, after I've reproduced the
issue again.

> What Media case freezes, differs a bit e.g. based on FFmpeg version, and
> from week to week.  It's not always the same, but it's always single
> instance test freezing after parallel tests have seemingly [1] finished.

The reason why it's the single process that seems to freeze after parallel run,
is that when the parallel transcoding process group is killed, only the
parallelization wrapper actually dies because the transcoding processes are in
D state.  Because the wrapper died, test runner thinks that test finished, and
starts next one.

And as the single transcoding process directly invoked after them freezes and
refuses to die, it looks like that's where it froze, although freeze happened
already in previous test.

PS: while set of 3D tests run before media tests in setup 1 and setup 2 differ,
some of the tests in setup 1 are Wayland/EGL variants of the X/GLX tests in
setup 2.  Those are GfxBench 5.0 and SynMark2 v7.  X based GPU memory bandwidth
tests are same in both setups.  Rest of the tests differ, but are also simpler,
so most likely they don't matter for this bug.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are the QA Contact for the bug.
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-gfx-bugs/attachments/20190607/16a1976a/attachment.html>