[Bug 110848] New: [BXT/APL] Everything using GPU gets stuck after running parallel Media loads after 3D benchmarks

Thu Jun 6 12:22:26 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=110848

            Bug ID: 110848
           Summary: [BXT/APL] Everything using GPU gets stuck after
                    running parallel Media loads after 3D benchmarks
           Product: DRI
           Version: DRI git
          Hardware: Other
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: eero.t.tamminen at intel.com
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

Setup 1:
* HW: BXT J4205
* OS: ClearLinux
* kernel: drm-tip compiled from git
* media: MediaSDK and its deps compiled from git (GitHub)
* FFmpeg: month old Git version: 2019-05-08 c636dc9819 "libavfilter/dnn: add
more data type support for dnn model input"
* GUI: Weston / Wayland / Mesa compiled from git

Setup 2 (differences from setup 1):
* OS: Ubuntu 18.04
* FFmpeg: latest git version
* GUI: Unity from Ubuntu with X server & Mesa compiled from git

Test-case:
1. Run 3D benchmarks
2. Do several runs of 50 parallel instances of the following H264 transcode
operation:
  ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i
1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K -vf
scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y output.h264
3. Do a single non-parallel run of the above
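
Steps 2 and 3 could be scripted roughly as below.  This is only a sketch, not
the harness used for this report: the names TRANSCODE_CMD and run_parallel, and
the timeout handling, are assumptions made for illustration.

```python
import subprocess

# The transcode command from step 2, split into an argument list.
TRANSCODE_CMD = [
    "ffmpeg", "-hwaccel", "qsv", "-qsv_device", "/dev/dri/renderD128",
    "-c:v", "h264_qsv", "-i", "1280x720p_29.97_10mb_h264_cabac.264",
    "-c:v", "h264_qsv", "-b:v", "800K",
    "-vf", "scale_qsv=w=352:h=240,fps=15",
    "-compression_level", "4", "-frames", "2400", "-y", "output.h264",
]


def run_parallel(cmd, instances, timeout=None):
    """Start `instances` copies of cmd at once and return their exit codes.

    A copy whose wait exceeds `timeout` seconds is reported as None, which
    is how a freeze would show up.  It is not waited for any further,
    since a process stuck in D state may never exit.
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(instances)
    ]
    codes = []
    for proc in procs:
        try:
            codes.append(proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            codes.append(None)
    return codes
```

Calling run_parallel(TRANSCODE_CMD, 50, timeout=600) a few times, followed by
run_parallel(TRANSCODE_CMD, 1, timeout=600), would correspond to steps 2 and 3.
The 600 second timeout is an arbitrary choice.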

Expected outcome:
* Everything works fine, as with a month-older 5.1 drm-tip kernel, or with GEN9
Core devices

Actual outcome:
* FFmpeg freezes at step 3
* No errors or warnings in dmesg
* Anything else trying to use GPU (even just glxinfo, clinfo) freezes
* Some processes not using GPU also freeze when started

The first time I saw the Media tests freeze was around the 28th of May, so the
regression may have come in between the following drm-tip commits:
* 2019-05-27 14:42:27 e8f06c34fa: drm-tip: 2019y-05m-27d-14h-41m-23s UTC
integration manifest
* 2019-05-28 15:48:05 8991a80f85: drm-tip: 2019y-05m-28d-15h-47m-22s UTC
integration manifest

Which Media case freezes differs a bit, e.g. based on FFmpeg version, and from
week to week.  It's not always the same one, but it's always a single-instance
test that freezes after the parallel tests have seemingly [1] finished.  Both
FFmpeg QSV and MediaSDK sample application H264 transcode cases have frozen.
Which one freezes differs between setup 1 and setup 2.

If I run just the Media cases after boot, I don't see the freeze, so there's
some interaction with what the 3D benchmarks do.

[1] With last night's drm-tip kernel, "ps" output for processes in D state, and
a few other relevant processes, is the following:
---------------------------------------------
...
   38 ?        DN     0:00 [khugepaged]
...
  396 tty7     Ssl+   5:08 /opt/install/bin/weston --tty=7 --idle-time=0
--xwayland
...
  545 tty7     Dl+    0:00 Xwayland :0 -rootless -listen 60 -listen 61 -wm 62
-terminate
...
 8444 ?        D      0:06 [ffmpeg]
 8447 ?        Zl     0:06 [ffmpeg] <defunct>
 8448 ?        Zl     0:06 [ffmpeg] <defunct>
 8449 ?        Zl     0:06 [ffmpeg] <defunct>
 8451 ?        Zl     0:06 [ffmpeg] <defunct>
 8453 ?        D      0:06 [ffmpeg]
 8483 ?        Zl     0:06 [ffmpeg] <defunct>
 8497 ?        Zl     0:06 [ffmpeg] <defunct>
 8512 ?        Zl     0:06 [ffmpeg] <defunct>
 8525 ?        Zl     0:06 [ffmpeg] <defunct>
 8531 ?        Zl     0:06 [ffmpeg] <defunct>
 8546 ?        Zl     0:06 [ffmpeg] <defunct>
 8559 ?        Zl     0:06 [ffmpeg] <defunct>
 8574 ?        Zl     0:06 [ffmpeg] <defunct>
 8585 ?        Zl     0:06 [ffmpeg] <defunct>
 8603 ?        Zl     0:06 [ffmpeg] <defunct>
 8623 ?        D      0:06 [ffmpeg]
 8642 ?        Zl     0:06 [ffmpeg] <defunct>
 8650 ?        D      0:06 [ffmpeg]
 8678 ?        Zl     0:06 [ffmpeg] <defunct>
 8697 ?        Zl     0:06 [ffmpeg] <defunct>
 8704 ?        D      0:06 [ffmpeg]
 8711 ?        D      0:06 [ffmpeg]
 8723 ?        D      0:06 [ffmpeg]
 8733 ?        D      0:06 [ffmpeg]
 8756 ?        Zl   143:22 [ffmpeg] <defunct>
 8793 ?        Zl     0:06 [ffmpeg] <defunct>
 8822 ?        Zl     0:06 [ffmpeg] <defunct>
 8837 ?        Zl     0:06 [ffmpeg] <defunct>
 8845 ?        D      0:06 [ffmpeg]
 8851 ?        Zl     0:06 [ffmpeg] <defunct>
 8858 ?        Zl     0:06 [ffmpeg] <defunct>
 8871 ?        Zl     0:06 [ffmpeg] <defunct>
 8893 ?        Zl     0:06 [ffmpeg] <defunct>
 8942 ?        Zl     0:06 [ffmpeg] <defunct>
 8958 ?        D      0:06 [ffmpeg]
 8983 ?        Zl     0:06 [ffmpeg] <defunct>
 8991 ?        D      0:06 [ffmpeg]
 8999 ?        Zl     0:06 [ffmpeg] <defunct>
 9013 ?        D      0:06 [ffmpeg]
 9017 ?        D      0:06 [ffmpeg]
 9035 ?        Zl     0:06 [ffmpeg] <defunct>
 9058 ?        Zl     0:06 [ffmpeg] <defunct>
 9071 ?        Zl     0:06 [ffmpeg] <defunct>
 9117 ?        Zl     0:06 [ffmpeg] <defunct>
 9122 ?        D      0:06 [ffmpeg]
 9152 ?        Zl     0:06 [ffmpeg] <defunct>
 9165 ?        Zl     0:06 [ffmpeg] <defunct>
 9180 ?        D      0:06 [ffmpeg]
 9250 ?        D      0:06 [ffmpeg]
 9758 ?        Ds     0:00 ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128
-c:v h264_qsv -i 1280x720p_29.97_10mb_h264_cabac.264 -c:v h264_qsv -b:v 800K
-vf scale_qsv=w=352:h=240,fps=15 -compression_level 4 -frames 2400 -y
output.h264
...
10184 ?        D      0:00 top
...
---------------------------------------------
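
A listing like the one above can be summarized per process state with a small
helper.  This is a sketch only: the function name tally_states is made up for
illustration, and it assumes the "PID TTY STAT TIME COMMAND" column layout
shown above.

```python
from collections import Counter


def tally_states(ps_output, name="ffmpeg"):
    """Count STAT values of ps lines whose command mentions `name`.

    Only the first letter of STAT is counted (D, Z, S, ...), so "Ds" and
    "DN" both count as D.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        # Skip "..." separators, wrapped continuation lines and any header.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        if name in fields[4]:
            counts[fields[2][0]] += 1
    return counts
```

Feeding it the saved "ps" output gives the number of D (uninterruptible sleep)
and Z (zombie) ffmpeg entries without counting them by hand.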

50 ffmpeg instances can use quite a few GEM objects (and FFmpeg QSV uses more
of them than MediaSDK alone, or FFmpeg VAAPI), which can put memory pressure on
the system, so the kernel "khugepaged" thread being in D state is suspicious.
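
To see what the D state processes are waiting on, /proc can be scanned for
them together with their "wchan" (the kernel symbol they sleep in, when the
kernel exposes it).  The sketch below is not something that was run for this
report, and the function names are made up for illustration.  It looks at
processes only, not at their individual threads.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    left, _, right = stat_text.rpartition(")")
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm, wchan) for processes in uninterruptible sleep (D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            # The process exited, or the file is not readable; skip it.
            continue
        found.append((int(entry), comm, wchan))
    return found
```

If khugepaged and the ffmpeg processes turned out to sleep in the same
function, that would support the memory pressure theory, but that is
speculation until checked.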

Other notes:
* I also had a process running in the background which uses ftrace uprobes to
track frame update functions from the 3D and Media processes, and tracks
"i915:intel_gpu_freq_change" events
* Killing that process after the freeze caused ssh to stop working, so it's
possible that there's some connection to ftrace
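
For reference, a trace event such as "i915:intel_gpu_freq_change" is normally
switched on and off through its "enable" file in tracefs.  The sketch below is
not the tracking process mentioned above: the name set_event is made up, and
the tracefs mount point is an assumption (older setups expose it under
/sys/kernel/debug/tracing instead).

```python
import os

# Assumed tracefs mount point; adjust if tracefs is mounted elsewhere.
TRACEFS = "/sys/kernel/tracing"


def set_event(event, enabled, tracefs=TRACEFS):
    """Enable or disable a trace event given as "system:name".

    Returns the path of the "enable" file that was written.
    """
    system, name = event.split(":", 1)
    path = os.path.join(tracefs, "events", system, name, "enable")
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path
```

Running set_event("i915:intel_gpu_freq_change", False) as root before
repeating the test would be one way to check whether the freeze still happens
with that event disabled.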
