[Bug 103025] [GM45] loss of driver acceleration after some time (bisected)

Tue May 15 04:08:29 UTC 2018

https://bugs.freedesktop.org/show_bug.cgi?id=103025

--- Comment #68 from Adric Blake <promarbler14 at gmail.com> ---
I have found yet another way to trigger this bug, and this time it's much
faster. This method lets me fairly easily trigger GPU hangs and eventually
cause loss of acceleration one way or another.

Arch Linux x86_64, as always.
Active packages:
linux 4.16.8-1    (having trouble running drm-tip...)
xorg-server 1.20.0-1
mesa 18.0.3-3
libdrm 2.4.92-1
xf86-video-intel 1:2.99.917+832+g35947721-1

In this case, I am running a freshly installed cinnamon (1.8) and using
gnome-terminal. To trigger the bug, I open a new terminal window and resize it
horizontally (resizing vertically alone doesn't work). Rapidly shrinking the
terminal window horizontally is very likely to trigger the bug. A larger window
seems to help; the contents or zoom level of the window might have an effect as
well. Doing this rapidly for extended periods of time has very interesting
effects (see below).

Nearly every time this method trips the bug, the window flickers and/or shows
corruption, and occasionally other random parts of the screen are corrupted as
well. If I manage to stop resizing the window when the corruption occurs, the
corruption tends to persist. The windows themselves contain the corruption: it
shows up in screenshots and alt-tab previews, and it survives minimizing and
unminimizing the window. I have several screenshots of the corruption as it
builds. If you patch the xf86-video-intel driver to turn would-be asserts into
driver warnings (non-debug version), you'll see that the warnings (those
reachable in the non-debug build) are emitted whenever the graphical corruption
occurs.

When I repeat this method with varying intensity for an extended period, the
GPU hangs after several minutes, sometimes repeatedly. If you're unlucky, the
reset can fail (I haven't reproduced that on my exact software setup, though).
Alternatively, the 2D driver can break and lose acceleration almost like in the
original bug report, but I haven't yet replicated that either.

I have as many GPU error states as I could capture. However, except for a few
relative timestamps and maybe one or two other minor things, they all appear
*exactly* the same. Does it only capture the first error?

The accel loss bug I managed to trigger printed this:
[ 10991.936] (EE) intel(0): Failed to submit rendering commands (Input/output error), disabling acceleration.
This occurred at about the same time as (~0.5 seconds after) one of several GPU
hangs, along with a few fence timeouts, which makes me think that it might be
unlucky timing. The dmesg output from around that time:
...
[10868.532336] i915 0000:00:02.0: Resetting chip after gpu hang
[10877.492358] i915 0000:00:02.0: Resetting chip after gpu hang
[10937.439048] i915 0000:00:02.0: Resetting chip after gpu hang
[10946.399014] i915 0000:00:02.0: Resetting chip after gpu hang
[10948.532260] asynchronous wait on fence i915:[global]:25b0f4 timed out
[10955.359040] i915 0000:00:02.0: Resetting chip after gpu hang
[10964.532326] i915 0000:00:02.0: Resetting chip after gpu hang
[10973.492343] i915 0000:00:02.0: Resetting chip after gpu hang
[10982.452359] i915 0000:00:02.0: Resetting chip after gpu hang
[10991.412319] i915 0000:00:02.0: Resetting chip after gpu hang
[11000.376705] i915 0000:00:02.0: Resetting chip after gpu hang
[11002.505602] asynchronous wait on fence i915:[global]:25b0fc timed out
[11009.549091] i915 0000:00:02.0: Resetting chip after gpu hang
[11013.385588] asynchronous wait on fence i915:[global]:25b0fe timed out
[11018.505675] i915 0000:00:02.0: Resetting chip after gpu hang
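
To check the timing against the kernel log, a small script can pull out the
reset timestamps and measure the gap to the Xorg error. This is only a sketch:
the function names are made up for illustration, and it assumes, as the
comparison above does, that the Xorg.0.log and dmesg timestamps run on the same
clock.

```python
import re

# Matches dmesg lines such as
# "[10991.412319] i915 0000:00:02.0: Resetting chip after gpu hang"
RESET_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*Resetting chip after gpu hang")


def reset_times(dmesg_lines):
    """Return the timestamp (in seconds) of every GPU reset line."""
    times = []
    for line in dmesg_lines:
        match = RESET_RE.match(line)
        if match:
            times.append(float(match.group(1)))
    return times


def last_reset_before(times, t):
    """Return the latest reset timestamp that is <= t, or None if there is none."""
    earlier = [x for x in times if x <= t]
    return max(earlier) if earlier else None


# A few lines copied from the dmesg excerpt above.
sample = [
    "[10982.452359] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[10991.412319] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[11000.376705] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[11002.505602] asynchronous wait on fence i915:[global]:25b0fc timed out",
]

times = reset_times(sample)
accel_loss = 10991.936  # timestamp of the (EE) line in Xorg.0.log
gap = accel_loss - last_reset_before(times, accel_loss)
print(f"acceleration lost {gap:.3f} s after the preceding reset")
```

Feeding it the full dmesg instead of the sample would also make it easy to see
how regularly the resets are spaced.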

The Xorg.0.log around that time (piped through uniq -c for compactness):
...
      1 [ 10931.396] (WW) intel(0): assertion failed: `bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.397] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      2 [ 10931.398] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      6 [ 10931.399] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      4 [ 10931.400] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      6 [ 10931.401] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.402] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10931.439] (WW) intel(0): assertion failed: `bo->pitch*kgem_aligned_height(kgem, height, bo->tiling) <= kgem_bo_size(bo)'; ignoring and trudging onward.
      3 [ 10931.440] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
     13 [ 10931.441] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      8 [ 10931.442] (WW) intel(0): assertion failed: `box->y2 * bo->pitch <= kgem_bo_size(bo)'; ignoring and trudging onward.
      1 [ 10991.936] (EE) intel(0): Failed to submit rendering commands (Input/output error), disabling acceleration.
      1 [ 10991.937] (EE) intel(0): When reporting this, please include /sys/class/drm/card0/error and the full dmesg.
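
For anyone who wants to do the same compaction inside a script rather than with
uniq -c, here is a rough Python equivalent. It is a sketch only: the function
names are hypothetical, the input lines are placeholders rather than real log
output, and the count column is padded to mirror the layout of the excerpt
above.

```python
from itertools import groupby


def uniq_c(lines):
    """Collapse runs of identical adjacent lines into (count, line) pairs."""
    return [(sum(1 for _ in run), line) for line, run in groupby(lines)]


def format_uniq_c(pairs):
    """Render the pairs with a right-aligned count column, then the line."""
    return [f"{count:7d} {line}" for count, line in pairs]


# Placeholder input, not real log lines.
lines = ["warning A", "warning A", "warning B", "warning A"]

for row in format_uniq_c(uniq_c(lines)):
    print(row)
```

Like uniq, this only merges adjacent duplicates, so the same message logged
under a different timestamp starts a new row.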

I might be able to test using a debug driver if need be.
