[Intel-gfx] [PATCH] HAX drm/i915: Disable CSR (DMC) for Kabylake

Thu Oct 12 21:07:16 UTC 2017

Quoting Rodrigo Vivi (2017-10-12 21:55:24)
> On Thu, Oct 12, 2017 at 07:43:04PM +0000, Chris Wilson wrote:
> > Quoting Rodrigo Vivi (2017-10-12 19:04:45)
> > > On Thu, Oct 12, 2017 at 10:18:13AM +0000, Chris Wilson wrote:
> > > > ---
> > > 
> > > Why?
> > 
> > Have you looked at the random but frequent mmio death on Kabylake?
> > Seems rather reminiscent of earlier DMC bugs.
> 
> hm... Could you please give us an example?
> a link?

There's usually one per shard run; you can see one in the baseline CI
results for this patch:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3222/shard-kbl4/dmesg14.log

The pattern I'm looking at starts with
<3>[  160.736276] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:56:pipe C] flip_done timed out
usually from kms_flip, but it's probably anything that cycles the
powerwell at just the right frequency, leading to a GPU hang and
<3>[  188.832189] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
and then repeats until the shard is rebooted.
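
For anyone who wants to grep their own logs for the same sequence, here
is a rough sketch. It assumes a plain-text dmesg dump and only matches
the two message strings quoted above; the regexes and the find_pattern()
helper are illustrative, not part of any CI tooling.

```python
import re

# Assumed regexes, built only from the two dmesg lines quoted above.
FLIP_TIMEOUT = re.compile(r"\[\s*(\d+\.\d+)\].*flip_done timed out")
IDLE_TIMEOUT = re.compile(r"\[\s*(\d+\.\d+)\].*Timeout waiting for engines to idle")


def find_pattern(lines):
    """Return (flip_ts, idle_ts) for the first flip_done timeout that is
    later followed by an engine idle timeout, or None if that sequence
    never occurs in the given lines."""
    flip_ts = None
    for line in lines:
        m = FLIP_TIMEOUT.search(line)
        if m and flip_ts is None:
            # Remember the first flip_done timeout we see.
            flip_ts = float(m.group(1))
            continue
        m = IDLE_TIMEOUT.search(line)
        if m and flip_ts is not None:
            # Idle timeout after a flip_done timeout: the pattern.
            return flip_ts, float(m.group(1))
    return None


sample = [
    "<3>[  160.736276] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:56:pipe C] flip_done timed out",
    "<3>[  188.832189] [drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle",
]
print(find_pattern(sample))
```

A match only says the two messages appeared in that order; it does not
by itself show that the DMC is responsible.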

> and why do you think that would be dmc?

That it always seems to be triggered from interaction with the display
powerwells.

> I've seen many bugs there on kbl, but mostly associated to LSPCON
> and link trainings, and gpu hangs... and the ones I looked now
> seemed that DC state was blocked.

Those DC hangs are suspected to be DMC issues, as I am sure you already
know, such as the one where it causes lost interrupts.

> But well... yeap... it is a black box right?! so the
> experiment is valid for sure.

It's a known broken blackbox that we are always waiting on for
bugfixes. Sounds familiar. :(
-Chris