Regression of v4.6-rc vs. v4.5 bisected: a98ee79317b4 "drm/i915/fbc: enable FBC by default on HSW and BDW"

Zanoni, Paulo R paulo.r.zanoni at intel.com
Thu May 5 18:50:14 UTC 2016


On Thu, 2016-05-05 at 19:45 +0200, Stefan Richter wrote:
> On Apr 30 Stefan Richter wrote:
> > 
> > On Apr 29 Stefan Richter wrote:
> > > 
> > > On Apr 26 Stefan Richter wrote:  
> > > > 
> > > > v4.6-rc solidly hangs after a short while after boot, login to X11, and
> > > > doing nothing much remarkable on the just brought up X desktop.
> > > > 
> > > > Hardware: x86-64, E3-1245 v3 (Haswell),
> > > >           mainboard Supermicro X10SAE,
> > > >           using integrated Intel graphics (HD P4600, i915 driver),
> > > >           C226 PCH's AHCI and USB 2/3, ASMedia ASM1062 AHCI,
> > > >           Intel LAN (i217, igb driver),
> > > >           several IEEE 1394 controllers, some of them behind
> > > >           PCIe bridges (IDT, PLX) or PCIe-to-PCI bridges (TI, Tundra)
> > > >           and one PCI-to-CardBus bridge (Ricoh)
> > > > 
> > > > kernel.org kernel, Gentoo Linux userland
> > > > 
> > > > 1. known good:  v4.5-rc5 (gcc 4.9.3)
> > > >    known bad:   v4.6-rc2 (gcc 4.9.3), only tried one time
> > > > 
> > > > 2. known good:  v4.5.2 (gcc 5.2.0)
> > > >    known bad:   v4.6-rc5 (gcc 5.2.0), only tried one time
> > > > 
> > > > I will send my linux-4.6-rc5/.config in a follow-up message.  
> >  .config: http://www.spinics.net/lists/kernel/msg2243444.html
> >    lspci: http://www.spinics.net/lists/kernel/msg2243447.html
> > 
> > Some userland package versions, in case these have any bearing:
> > x11-base/xorg-drivers-1.17
> > x11-base/xorg-server-1.17.4
> > x11-base/xorg-x11-7.4-r2
> Furthermore, there is a single display hooked up via DisplayPort.
> 
> > 
> > > 
> > > After it proved impossible to capture an oops through netconsole, I
> > > started git bisect.  This will apparently take almost a week, as git
> > > estimated 13 bisection steps and I will be allowing about 12 hours of
> > > uptime as a sign for a good kernel.  (In my four or five tests of bad
> > > kernels before I started bisection, they hung after 3 minutes...5.5 hours
> > > uptime, with no discernible difference in workload.  Maybe 12 h cutoff is
> > > even too short...)
> I took at least 18 hours uptime (usually 24 hours) as a sign for good
> kernels.  During the bisection, bad kernels hung after 3 h, 2 h, 9 min,
> 45 min, and 4 min uptime.  Thus I arrived at a98ee79317b4 "drm/i915/fbc:
> enable FBC by default on HSW and BDW" as the point where the hangs are
> introduced.
> 
> Quoting the changelog of the commit:

Thanks for following the instructions on the commit message! :)

> 
>     Oh, and in case you - the person reading this commit message - found
>     this commit through git bisect, please do the following:
>      - Check your dmesg and see if there are error messages mentioning
>        underruns around the time your problem started happening.
> 
> Well, I always had the following lines in dmesg:
> [drm:intel_set_cpu_fifo_underrun_reporting] *ERROR* uncleared fifo underrun on pipe A
> [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe A FIFO underrun

Oh, well... I had a patch that would just disable FBC in case we saw a
FIFO underrun, but it was rejected. Maybe this is the time to think
about it again? Otherwise, I can't think of much besides disabling FBC
on HSW until all the underrun and watermark regressions are fixed
for good.

> 
> I always got these when I switch on the DisplayPort attached monitor.
> Recently I changed userland from kdm to sddm and noticed that I
> apparently get these when sddm shuts down.  I am not aware of whether
> or not this also already happened with kdm.
> 
> However, "around the time your problem started happening" there is
> nothing in dmesg, because "your problem" is a complete hang without
> possibility of disk IO and without netconsole output.
> 
>      - Download intel-gpu-tools, compile it, and run:
>        $ sudo ./tests/kms_frontbuffer_tracking --run-subtest '*fbc-*' 2>&1 | tee fbc.txt
>        Then send us the fbc.txt file, especially if you get a failure.
>        This will really maximize your chances of getting the bug fixed
>        quickly.
> 
> Do you need this while FBC is enabled, or can I run it while FBC is
> disabled?

FBC enabled. Considering your description, my hope is that maybe some
specific subtest will be able to hang your machine, so testing this
again will require only running the specific subtest instead of waiting
18 hours.
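
One way to narrow this down might be to run the FBC subtests one at a time
and flush a marker to disk before each, so that after a hard hang the log
shows which subtest was running. The sketch below is only an illustration:
the function name run_fbc_subtests and its two arguments are made up here,
and it assumes the test binary accepts --list-subtests and --run-subtest.

```shell
#!/bin/sh
# Hypothetical helper, not part of intel-gpu-tools.
# $1: path to the kms_frontbuffer_tracking binary
# $2: log file that should survive a hang
run_fbc_subtests() {
    igt_bin=$1
    log=$2
    # Select the same subtests as the '*fbc-*' pattern, one per line.
    "$igt_bin" --list-subtests | grep 'fbc-' | while read -r subtest; do
        echo "starting $subtest" >> "$log"
        sync    # flush the marker to disk before the test can hang the machine
        # </dev/null keeps the test from consuming the list of subtest names
        "$igt_bin" --run-subtest "$subtest" >> "$log" 2>&1 < /dev/null
        echo "finished $subtest (exit $?)" >> "$log"
        sync
    done
}
```

Run as root from the intel-gpu-tools build directory, for example
`run_fbc_subtests ./tests/kms_frontbuffer_tracking fbc.txt`. After a hang,
the last "starting" line without a matching "finished" line names the
suspect subtest.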

> 
>      - Try to find a reliable way to reproduce the problem, and tell us.
> 
> The reliable way is to just wait for the kernel to hang after about
> 3 minutes to 5.5 hours.  I have not identified any special activity
> which would trigger the hang.
> 
>      - Boot with drm.debug=0xe, reproduce the problem, then send us the
>        dmesg file.
> 
> I can try this, but I am skeptical about getting any useful kernel
> messages from before the hang.

Agreed.

> 
> PS:
> I am mentioning the following just in case that it has any relationship
> with the FBC related kernel freezes.  Maybe it doesn't...  There is
> another recent regression on this PC, but I have not yet figured out
> whether it was introduced by any particular kernel version.  The
> regression is:  When switching from X11 to text console by [Ctrl][Alt][Fx]
> or by shutting down sddm, I often only get a blank screen.  I suspect
> that this regression was introduced when I replaced kdm by sddm, but
> I am not sure about that.

Maybe there is some relationship, since this operation involves a mode
change. You can also try checking dmesg to see if there are underruns
right when you do the change.
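
A minimal sketch of that check, assuming the underrun messages keep the
wording quoted earlier in this thread ("fifo underrun" / "FIFO underrun").
The function name filter_underruns is made up for illustration.

```shell
# Print only the FIFO underrun lines from a kernel log given on stdin.
# Case-insensitive, so it matches both message variants quoted above.
filter_underruns() {
    grep -i 'fifo underrun'
}
```

For example, `dmesg | filter_underruns` lists past underruns, and
`dmesg --follow | filter_underruns` left running in a terminal or SSH
session shows whether a new one appears at the moment of the VT switch.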


If you don't want to keep carrying a manual revert, you can just boot
with i915.enable_fbc=0 for now (or add a file under /etc/modprobe.d).
Also, it would be good to know if you still see the machine hang even
with FBC disabled.
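
For the modprobe.d route, a fragment along these lines should do. The file
name is an arbitrary choice, and this only takes effect when i915 is built
as a module and the file is visible when the module loads (an initramfs may
need to be regenerated); with a built-in driver, use the kernel command
line option instead.

```
# /etc/modprobe.d/i915-fbc.conf  (hypothetical file name)
# Disable framebuffer compression in the i915 driver.
options i915 enable_fbc=0
```

After a reboot, reading /sys/module/i915/parameters/enable_fbc as root
should report 0 if the setting was picked up.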

Thanks,
Paulo

