Kernel Freeze with American Megatrends BIOS

Peter Wu peter at lekensteyn.nl
Tue Aug 30 19:53:37 UTC 2016


On Mon, Aug 29, 2016 at 11:02:10AM -0500, Bjorn Helgaas wrote:
> [+cc linux-acpi, linux-kernel, dri-devel]
> 
> Hi Roland,
> 
> I have no idea how to debug this problem.  Are you seeing something
> that suggests it may be a PCI problem?

Yes, I suspect there is an ACPI and/or PCI problem, possibly
device-specific. Steps to reproduce on the affected machines:

 1. Load nouveau.
 2. Wait for it to runtime suspend.
 3. Invoke 'lspci'; this resumes the Nvidia PCI device via nouveau.
 4. lspci never returns; a few moments later an AML_INFINITE_LOOP is
    reported.
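
Step 2 can be checked before running lspci. The sketch below reads the
standard runtime PM attribute from sysfs. The PCI address 0000:01:00.0
and the helper name runtime_status are assumptions for illustration
only; substitute the address of the Nvidia device on the machine under
test.

```python
from pathlib import Path


def runtime_status(bdf="0000:01:00.0", sysfs_root="/sys"):
    """Return the runtime PM status of a PCI device, or None if absent.

    bdf is an assumed address for the Nvidia device; sysfs_root is only
    a parameter so the helper can be exercised without real hardware.
    """
    path = Path(sysfs_root) / "bus/pci/devices" / bdf / "power/runtime_status"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # "suspended" means step 2 is complete and lspci would trigger the resume.
    print(runtime_status())
```

Reading this attribute does not itself wake the device, unlike reading
its config space via lspci.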

If you use the external bbswitch module, the effect is the same. I have
been trying to debug this for some time on nouveau with no luck. The
PCI/PM D3cold patches from Mika make no difference.

Runtime resume via nouveau triggers some ACPI methods (I'll assume the
Windows 8-style PR method and take the Clevo P651 as example):

    \_SB.PCI0.PEG0.PG00._ON () ->
        \_SB.PCI0.PGON (0)

Then:

    Method (PGON, 1, Serialized) {
        PION = Arg0     // note: 0 for PG00
        // ...
        If ((OSYS != 0x07DF)) { /* Not Windows 2015 (Windows 10), see below */ }
        Else {
            LKEN (PION)
        }
        // this is the infinite loop: it tries to bring the PCIe link to
        // full speed, but fails to do so.
        While ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
            Local0 = 0x20
            While (Local0) {
                If ((\_SB.PCI0.PEG0.LNKS < 0x07)) {
                    Stall (0x64)
                    Local0--
                } Else { Break }
            }
            If ((Local0 == Zero)) {
                \_SB.PCI0.PEG0.RTLK = One
                Stall (0x64)
            }
        }
        // ...
    }
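
To make the structure of that wait loop explicit, here is a rough
Python model of it (an illustration, not firmware code). The callables
read_lnks and retrain stand in for reading LNKS and writing RTLK, and
max_outer is a safety cap that the ASL does not have; all three names
are assumptions. Stall (0x64) is modelled as 100 microseconds.

```python
def pgon_wait_for_link(read_lnks, retrain, max_outer=None):
    """Model of the LNKS wait loop in PGON.

    read_lnks: callable returning the current LNKS value.
    retrain:   callable standing in for 'RTLK = One'.
    max_outer: cap on retrain attempts (NOT present in the firmware);
               None reproduces the unbounded behaviour.
    Returns (link_up, total_stall_us).
    """
    stalled_us = 0
    outer = 0
    while read_lnks() < 0x07:
        if max_outer is not None and outer >= max_outer:
            return False, stalled_us
        outer += 1
        local0 = 0x20                  # 32 polls per attempt
        while local0:
            if read_lnks() < 0x07:
                stalled_us += 0x64     # Stall (0x64): 100 us
                local0 -= 1
            else:
                break
        if local0 == 0:
            retrain()                  # RTLK = One
            stalled_us += 0x64
    return True, stalled_us


if __name__ == "__main__":
    # A link that never reaches 0x07: only the artificial cap ends the loop.
    print(pgon_wait_for_link(lambda: 0, lambda: None, max_outer=3))
```

Each unsuccessful attempt costs 32 * 100 us of polling plus one 100 us
stall after the retrain, and nothing in the ASL bounds the number of
attempts.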

Without any workaround, this piece of code is invoked:

    Method (LKEN, 1, NotSerialized) {
        Local3 = (CPEX & 0x0F)  // CPEX at 0x5ff9be7f and has value 000506e3
        If ((Local3 == Zero)) {
            /* Similar to below, but with Q0L0 -> P0L0 (register 0xBC bit 6) */
        } ElseIf ((Local3 != Zero)) {
            If ((Arg0 == Zero)) {
                /* Enter L0 Activate state.
                 * (LKDS tries to enter L2, deep-energy-saving state.) */
                Q0L0 = One      // register 0x249 bit 0; \_SB.PCI0.OPG0.Q0L0 00:01.0
                Sleep (0x10)
                Local0 = Zero
                While (Q0L0) {
                    If ((Local0 > 0x04)) { Break }
                    Sleep (0x10)
                    Local0++
                }
            } else { /* other cases, but we are only interested in PGON(0) */ }
        }
    }
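
The arithmetic behind that branch can be checked with a few lines of
Python. This is only a sketch: the function names are made up for
illustration, and the worst case assumes Q0L0 stays set for the whole
loop.

```python
CPEX = 0x000506E3  # value quoted in the LKEN comment above


def lken_branch(cpex):
    """Return which top-level branch of LKEN is taken for a CPEX value."""
    return "P0L0" if (cpex & 0x0F) == 0 else "Q0L0"


def lken_max_wait_ms(sleep_ms=0x10):
    """Worst-case time spent in Sleep() if Q0L0 never clears."""
    total = sleep_ms          # Sleep (0x10) right after Q0L0 = One
    local0 = 0
    while True:               # While (Q0L0), with the bit assumed stuck at 1
        if local0 > 0x04:
            break
        total += sleep_ms     # Sleep (0x10)
        local0 += 1
    return total


if __name__ == "__main__":
    print(hex(CPEX & 0x0F), lken_branch(CPEX), lken_max_wait_ms())
```

With the quoted CPEX value the low nibble is 3, so the Q0L0 branch is
taken, and LKEN gives up after at most 96 ms of sleeping whether or not
Q0L0 has cleared.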

The acpi_osi="!Windows 2015" workaround will invoke this instead:

    If ((OSYS != 0x07DF)) {
        If ((PION == Zero)) {
            P0AP = Zero  /* PGOF writes 3 */
            P0RM = Zero  /* PGOF writes 1 */
        }
        If ((PBGE != Zero)) { /* Observed to be false (PBGE == 0) */
            If (SBDL (PION)) {
                PUAB (PION)
                CBDL = GUBC (PION)
                MBDL = GMXB (PION)
                If ((CBDL > MBDL)) {
                    CBDL = MBDL /* \_SB_.PCI0.MBDL */
                }
                PDUB (PION, CBDL)
            }
        }
        If ((PION == Zero)) {
            P0LD = Zero     /* Link Disable = 0, PGOF sets 1 instead. */
            P0TR = One      /* Train? (PGOF does not set this). */
            TCNT = Zero
            While ((TCNT < LDLY)) { /* LDLY = 300 */
                If ((P0VC == Zero)) {
                    /* VC Negotiation Pending 0 means VC negotiation is complete. */
                    Break
                }
                Sleep (0x10)
                TCNT += 0x10 /* At most 19 iterations, sleeping for 304ms. */
            }
        }
    }
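
For comparison with the LNKS loop, the polling loop in this path is
bounded. A small Python model (illustrative only; vc_poll and its
p0vc_clears_after parameter are hypothetical names) confirms the
iteration count in the comment and that 0x07DF is simply 2015:

```python
OSYS_WIN2015 = 0x07DF  # value PGON compares OSYS against


def vc_poll(ldly=300, step_ms=0x10, p0vc_clears_after=None):
    """Model the P0VC polling loop; returns (sleeps, slept_ms).

    p0vc_clears_after: number of sleeps after which P0VC reads 0, or
    None if it never clears (hypothetical parameter for illustration).
    """
    tcnt = 0
    sleeps = 0
    while tcnt < ldly:
        if p0vc_clears_after is not None and sleeps >= p0vc_clears_after:
            break             # P0VC == Zero: negotiation complete
        sleeps += 1           # Sleep (0x10)
        tcnt += step_ms       # TCNT += 0x10
    return sleeps, tcnt


if __name__ == "__main__":
    print(OSYS_WIN2015, vc_poll())
```

Even if P0VC never clears, this path returns after 19 sleeps, which is
why it cannot hang the way the LNKS loop does.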

The comments above are my own interpretation based on the acpidumps I
extracted from the machine. These notes and ACPI tables can be found at
https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
https://github.com/Lekensteyn/acpi-stuff/tree/master/dsl/Clevo_P651RA

Other affected devices have similar code; the differences are small:
 - No check for LNKS (avoids the infinite loop, but device is still off)
 - Instead of a check for != "Windows 2015", they check for == "Windows
   2009" or even for == "Windows 2009" || "Windows 2013" (Dell Inspiron
   7559).

The tested kernels (with bbswitch or nouveau) were Linux 4.4.0, 4.6,
4.7 (nouveau + PCI/PM + nouveau PR patches). The PCIe device is
something from the GTX 9xxM family in all cases.

I have a bunch of PCI config dumps from Windows and Linux, but there is
nothing extraordinary in them. I also did an ACPI trace via a Checked/Debug build
of Windows, but it just confirms that the ACPI method we use for the
Nvidia device is the correct one.

Let me know if you need more information; I would be glad to provide it.

Kind regards,
Peter

> On Tue, Aug 23, 2016 at 11:23:45AM +0200, Roland Singer wrote:
> > Hi,
> > 
> > hope somebody can help me fix this kernel problem which affects the following machines:
> > 
> > - Clevo P651RA (i7-6700HQ/GTX 965M, part of the P6xxRx family which are also affected)
> > - MSI GE62 Apache Pro (i7-6700HQ/GTX 960M)
> > - Gigabyte P35V5 (i7-6700HQ/GTX 970M)
> > - Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016)
> > 
> > 
> > The kernel freezes if the graphical user session (Xorg & Wayland) is
> > started with a switched off discrete GPU card (NVIDIA).
> > If the discrete GPU is switched off after the graphical session start,
> > then everything works as expected, until the graphical session is restarted.
> > 
> > This problem seems to be linked to specific BIOS settings. If the computer
> > is started with the following command line:
> > 
> > acpi_osi=! acpi_osi="Windows 2009"
> > 
> > then the kernel freeze does not occur anymore. However this required a special
> > ACPI DSDT firmware patch for the Razer Blade 2016 laptop:
> > 
> > https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt
> > 
> > I strongly recommend to fix this in the kernel and I am ready to help and solve
> > this problem with some help.
> > 
> > Here is a link to the GitHub issue with further information:
> > 
> > https://github.com/Bumblebee-Project/Bumblebee/issues/764#issuecomment-241212595
> > 
> > Here is some more detailed information:
> > 
> > https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt
> > 
> > Hope somebody can help.

