4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups
Michael S. Tsirkin
mst at redhat.com
Wed Nov 28 02:49:44 UTC 2018
On Tue, Nov 27, 2018 at 11:36:50AM +0200, Mika Westerberg wrote:
> +linux-acpi
>
> Hi Michael,
>
> On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> > So a new thinkpad:
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> >
> > Hangs whenever I try to poke at the card. It starts happily enough with
> >
> > [ 3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > [ 3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> > [ 3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> > [ 3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> > [ 3.971727] nouveau: detected PR support, will not use DSM
BTW this is also a bit strange. It says it will not use DSM - but did
it maybe use DSM previously? The ACPI Warning seems to indicate that
it did ...
And just to complete the picture here's the _DSM from ACPI:
Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
{
If ((Arg0 == ToUUID ("3e5b41c6-eb1d-4260-9d15-c71fbadae414")))
{
Switch (ToInteger (Arg2))
{
Case (Zero)
{
If ((Arg1 == One))
{
Return (0x0001E7FF)
}
}
Case (One)
{
If ((Arg1 == One))
{
If (((S0ID == One) && (OSYS < 0x07DF)))
{
If (((DerefOf (Arg3 [Zero]) & 0xFF) == One))
{
\GUAM (One)
}
Local0 = (DerefOf (Arg3 [One]) & 0xFF)
If ((Local0 == Zero))
{
\GUAM (Zero)
}
}
If ((DerefOf (Arg3 [Zero]) == Zero))
{
Local0 = CLID /* \_SB_.PCI0.GFX0.CLID */
If ((0x80000000 & Local0))
{
CLID &= 0x0F
GLID (CLID)
}
}
Return (One)
}
}
Case (0x02)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x03)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x04)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x05)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x06)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x07)
{
If ((Arg1 == One))
{
IBTT = (DerefOf (Arg3 [Zero]) & 0xFF)
Return (One)
}
}
Case (0x08)
{
If ((Arg1 == One))
{
IPSC = (DerefOf (Arg3 [Zero]) & 0xFF)
If ((DerefOf (Arg3 [One]) & 0xFF))
{
IPAT = (DerefOf (Arg3 [One]) & 0xFF)
IPAT--
}
IBIA = ((DerefOf (Arg3 [0x02]) >> 0x04) & 0x07)
Return (One)
}
}
Case (0x09)
{
If ((Arg1 == One))
{
Return (One)
}
}
Case (0x0A)
{
If ((Arg1 == One))
{
Local0 = (LIDS << 0x08)
Local0 += 0x0100
Return (Local0)
}
}
Case (0x0D)
{
If ((Arg1 == One))
{
Local0 = ((DerefOf (Arg3 [0x03]) << 0x18) | (DerefOf (
Arg3 [0x02]) << 0x10))
Local0 &= 0xEFFF0000
Local0 &= (DerefOf (DBTB [IBTT]) << 0x10)
Local0 |= IBTT
Return (Local0)
}
}
Case (0x0E)
{
If ((Arg1 == One))
{
Local0 = IPSC /* \IPSC */
Local0 |= (IPAT << 0x08)
Local0 += 0x0100
Local0 |= (LIDS << 0x10)
Local0 += 0x00010000
Local0 |= (IBIA << 0x14)
Return (Local0)
}
}
Case (0x0F)
{
If ((Arg1 == One))
{
Local0 = GIVD /* \_SB_.PCI0.GFX0.GIVD */
Local0 ^= One
Local0 |= (GMFN << One)
Local0 |= 0x1800
Local0 |= (IDMS << 0x11)
Local0 |= (DerefOf (DerefOf (CDCT [HVCO]) [CDVL]) <<
0x15)
Return (Local0)
}
}
Case (0x10)
{
If ((Arg1 == One))
{
Name (KSVP, Package (0x02)
{
0x80000000,
0x8000
})
KSVP [Zero] = KSV0 /* \KSV0 */
KSVP [One] = KSV1 /* \KSV1 */
Return (KSVP) /* \_SB_.PCI0.GFX0._DSM.KSVP */
}
}
}
}
Return (Buffer (One)
{
0x00 // .
})
}
I am not sure what makes it think that Arg3 is a buffer and
not a package: IIRC Index (decoded C-style here as []) can apply
to either.
Am I maybe misunderstanding the warning?
> > [ 3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> > [ 3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> > [ 4.009875] PM: Image not found (code -22)
> > [ 4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> > [ 4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> > [ 4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> > [ 4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> > [ 4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> > [ 4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> > [ 4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> > [ 4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> > [ 4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> > [ 4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> > [ 4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> > [ 4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> > [ 4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> > [ 4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> > [ 4.508355] [drm] Driver supports precise vblank timestamp query.
> > [ 4.509812] [drm] Cannot find any crtc or sizes
> > [ 4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> >
> >
> > Although that type mismatch is a bit worrying. And I'm not sure what
> > prints PM: Image not found.
>
> That is fine, it just says it cannot find a hibernation image from swap
> device. I guess you have resume=... in the kernel command line.
>
> > But after a short while it gets pretty busy:
> >
> >
> > [ 52.917009] No Local Variables are initialized for Method [NVPO]
> > [ 52.917011] No Arguments are initialized for method [NVPO]
> > [ 52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [ 52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [ 52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> > [ 52.917108] acpi device:00: Failed to change power state to D0
>
> Here it seems to fail to turn on the power resource for the device.
>
> > [ 52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [ 52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> > [ 52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > [ 53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [ 53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> > [ 53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> > [ 53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> >
> >
> > And then kernel proceeds to throw up errors at random places, e.g.
> >
> > [ 67.021892] cfg80211: failed to load regulatory.db
> > [ 67.021895] cfg80211: failed to load regulatory.db
> > [ 67.021897] cfg80211: failed to load regulatory.db
> > [ 67.021900] cfg80211: failed to load regulatory.db
> > [ 67.021927] cfg80211: failed to load regulatory.db
> > [ 67.021928] cfg80211: failed to load regulatory.db
> > [ 67.021932] cfg80211: failed to load regulatory.db
> > [ 67.021934] cfg80211: failed to load regulatory.db
> > [ 67.024463] cfg80211: failed to load regulatory.db
> > [ 99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.
BTW this one might indicate that somehow iwlwifi got powered
down too somehow. It's at
00:14.3 Network controller: Intel Corporation Wireless-AC 9560 [Jefferson Peak] (rev 10)
so really shouldn't be affected, but go figure. If driver really is getting
all-ones from the device, it just might try to poke at a wrong b:d.f by mistake
maybe ...
> > followed by soft lockups and sometimes hard lockups in places
> > like attempts to walk skb lists.
> >
> > Adding runpm=0 does away with this issue.
> >
> > The specific test was with noaccel=1 - it does not seem to change
> > things for me.
> >
> > I poked at the ACPI method NVPO and yes it does actually
> > seem to execute a while loop waiting for some register
> > to become 0. Which I guess never happens? Because card
> > is in a low power state and so reads return ffffffff maybe?
>
> Yes, it could be the case.
>
> > X isn't happy even with runpm=0 but that might be a different
> > issue - I thought runpm=0 might be an easier place to start debugging
> > things given there are logs of the failure.
> >
> > Using 4.20.0-rc3 right now.
> >
> > Userspace bits are from fedora 29:
> > xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> >
> > firmware is pretty recent:
> > linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> >
> > More hints for debugging would be appreciated.
> > If anyone wants me to play with a different kernel tree,
> > let me know.
>
> Can you share full dmesg and acpidump of the system? I would like to
> check the power resource methods.
Pls find it as an attachment here:
https://bugs.freedesktop.org/show_bug.cgi?id=108873
If you like I can send it to you directly too -
spamming the list with it isn't helpful I guess?
--
MST
More information about the dri-devel
mailing list