4.20.0-rc3 nouveau/Quadro P2000 Mobile: runpm causing ACPI errors, lockups

Mika Westerberg mika.westerberg at linux.intel.com
Tue Nov 27 09:36:50 UTC 2018


+linux-acpi

Hi Michael,

On Mon, Nov 26, 2018 at 10:53:26PM -0500, Michael S. Tsirkin wrote:
> So a new thinkpad:
> 01:00.0 VGA compatible controller: NVIDIA Corporation GP107GLM [Quadro P2000 Mobile] (rev a1)
> 
> Hangs whenever I try to poke at the card. It starts happily enough with
> 
> [    3.971515] ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> [    3.971553] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20181003/nsarguments-66)
> [    3.971721] pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
> [    3.971726] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
> [    3.971727] nouveau: detected PR support, will not use DSM
> [    3.971745] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
> [    3.971923] nouveau 0000:01:00.0: NVIDIA GP107 (137000a1)
> [    4.009875] PM: Image not found (code -22)
> [    4.135752] nouveau 0000:01:00.0: DRM: VRAM: 4096 MiB
> [    4.135753] nouveau 0000:01:00.0: DRM: GART: 536870912 MiB
> [    4.135754] nouveau 0000:01:00.0: DRM: BIT table 'A' not found
> [    4.135755] nouveau 0000:01:00.0: DRM: BIT table 'L' not found
> [    4.135756] nouveau 0000:01:00.0: DRM: TMDS table version 2.0
> [    4.135756] nouveau 0000:01:00.0: DRM: DCB version 4.1
> [    4.135757] nouveau 0000:01:00.0: DRM: DCB outp 00: 02800f76 04600020
> [    4.135758] nouveau 0000:01:00.0: DRM: DCB outp 01: 02011f62 00020010
> [    4.135759] nouveau 0000:01:00.0: DRM: DCB outp 02: 01022f46 04600010
> [    4.135760] nouveau 0000:01:00.0: DRM: DCB outp 03: 01033f56 04600020
> [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 00: 00020047
> [    4.135761] nouveau 0000:01:00.0: DRM: DCB conn 01: 00010161
> [    4.135762] nouveau 0000:01:00.0: DRM: DCB conn 02: 00001246
> [    4.135763] nouveau 0000:01:00.0: DRM: DCB conn 03: 00002346
> [    4.508355] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
> [    4.508355] [drm] Driver supports precise vblank timestamp query.
> [    4.509812] [drm] Cannot find any crtc or sizes
> [    4.510144] [drm] Initialized nouveau 1.3.1 20120801 for 0000:01:00.0 on minor 2
> 
> 
> Although that type mismatch is a bit worrying. And I'm not sure what
> prints PM: Image not found.

That is fine; it only means the kernel could not find a hibernation image
on the swap device. I guess you have resume=... on the kernel command line.

> But after a short while it gets pretty busy:
> 
> 
> [   52.917009] No Local Variables are initialized for Method [NVPO]
> [   52.917011] No Arguments are initialized for method [NVPO]
> [   52.917012] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PEGP.NVPO, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917063] ACPI Error: Method parse/execution failed \_SB.PCI0.PGON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917084] ACPI Error: Method parse/execution failed \_SB.PCI0.PEG0.PG00._ON, AE_AML_LOOP_TIMEOUT (20181003/psparse-516)
> [   52.917108] acpi device:00: Failed to change power state to D0

Here it seems to fail to turn on the power resource for the device (the
PG00._ON method in the errors above).

> [   52.969287] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   52.969289] pci_raw_set_power_state: 2 callbacks suppressed
> [   52.969291] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> [   53.029514] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   53.041027] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> [   53.041035] video LNXVIDEO:00: Cannot transition to power state D0 for parent in (unknown)
> [   53.053008] nouveau 0000:01:00.0: Refused to change power state, currently in D3
> 
> 
> And then kernel proceeds to throw up errors at random places, e.g.
> 
> [   67.021892] cfg80211: failed to load regulatory.db
> [   67.021895] cfg80211: failed to load regulatory.db
> [   67.021897] cfg80211: failed to load regulatory.db
> [   67.021900] cfg80211: failed to load regulatory.db
> [   67.021927] cfg80211: failed to load regulatory.db
> [   67.021928] cfg80211: failed to load regulatory.db
> [   67.021932] cfg80211: failed to load regulatory.db
> [   67.021934] cfg80211: failed to load regulatory.db
> [   67.024463] cfg80211: failed to load regulatory.db
> [   99.980625] iwlwifi 0000:00:14.3: Error sending STATISTICS_CMD: time out after 2000ms.
> 
> followed by soft lockups and sometimes hard lockups in places
> like attempts to walk skb lists.
> 
> Adding runpm=0 does away with this issue.
> 
> The specific test was with noaccel=1 - it does not seem to change
> things for me.
> 
> I poked at the ACPI method NVPO and yes it does actually
> seem to execute a while loop waiting for some register
> to become 0. Which I guess never happens? Because card
> is in a low power state and so reads return ffffffff maybe?

Yes, that could be the case.
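To illustrate the suspected failure mode, here is a minimal C sketch. It is
not the AML from this machine: read_reg() and wait_for_zero() are made-up
names, and the register is simulated. It assumes that a read from a
powered-down device returns all ones, which is what typically happens on
PCIe, so a loop waiting for 0 can only end through its timeout.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a device register read.
 * Assumption: a powered-down device answers every read with all ones.
 */
uint32_t read_reg(bool device_powered)
{
    return device_powered ? 0x0u : 0xffffffffu;
}

/*
 * Poll until the register reads 0, giving up after max_polls attempts.
 * Returns true if 0 was seen, false if the poll budget ran out. The
 * false case is loosely analogous to the interpreter aborting the
 * method with AE_AML_LOOP_TIMEOUT.
 */
bool wait_for_zero(bool device_powered, unsigned int max_polls)
{
    for (unsigned int i = 0; i < max_polls; i++) {
        if (read_reg(device_powered) == 0)
            return true;
    }
    return false;
}
```

With the device powered, wait_for_zero() succeeds on the first poll. With
the device unpowered, every read is 0xffffffff and the function only
returns once the poll budget is used up.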

> X isn't happy even with runpm=0 but that might be a different
> issue - I thought runpm=0 might be an easier place to start debugging
> things given there are logs of the failure.
> 
> Using 4.20.0-rc3 right now.
> 
> Userspace bits are from fedora 29:
> xorg-x11-drv-nouveau-1.0.15-6.fc29.x86_64
> 
> firmware is pretty recent:
> linux-firmware-20181008-88.gitc6b6265d.fc29.noarch
> 
> More hints for debugging would be appreciated.
> If anyone wants me to play with a different kernel tree,
> let me know.

Can you share full dmesg and acpidump of the system? I would like to
check the power resource methods.

