[Bug 108546] New: Loading i915 kernel module breaks NVMe PCI device on the new Coffee Lake box

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Thu Oct 25 06:44:04 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=108546

            Bug ID: 108546
           Summary: Loading i915 kernel module breaks NVMe PCI device on
                    the new Coffee Lake box
           Product: DRI
           Version: XOrg git
          Hardware: x86-64 (AMD64)
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: DRM/Intel
          Assignee: intel-gfx-bugs at lists.freedesktop.org
          Reporter: tiwai at suse.de
        QA Contact: intel-gfx-bugs at lists.freedesktop.org
                CC: intel-gfx-bugs at lists.freedesktop.org

The new Coffee Lake board DTCL2KEBQ contains the new i915 PCI ID 8086:3e98,
support for which was recently added by commit d0e062ebb3a4.
The problem is that loading the i915 kernel module on this machine suddenly
breaks the NVMe PCI device.

The symptom was first found on a SUSE Linux Enterprise 15 update kernel
containing the patch above, and then confirmed to be present on the latest
linux-next.

The graphics device itself seems to work after loading i915: the frame buffer
switches to the native resolution and the boot proceeds.  However, loading it
triggers DPC and AER messages from the pcieport, and NVMe gets broken.

[    4.881913] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    4.881987] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    4.882063] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    4.882136] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[    4.882196] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
[    4.921257] systemd: 25 output lines suppressed due to ratelimiting
[    5.162150] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    5.162230] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    5.162289] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    5.162355] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[    5.162412] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
....
[   39.804133] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[   39.928128] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[   39.928732] nvme nvme0: Removing after probe failure status: -19
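
As a side note on reading the log: the id=00d8 in the AER lines is a PCI
requester ID, which packs bus (8 bits), device (5 bits) and function (3 bits)
into 16 bits.  The small sketch below decodes it and shows that it points at
the same root port, 00:1b.0, that prints the messages.  The function name is
made up for illustration and is not part of any kernel or tool API.

```python
def decode_requester_id(rid):
    """Split a 16-bit PCI requester ID into bus:device.function."""
    bus = (rid >> 8) & 0xFF       # upper 8 bits
    device = (rid >> 3) & 0x1F    # next 5 bits
    function = rid & 0x7          # lowest 3 bits
    return f"{bus:02x}:{device:02x}.{function}"

# id=00d8 as reported in the AER messages of this report
print(decode_requester_id(0x00D8))  # prints 00:1b.0
```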

It turned out that the PCI device entries of other slots are changed just by
loading the i915 driver.  Attached below are the outputs of lspci -vvv for the
good (no i915) and bad (i915 loaded) cases.  For example, you can see the
difference in LTR:

In the good case we have:
        Capabilities: [2d0 v1] Latency Tolerance Reporting
                Max snoop latency: 3145728ns
                Max no snoop latency: 3145728ns

and in the bad case:
        Capabilities: [2d0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
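
For anyone who wants to compare the two lspci dumps mechanically, here is a
minimal sketch that pulls the LTR latencies out of a single device's
"lspci -vvv" text and reports which ones changed.  It assumes the output
format shown in the two excerpts above; the helper name and the two sample
strings are only for illustration.

```python
import re

# Matches e.g. "Max snoop latency: 3145728ns" / "Max no snoop latency: 0ns"
LTR_LINE = re.compile(r"Max (no snoop|snoop) latency:\s*(\d+)ns")

def ltr_latencies(lspci_text):
    """Return {'snoop': ns, 'no snoop': ns} for one device's lspci -vvv text."""
    return {kind: int(ns) for kind, ns in LTR_LINE.findall(lspci_text)}

# Excerpts copied from the good (no i915) and bad (i915 loaded) cases
good_text = """Capabilities: [2d0 v1] Latency Tolerance Reporting
        Max snoop latency: 3145728ns
        Max no snoop latency: 3145728ns"""
bad_text = """Capabilities: [2d0 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns"""

good = ltr_latencies(good_text)
bad = ltr_latencies(bad_text)

# Keep only the fields whose value differs between the two dumps
changed = {k: (good[k], bad.get(k)) for k in good if good[k] != bad.get(k)}
for kind, (before, after) in changed.items():
    print(f"Max {kind} latency: {before}ns -> {after}ns")
```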


The PCI DPC/AER messages could be suppressed by passing the pcie_aspm=off boot
option (which could be used as a workaround for AMD ThreadRipper), but this
didn't "fix" the actual NVMe error itself.

I also tried intel_iommu=igfx_off and pci=nommconf, but in vain.

The problem isn't about the boot sequence either.  If I boot with the nomodeset
option, let the system start up, and then reload i915 with modeset=1, it
screws up NVMe as above.  So it's actually the i915 driver that triggers the
breakage.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are the QA Contact for the bug.
You are on the CC list for the bug.