QAIC reset failure

Jeffrey Hugo quic_jhugo at quicinc.com
Mon Jan 22 22:57:11 UTC 2024


On 1/16/2024 9:58 AM, Baruch Siach wrote:
> Hi qaic driver maintainers,

Sorry, I was on holiday last week and am just now catching up on email and
seeing this.

> I am testing an A100 device on arm64 platform. Kernel version is current
> Linus master as of commit 052d534373b7. The driver is unable to reset
> the device properly.
> 
> [  137.706765] pci 0000:01:00.0: enabling device (0000 -> 0002)
> [  137.712528] pci 0000:02:00.0: enabling device (0000 -> 0002)
> [  137.718230] qaic 0000:03:00.0: enabling device (0000 -> 0002)
> [  137.725720] [drm] Initialized qaic 0.0.0 20190618 for 0000:03:00.0 on minor 0
> [  137.734326] mhi mhi0: Requested to power ON
> [  137.738520] mhi mhi0: Power on setup success
> [  137.855108] mhi mhi0: Wait for device to enter SBL or Mission mode

This all looks good.

> [  137.861578] qaic_timesync mhi0_QAIC_TIMESYNC: 20: Failed to receive START channel command completion
> [  137.870733] qaic_timesync mhi0_QAIC_TIMESYNC: 21: Failed to reset channel, still resetting
> [  137.879063] qaic_timesync mhi0_QAIC_TIMESYNC: 20: Failed to reset channel, still resetting
> [  137.887334] qaic_timesync: probe of mhi0_QAIC_TIMESYNC failed with error -5
> [  137.894866] qaic_timesync mhi0_QAIC_TIMESYNC: 20: Failed to receive START channel command completion
> [  137.904006] qaic_timesync mhi0_QAIC_TIMESYNC: 21: Failed to reset channel, still resetting
> [  137.912263] qaic_timesync mhi0_QAIC_TIMESYNC: 20: Failed to reset channel, still resetting
> [  137.920517] qaic_timesync: probe of mhi0_QAIC_TIMESYNC failed with error -5
> [  140.807091] mhi mhi0: Device failed to enter MHI Ready
> [  143.695094] mhi mhi0: Device failed to enter MHI Ready

This looks like the device stopped responding to the host, early in 
boot.  Trying to access channels while the device is not in MHI Ready 
state is odd.

> This is with firmware from SDK version 1.12.2.0. I tried also version
> 1.10.0.193 with similar results.
> 
> Some more state information from MHI debugfs below.
> 
> /sys/kernel/debug/mhi/mhi0/regdump:
> Host PM state: SYS ERROR Process Device state: RESET EE: DISABLE
> Device EE: PRIMARY BOOTLOADER state: SYS ERROR
> MHI_REGLEN: 0x100
> MHI_VER: 0x1000000
> MHI_CFG: 0x8000000
> MHI_CTRL: 0x0
> MHI_STATUS: 0xff04
> MHI_WAKE_DB: 0x1
> BHI_EXECENV: 0x0
> BHI_STATUS: 0xa93f0935
> BHI_ERRCODE: 0x0
> BHI_ERRDBG1: 0xc0300000
> BHI_ERRDBG2: 0xb
> BHI_ERRDBG3: 0xcabb0

This suggests that the device crashed, which is unexpected.
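
As a rough illustration of why the dump points to a crash, here is a minimal
sketch that decodes the MHI_STATUS value above. It assumes the MHISTATUS bit
layout and state encoding used by the Linux MHI core (READY in bit 0, SYSERR
in bit 2, MHI state in bits 15:8). The helper name decode_mhi_status is
hypothetical and not part of any driver or tool.

```python
# Assumption: bit layout and state values mirror the Linux MHI core
# definitions (MHISTATUS_* masks and enum mhi_state).
MHISTATUS_READY_MASK = 1 << 0
MHISTATUS_SYSERR_MASK = 1 << 2
MHISTATUS_MHISTATE_SHIFT = 8
MHISTATUS_MHISTATE_MASK = 0xFF << MHISTATUS_MHISTATE_SHIFT

MHI_STATES = {
    0x00: "RESET",
    0x01: "READY",
    0x02: "M0",
    0x03: "M1",
    0x04: "M2",
    0x05: "M3",
    0x06: "M3_FAST",
    0x07: "BHI",
    0xFF: "SYS_ERR",
}


def decode_mhi_status(value):
    """Split a raw MHI_STATUS register value into its assumed fields."""
    state = (value & MHISTATUS_MHISTATE_MASK) >> MHISTATUS_MHISTATE_SHIFT
    return {
        "ready": bool(value & MHISTATUS_READY_MASK),
        "syserr": bool(value & MHISTATUS_SYSERR_MASK),
        "state": MHI_STATES.get(state, "UNKNOWN(0x%x)" % state),
    }


if __name__ == "__main__":
    # MHI_STATUS value taken from the regdump quoted above.
    print(decode_mhi_status(0xFF04))
```

Under those assumptions, 0xff04 has the SYSERR bit set, the READY bit clear,
and a state field of 0xff (SYS_ERR), which is consistent with the
"state: SYS ERROR" line in the regdump.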

> /sys/kernel/debug/mhi/mhi0/states:
> PM state: SYS ERROR Process Device: Inactive MHI state: RESET EE: DISABLE wake: true
> M0: 2 M2: 0 M3: 0 device wake: 0 pending packets: 0
> 
> Any idea?

We may need our firmware engineers involved.  I think there is already a 
thread with some of the POCs involved.

-Jeff

