EM7565 failing if host reboots whithout power cycling the modem
Bjørn Mork
bjorn at mork.no
Sun Mar 19 10:27:40 UTC 2023
This is not an issue caused by ModemManager or any of the host drivers.
Nut I don't know where else to ask. And this documents the issue for my
next Google search :-)
I'm using a Sierra Wireless EM7565 connected to the USB3 port of a
Linksys WRT1900ACv1 "Mamba" (Marvell Armada 370/XP SoC), which has an
Etron EJ168 PCIe connected xhci controller. The system runs a recent
OpenWrt build (v5.15 kernel) with ModemManager.
This works fine after the initial power-on or after physically
reconnecting the modem each time the host is rebooted. But it fails
consistently if I reboot the host without disconnecting the modem.
The xhci driver will then log these messages while MM is probing the
modem:
[ 78.014486] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.021419] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.028341] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.035270] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.043035] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.049973] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.056886] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.063808] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.070723] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.077633] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.084552] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.092256] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.099154] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 78.106082] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.007691] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.014629] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.021554] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.028474] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.035391] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.042306] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.049212] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.056669] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.063590] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.070504] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
[ 81.077411] xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
and probing eventually fails. Likewise with any attempt to use mbimcli
etc. The /dev/cdc-wdm0 device appears dead. So I assume those warnings
are real errors, masked by the XHCI_TRUST_TX_LENGTH quirk applied to
this xhci device a decade ago.
My main concern is figuring out an acceptable host based workaround.
Physically reconnecting the modem after each host reboot is not an
option. I've tried usbreset, but the EM7565 will only reset its USB
device controller Which causes a successful USB disconnect/reconnect and
new host driver probing etc. But the modem firmware is not rebooted by
that, and the error state is not cleared.
I believe the root cause is revealed by logging into the modem with "adb
shell" and running dmesg (the host system was rebooted 289 seconds after
the last power cycling here):
[ 23.224873] mbimd: IP config response sent
[ 60.640200] sierra_startup_monitor
[ 289.304104] msm-dwc3 8a00000.ssusb: Avail curr from USB = 0
[ 289.304165] diag: USB channel diag disconnected
[ 289.305154] dwc3 8a00000.dwc3: request c67e8b00 was not queued to ep0out
[ 289.306464] android_work: android_work: sent uevent USB_STATE=DISCONNECTED
[ 289.362231] dwc3 8a00000.dwc3: failed to stop controller
[ 289.362592] read descriptors
[ 289.362609] read strings
[ 289.362643] Match for Function Name: diag, Int #0
[ 289.362651] name:diag, interface id:0, id:0
[ 289.362690] Match for Function Name: Function FS Gadget, Int #1
[ 289.362699] name:Function FS Gadget, interface id:1, id:1
[ 289.362712] Match for Function Name: nmea, Int #2
[ 289.362719] name:nmea, interface id:2, id:2
[ 289.362739] Match for Function Name: modem, Int #3
[ 289.362746] name:modem, interface id:3, id:3
[ 289.362772] Match for Function Name: mbim, Int #12
[ 289.362779] name:mbim, interface id:12, id:4
[ 289.362786] Match for Function Name: mbim, Int #13
[ 289.362793] name:mbim, interface id:13, id:5
[ 289.362815] msm-dwc3 8a00000.ssusb: ep [gsi-epin,29] already configured as msm endpoint
[ 289.362826] msm-dwc3 8a00000.ssusb: ep [gsi-epout,30] already configured as msm endpoint
[ 289.414838] android_work: android_work: did not send uevent (0 0 00000000)
[ 289.715338] android_work: android_work: did not send uevent (0 0 00000000)
[ 289.924428] android_work: android_work: sent uevent USB_STATE=CONNECTED
[ 289.990573] android_usb gadget: super-speed config #1: 86000c8.android_usb
[ 289.990633] diag: USB channel diag connected
[ 289.990795] msm-dwc3 8a00000.ssusb: Avail curr from USB = 900
[ 290.050233] android_work: android_work: sent uevent USB_STATE=CONFIGURED
[ 298.260790] ipa_usb_notify_cb: ipa_usb_notify_cb: Set net_ready_trigger
[ 308.165691] ------------[ cut here ]------------
[ 308.165857] WARNING: CPU: 0 PID: 730 at /work/mdm9x50/apps_proc/oe-core/build/tmp-glibc/work-shared/mdm9650-perf/kernel-source/include/linux/kref.h:47 glink_get_ch_ctx+0x48/0x6c()
[ 308.165879] Modules linked in: alx(O) embms_kernel(O)
[ 308.165930] CPU: 0 PID: 730 Comm: kworker/u2:10 Tainted: G W O 3.18.31 #1
[ 308.165966] Workqueue: glink_pkt_wq glink_pkt_queue_rx_intent_worker
[ 308.166035] [<c001483c>] (unwind_backtrace) from [<c0011dc4>] (show_stack+0x10/0x14)
[ 308.166092] [<c0011dc4>] (show_stack) from [<c001f1e8>] (warn_slowpath_common+0x68/0x88)
[ 308.166144] [<c001f1e8>] (warn_slowpath_common) from [<c001f298>] (warn_slowpath_null+0x18/0x20)
[ 308.166191] [<c001f298>] (warn_slowpath_null) from [<c0291aac>] (glink_get_ch_ctx+0x48/0x6c)
[ 308.166240] [<c0291aac>] (glink_get_ch_ctx) from [<c0292734>] (glink_queue_rx_intent+0x18/0x93c)
[ 308.166290] [<c0292734>] (glink_queue_rx_intent) from [<c02baad0>] (glink_pkt_queue_rx_intent_worker+0x64/0x120)
[ 308.166340] [<c02baad0>] (glink_pkt_queue_rx_intent_worker) from [<c0032108>] (process_one_work+0x244/0x3fc)
[ 308.166389] [<c0032108>] (process_one_work) from [<c0032f04>] (worker_thread+0x2f0/0x440)
[ 308.166439] [<c0032f04>] (worker_thread) from [<c00366dc>] (kthread+0xc4/0xd8)
[ 308.166489] [<c00366dc>] (kthread) from [<c000e6e0>] (ret_from_fork+0x14/0x34)
[ 308.166515] ---[ end trace e8ab27ce022bebf1 ]---
[ 308.166546] <CORE> (null):(null):DATA40_CNTL[8:0] glink_queue_rx_intent: Channel is not fully opened
[ 308.166609] <GLINK_PKT> err: glink_pkt_queue_rx_intent_worker queue_rx_intent failed
[ 308.166657] <CORE> (null):(null):DATA40_CNTL[8:0] glink_queue_rx_intent: Channel is not fully opened
[ 308.166693] <GLINK_PKT> err: glink_pkt_queue_rx_intent_worker queue_rx_intent failed
[ 308.182553] ------------[ cut here ]------------
[ 308.182568] kernel BUG at /work/mdm9x50/apps_proc/oe-core/build/tmp-glibc/work-shared/mdm9650-perf/kernel-source/mm/slub.c:3403!
[ 308.182577] Internal error: Oops - BUG: 0 [#1] PREEMPT ARM
[ 308.182584] Modules linked in: alx(O) embms_kernel(O)
[ 308.182602] CPU: 0 PID: 1007 Comm: mbimd Tainted: G W O 3.18.31 #1
[ 308.182611] task: c151b480 ti: c4bfa000 task.ti: c4bfa000
[ 308.182623] PC is at kfree+0xd8/0x1e4
[ 308.182633] LR is at free_pipe_info+0x60/0x6c
[ 308.182643] pc : [<c00e0c54>] lr : [<c00ee46c>] psr: 40070013
[ 308.182643] sp : c4bfbf20 ip : c0e9f80c fp : 7f615358
[ 308.182653] r10: c49dcc08 r9 : c6ce1d90 r8 : c5401ee0
[ 308.182661] r7 : c00ee46c r6 : c577a000 r5 : 00000010 r4 : cfd76d28
[ 308.182669] r3 : 00000000 r2 : 000c4d28 r1 : 00000000 r0 : cfcb2000
[ 308.182678] Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
[ 308.182687] Control: 10c53c7d Table: 8501c059 DAC: 00000015
[ 308.182695] Process mbimd (pid: 1007, stack limit = 0xc4bfa230)
[ 308.182703] Stack: (0xc4bfbf20 to 0xc4bfc000)
[ 308.182716] bf20: c4c97700 00000010 00000018 00000008 c5401ee0 c6ce1d90 c49dcc08 c00ee46c
[ 308.182729] bf40: c4c97700 c49dcc00 c5402cb8 c00ee590 c49dcc00 c5402cb8 00000000 c00e83e4
[ 308.182742] bf60: 00000000 00000000 c151b90c c151b480 00000000 c0eefed0 c000e7c4 c4bfa000
[ 308.182755] bf80: 00000000 c0035218 c49dcc00 c000e7c4 c4bfa000 c4bfbfb0 00000004 c0011794
[ 308.182768] bfa0: 7f653a58 00000000 7f653918 c000e688 00000001 b6ecb060 00000001 b6ecce8c
[ 308.182781] bfc0: 7f653a58 00000000 7f653918 00000004 be8a7b54 7f615390 7f653918 7f615358
[ 308.182793] bfe0: 00000002 be8a7ae8 b6a5a860 b6d3a454 60060010 0000004d 00000000 00000000
[ 308.182816] [<c00e0c54>] (kfree) from [<c00ee46c>] (free_pipe_info+0x60/0x6c)
[ 308.182833] [<c00ee46c>] (free_pipe_info) from [<c00ee590>] (pipe_release+0xa0/0xb0)
[ 308.182849] [<c00ee590>] (pipe_release) from [<c00e83e4>] (__fput+0xd4/0x1d8)
[ 308.182866] [<c00e83e4>] (__fput) from [<c0035218>] (task_work_run+0x90/0xa4)
[ 308.182885] [<c0035218>] (task_work_run) from [<c0011794>] (do_work_pending+0xb4/0xc8)
[ 308.182902] [<c0011794>] (do_work_pending) from [<c000e688>] (work_pending+0xc/0x20)
[ 308.182915] Code: 1a00000b e5943000 e3130903 1a000000 (e7f001f2)
[ 308.182924] ---[ end trace e8ab27ce022bebf2 ]---
So it looks like the mbimd process in the modem manages to crash the
modem kernel. Nice...
But what can be done about that? From my point of view, fixing the
issue in the modem firmware is out of the question. The modem is running
SWI9X50C_01.14.13.00, which the latest available.
We need a woraround for the host system. But what? Since I have adb
access, I can obviously just reboot the modem. Which works. But what
about the "normal" modem configuration? Figure out a way to cut USB
port power? I guess this is possible on that platform, although I don't
think it's implemented. It's not a very generic solution, though. But
maybe this isn't a very generic problem? Maybe it's just this exact
combo of host controller, device and continous power to the port during
reboot (which I'm very much in favour of in general)?
Have anyone else noticed similar problems with the EM7565? Or any other
Qualcomm MDM9x50 device? I assume this bug isn't really Sierra Wireless
specific, but rather something they got from Qualcomm.
You'll obviously not see this problem if your host system cuts power to
the USB port on reboots. Which unfortunately often is the case.
I've tried to provoke the issue by e.g unbinding and rebinding the
xhci_pci driver, or by resetting the USB device. But none of these
actions cause the error to show up. There are some occasional snags,
but nothing casuing MM probe failures. The modem firmware logs the
suspicious
"dwc3 8a00000.dwc3: failed to stop controller"
on unbind, but it doesn't crash. And it works as usual on rebind.
Bjørn
More information about the ModemManager-devel
mailing list