Sierra MC7354

Tue Mar 11 03:01:54 PDT 2014

Bjørn Mork <bjorn at mork.no> writes:
> szlin <lin.sunze at gmail.com> writes:
>
>>>> #qmicli -d /dev/cdc-wdm0 --get-service-version-info
>>>> [   79.644073] qmi_wwan 1-1:1.8: nonzero urb status received: -EPIPE
>>>> [10 Mar 2014, 20:22:03] -Warning ** Error reading from istream: Error
>>>> reading from file descriptor: Input/output error
>>>>
>>>> error: couldn't get service version info: Transaction timed out
>>>
>>> Hmm, wonder if that isn't a symptom of a firmware crash maybe?
>>
>> After seeing the error message, the system freezes without any kernel
>> panic/oops message.
>> (ping still works, but other services such as ssh are unavailable)
>
> OK, that sounds like we end up in a busy loop somewhere. Are you testing
> this on a single core system?
>
>> I'll keep tracing the communication between the module and qmi_wwan.
>
> I'll try to figure out how we can end up looping at this point. It does
> look like it is related to the status endpoint handling, so it is
> probably in the cdc-wdm driver somewhere.  We have changed that driver
> somewhat for v3.13, but I don't think any of the changes would fix a bug
> like this. It would still be useful to know if you can recreate the
> problem on v3.13.2 or later, if that is something you can easily do.

Hmm, looking at the code involved I don't think we have ever changed the
behaviour for EPIPE here.  We're reading from the control endpoint, so
there isn't really much we can do about it...  According to

 https://www.mail-archive.com/linux-usb-devel@lists.sourceforge.net/msg36748.html

the cause is either a transmission error or that the device didn't
understand the command. We never change the command - it's the same read
URB all the time - so I don't see how the device could suddenly not
understand it. Unless something scribbles on the memory, in which case
you are in bigger trouble.

So I guess we are looking at a transmission error of some sort.

We should of course deal with that without hanging anything. The output
from qmicli shows that the read terminates with EIO as expected in this
case.  The question is where it goes wrong from there.  I don't really
know.  It looks like we might end up never reading from the device
again, unless we receive another notification, but that should really be
fine in a sense.  The device or controller has failed and we don't know
how to get it into a workable state again.  But where does it hang?

I have no other ideas than trying to sprinkle a few debug printk's all
around cdc-wdm.c.  At least enable the debug output that's already
there.  If you have dynamic debugging, you can do:

  mount -t debugfs none /sys/kernel/debug
  echo 'file cdc-wdm.c +fp' > /sys/kernel/debug/dynamic_debug/control 

You should then see at least one "zero length - clearing WDM_READ"
debug message after the error.  But since your system fails at this
point, you might want to add more debug messages around the in callback
and read functions (wdm_in_callback() and wdm_read()) first.
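
If you end up toggling this often, the two commands above could be wrapped
in a couple of shell functions.  This is only a rough sketch: the names
dyndbg_set, dyndbg_show and DYNDBG_CONTROL are made up for illustration, and
it assumes debugfs is already mounted and that the kernel was built with
CONFIG_DYNAMIC_DEBUG.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.
# Path of the dynamic debug control file (assumes debugfs is mounted here).
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/sys/kernel/debug/dynamic_debug/control}"

# Change the flags of all debug call sites in one source file.
# $1 = source file name (e.g. cdc-wdm.c), $2 = flags (e.g. +fp or -p)
dyndbg_set() {
    if [ ! -w "$DYNDBG_CONTROL" ]; then
        echo "cannot write $DYNDBG_CONTROL (debugfs mounted? root?)" >&2
        return 1
    fi
    echo "file $1 $2" > "$DYNDBG_CONTROL"
}

# List the control file entries that mention the given source file.
dyndbg_show() {
    grep -F "$1" "$DYNDBG_CONTROL"
}
```

Run as root, "dyndbg_set cdc-wdm.c +fp" followed by "dyndbg_show cdc-wdm.c"
should list the cdc-wdm call sites with their flags set, and
"dyndbg_set cdc-wdm.c -p" turns the messages off again.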

Bjørn