MTU issues.

Mon Jan 27 18:04:03 UTC 2020

I cannot explain the issues you are having, but I can try to explain the
reasons behind the RX_KILL stuff, since I happened to write that.

One of my modems would sometimes end up in a state where it flooded the
host with zero-length frames.  This happened at a high enough rate to
bring the host to a complete halt, since usbnet was desperately trying to
allocate new maximum-sized skbs at the same rate.  So I figured it was
better to give the host a break now and then whenever the incoming error
rate was above some arbitrary threshold.  This allowed the host to, e.g.,
reset the modem to bring it out of the faulty state.
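
To make the idea concrete, here is a rough user-space sketch of that kind
of error-rate check.  It is not copied from usbnet.c: the struct, the
function names and the two thresholds are assumptions made up for
illustration.  It only shows the pattern of counting bad frames within a
window of received frames and pausing RX when too many of them are bad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative thresholds (assumptions for this sketch). */
#define WINDOW_FRAMES 30 /* counters restart after this many frames */
#define MAX_ERRORS    20 /* more bad frames than this in a window pauses RX */

struct rx_monitor {
	unsigned int pkt_cnt; /* frames seen in the current window */
	unsigned int pkt_err; /* bad frames seen in the current window */
	bool rx_kill;         /* stands in for the EVENT_RX_KILL flag bit */
};

void rx_monitor_init(struct rx_monitor *m)
{
	assert(m != NULL);
	m->pkt_cnt = 0;
	m->pkt_err = 0;
	m->rx_kill = false;
}

/* Account for one completed RX frame; returns true when RX should pause. */
bool rx_monitor_frame(struct rx_monitor *m, bool frame_bad)
{
	assert(m != NULL);
	if (++m->pkt_cnt > WINDOW_FRAMES) {
		/* Window is over: start counting from scratch. */
		m->pkt_cnt = 0;
		m->pkt_err = 0;
	} else {
		if (frame_bad)
			m->pkt_err++;
		if (m->pkt_err > MAX_ERRORS)
			m->rx_kill = true;
	}
	return m->rx_kill;
}

/* Deferred work: after queued buffers are dropped, RX is enabled again. */
void rx_monitor_restart(struct rx_monitor *m)
{
	assert(m != NULL);
	m->rx_kill = false;
}
```

In this sketch the restart does not reset the counters, so a device that
keeps delivering nothing but bad frames trips the check again on the next
frame.  A stream in which only every other frame is bad never reaches the
limit within one window.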

The underlying issue was of course a modem firmware bug, and there
wasn't much we could do about that. I wonder if that might be the case
for you too?  What do the frames triggering these rx errors look like?
Maybe you could snoop on the USB bus to see if there is something
obviously wrong?

Bjørn

Paul Gildea <gildeap at tcd.ie> writes:

> Hi.
>
> I was mistaken about this ever happening with a 1500 MTU; it does not.
> Looking into this further it seems to be an issue with the usbnet driver
> that qmi_wwan is leveraging.
> Turning on debugfs and enabling debugging for usbnet.c shows the errors:
>
> mount -t debugfs none /sys/kernel/debug
> echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
>
> Looking at the log you can see that the rxqlen counter rapidly increases
> until it is full and a throttle occurs, this throttle error then repeats
> forever (quite rapidly) until the modem is reset, even if the ping is
> stopped:
>
> Sep 23 10:16:16 T6 kernel: [  715.615456] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> Sep 23 10:16:16 T6 kernel: [  715.743152] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
> Sep 23 10:16:16 T6 kernel: [  715.743307] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
> Sep 23 10:16:16 T6 kernel: [  715.743320] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
> Sep 23 10:16:16 T6 kernel: [  715.743331] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
> Sep 23 10:16:16 T6 kernel: [  715.744947] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> Sep 23 10:16:16 T6 kernel: [  715.871077] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
> Sep 23 10:16:16 T6 kernel: [  715.871115] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
> Sep 23 10:16:16 T6 kernel: [  715.871128] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
> Sep 23 10:16:16 T6 kernel: [  715.871138] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
> Sep 23 10:16:16 T6 kernel: [  715.874448] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> Sep 23 10:16:17 T6 kernel: [  716.007133] qmi_wwan 2-2:1.8 wwan0: rxqlen 280 --> 290
> Sep 23 10:16:17 T6 kernel: [  716.007172] qmi_wwan 2-2:1.8 wwan0: rxqlen 290 --> 300
> Sep 23 10:16:17 T6 kernel: [  716.007185] qmi_wwan 2-2:1.8 wwan0: rxqlen 300 --> 310
> Sep 23 10:16:17 T6 kernel: [  716.007197] qmi_wwan 2-2:1.8 wwan0: rxqlen 310 --> 318
>
>
> This happens on USB3 and not USB2 (unless I pick a small MTU like 500).
> With USB2 I generally don't get any of the above errors and can then go
> back to pinging with regular sized packets no problem.
> The error code above, 71, refers to:
>
> #define EPROTO 71 /* Protocol error */
>
> The rxqlen message comes from usbnet_bh(), which is described as *“tasklet
> (work deferred from completions, in_irq) or timer”* in the source code.
> It is too large to paste, but it looks like the cleanup function that
> releases the socket buffers (skb) and USB Request Blocks (urb). An
> interesting snippet of code from this function, after releasing the buffers:
>
> 	/* restart RX again after disabling due to high error rate */
> 	clear_bit(EVENT_RX_KILL, &dev->flags);
>
> Buffers are coming in with -EPROTO at a high rate so the usbnet driver
> disables RX. The buffers get released, which explains the packet loss, and
> Rx is restarted. Obviously whatever issue was underlying is still present
> and so Rx is disabled again and the dance restarts.
>
> There are a few scenarios in which it can occur, but the two easiest to
> reproduce are:
>
>    1. Send pings larger than the non-1500 MTU; the errors occur after
>    ~300 seconds.
>    2. Set a non-1500 MTU and reset the modem without sending anything; the
>    errors occur immediately afterwards.
>
>
> Still trying to figure it out,
>
> --
> Paul
>
> On Fri, 24 Jan 2020 at 17:26, Paul Gildea <gildeap at tcd.ie> wrote:
>
>> Hi guys, I have done some more testing with this. Using Ubuntu 16.04, I
>> installed libqmi (1.16) to test that as well, on a fresh system (a laptop).
>> Set the MTU of the private network to 1430 and pinged with large packets
>> (2000) and after a couple of hundred pings this behaviour repeated itself.
>> When I thought that the system was fine yesterday, it turns out it just
>> took more pings for it to fall over. At 1500 MTU and 300 pings the same
>> thing occurred with the private network.
>> Massive amounts of input errors, modem becomes unusable until reset.
>>
>> I can see pings arriving at the back end, it's when they are returning to
>> the modem that there appears to be an issue.
>> This is happening with SW and Telit with varying modems and multiple
>> networks.
>> It seems to me to be a qmi_wwan driver issue, what do you think?
>>
>> Regards
>>
>> --
>> Paul
>>
>> On Thu, 23 Jan 2020 at 17:15, Paul Gildea <gildeap at tcd.ie> wrote:
>>
>>> Hi,
>>>
>>> We are seeing MTU issues with the ATT network and were just wondering if
>>> you had heard of anything like this happening before? ATT push an MTU of
>>> 1430; we apply that to our Linux interface and everything works fine.
>>> However, when we try to ping with a large packet, the pings fail instead
>>> of fragmenting correctly, and we also see input errors on the Linux
>>> interface.
>>>
>>> In some scenarios the input errors never stop (getting thousands of them
>>> every second) after we stop pinging and the modem needs to be rebooted to
>>> pass any traffic again. Changed MTU values to a lot of different things and
>>> this always occurs.
>>>
>>> On a private network here (MTU 1404) and Vodafone (MTU 1500), doing the
>>> same thing with the network-pushed MTU values causes no issues and
>>> everything works fine. I tested moving away from 1500 on Vodafone to lower
>>> values and once again the issues reoccurred, same as with ATT.
>>>
>>> Libqmi 1.24 and I have seen this behaviour with all modems tested so far:
>>> MC7455, EM7565, EM7511 and Telit LM960A18.
>>>
>>> Regards,
>>>
>>> --
>>> Paul
>>>
>>