MTU issues.

Bjørn Mork bjorn at mork.no
Fri Feb 7 08:52:10 UTC 2020


Great finding!

This all makes sense when you explain it like that.  The proposed patch
looks fine to me, preferably with most of your explanation as commit
message.

But maybe we should consider decoupling mru and mtu with a new
_change_mtu() function instead of the '+ 4' hack? We don't have to use
usbnet_change_mtu().  It was just convenient as long as it did what we
wanted.  When it doesn't, then we might as well replace it.

This would also allow us to solve the issue Daniele points out wrt
QMI_WDA_SET_DATA_FORMAT.

Don't let these proposals stall a bugfix though.  We can always use your
proposed patch as a quickfix now, and then DTRT later when we figure out
what that is.



Bjørn

Paul Gildea <gildeap at tcd.ie> writes:

> Hi,
>
> Thanks for that Bjørn, helped me understand! Dug deeper into this and found
> the issue and a solution.
> Packets are being rejected in the ring buffer used by the xHCI controller.
>
> Enabled traces for usbnet.c and xhci-ring.c file in debugfs as follows:
>
> mount -t debugfs none /sys/kernel/debug
> echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'file xhci-ring.c +p' > /sys/kernel/debug/dynamic_debug/control
>
> Saw a lot of errors indicating URB with len=0 when it was expecting
> len=1430. This was preceded by an URB of len 1024 bytes:
>
> xhci_hcd 0000:00:14.0: Babble error for slot 10 ep 12 on endpoint
> xhci_hcd 0000:00:14.0: Giveback URB ffff8802193be540, len = 1024, expected
> = 1430, status = -75
> xhci_hcd 0000:00:14.0: Transfer error for slot 10 ep 12 on endpoint
> /repeated
>
> Error -75 is EOVERFLOW, which makes sense as per
> include/uapi/asm-generic/errno.h:
>
> #define EOVERFLOW 75 /* Value too large for defined data type */
>
> *"Babble error"*: A packet larger than the MTU arrives in Linux from the
> modem and is discarded with an -EOVERFLOW error.
> This is seen on USB3.0 and USB2.0 buses. This is essentially because the
> MRU (Maximum Receive Unit) is not a separate entity from the MTU (Maximum
> Transmission Unit), and received packets can be larger than those
> transmitted.
>
> *"Endless input error"*: Following the babble error we see an endless
> supply of zero-length URBs which are rejected with -EPROTO (increasing the
> rx input error counter each time).
> This is only seen on USB3.0. These continue to come ad infinitum until the
> modem is shut down.
>
> There appears to be a bug in the core USB handling code in Linux that
> doesn't deal well with network MTUs smaller than 1500 bytes. By default the
> dev->hard_mtu (the "real" MTU) is in lockstep with dev->rx_urb_size
> (essentially an MRU), and it's the latter that is causing trouble. This has
> nothing to do with the modems; the issue can be reproduced by getting a
> USB-Ethernet dongle, setting the MTU to 1430, and pinging with size greater
> than 1406.
>
> Will submit the below patch that solves the issue, if that is acceptable?
>
>
> +diff -Naur linux-4.14.73/drivers/net/usb/qmi_wwan.c linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c
> +--- linux-4.14.73/drivers/net/usb/qmi_wwan.c	2018-09-29 11:06:07.000000000 +0100
> ++++ linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c	2020-01-31 18:05:07.709008785 +0000
> +@@ -740,6 +740,14 @@
> + 	}
> + 	dev->net->netdev_ops = &qmi_wwan_netdev_ops;
> + 	dev->net->sysfs_groups[0] = &qmi_wwan_sysfs_attr_group;
> ++
> ++	/* LTE Networks don't always respect their own MTU on receive side;
> ++	 * e.g. AT&T pushes 1430 MTU but still allows 1500 byte packets from
> ++	 * far-end network. Make receive buffer large enough to accommodate
> ++	 * them, and add four bytes so MTU does not equal MRU on network
> ++	 * with 1500 MTU otherwise usbnet_change_mtu() will change both.
> ++	 * This is a sufficient max receive buffer as over 1500 MTU,
> ++	 * USB driver issues are not seen.
> ++	 */
> ++	dev->rx_urb_size = ETH_DATA_LEN + 4;
> + err:
> + 	return status;
> + }
>
>
>
> Regards,
>
> --
> Paul
>
> On Mon, 27 Jan 2020 at 18:04, Bjørn Mork <bjorn at mork.no> wrote:
>
>> I cannot explain the issues you are having, but I can try to explain the
>> reasons behind the RX_KILL stuff since I happened to write that..
>>
>> One of my modems would sometimes end up in a state where it flooded the
>> host with 0 length frames.  This happened at a high enough rate to bring
>> the host to a complete halt, since usbnet was desperately trying to
>> allocate new max sized skbs at the same rate...  So I figured it was
>> better to let the host have a break now and then when the incoming error
>> rate was above some arbitrary threshold.  This allowed the host to e.g
>> reset the modem to bring it out of the faulty state.
>>
>> The underlying issue was of course a modem firmware bug, and there
>> wasn't much we could do about that. I wonder if that might be the case
>> for you too?  What do the frames triggering these rx errors look like?
>> Maybe you could snoop on the USB bus to see if there is something
>> obviously wrong?
>>
>>
>>
>> Bjørn
>>
>> Paul Gildea <gildeap at tcd.ie> writes:
>>
>> > Hi.
>> >
>> > I was mistaken about this ever happening with a 1500 MTU, it does not.
>> > Looking into this further it seems to be an issue with the usbnet driver
>> > that qmi_wwan is leveraging.
>> > Turning on debugfs and enabling debugging for usbnet.c shows the errors:
>> >
>> > mount -t debugfs none /sys/kernel/debug
>> > echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
>> >
>> > Looking at the log you can see that the rxqlen counter rapidly increases
>> > until it is full and a throttle occurs, this throttle error then repeats
>> > forever (quite rapidly) until the modem is reset, even if the ping is
>> > stopped:
>> >
>> > Sep 23 10:16:16 T6 kernel: [  715.615456] qmi_wwan 2-2:1.8 wwan0: rx
>> > throttle -71
>> > Sep 23 10:16:16 T6 kernel: [  715.743152] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 281 --> 291
>> > Sep 23 10:16:16 T6 kernel: [  715.743307] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 291 --> 301
>> > Sep 23 10:16:16 T6 kernel: [  715.743320] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 301 --> 311
>> > Sep 23 10:16:16 T6 kernel: [  715.743331] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 311 --> 318
>> > Sep 23 10:16:16 T6 kernel: [  715.744947] qmi_wwan 2-2:1.8 wwan0: rx
>> > throttle -71
>> > Sep 23 10:16:16 T6 kernel: [  715.871077] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 281 --> 291
>> > Sep 23 10:16:16 T6 kernel: [  715.871115] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 291 --> 301
>> > Sep 23 10:16:16 T6 kernel: [  715.871128] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 301 --> 311
>> > Sep 23 10:16:16 T6 kernel: [  715.871138] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 311 --> 318
>> > Sep 23 10:16:16 T6 kernel: [  715.874448] qmi_wwan 2-2:1.8 wwan0: rx
>> > throttle -71
>> > Sep 23 10:16:17 T6 kernel: [  716.007133] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 280 --> 290
>> > Sep 23 10:16:17 T6 kernel: [  716.007172] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 290 --> 300
>> > Sep 23 10:16:17 T6 kernel: [  716.007185] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 300 --> 310
>> > Sep 23 10:16:17 T6 kernel: [  716.007197] qmi_wwan 2-2:1.8 wwan0: rxqlen
>> > 310 --> 318
>> >
>> >
>> > This happens on USB3 and not USB2 (unless I pick a small MTU like 500).
>> > With USB2 I generally don't get any of the above errors and can then go
>> > back to pinging with regular sized packets no problem.
>> > The above error, -71, refers to:
>> >
>> > #define EPROTO 71 /* Protocol error */
>> >
>> > The rxqlen error is coming from usbnet_bh(), which is described as
>> > "tasklet (work deferred from completions, in_irq) or timer" in the
>> > source code. This is too large to paste, but looks like the cleanup
>> > function to release the Socket Buffers (skb) and USB Request Blocks
>> > (urb). Interesting snippet of code from this function, after releasing
>> > the buffers:
>> >
>> >       /* restart RX again after disabling due to high error rate */
>> >       clear_bit(EVENT_RX_KILL, &dev->flags);
>> >
>> > Buffers are coming in with -EPROTO at a high rate so the usbnet driver
>> > disables RX. The buffers get released, which explains the packet loss,
>> > and Rx is restarted. Obviously whatever issue was underlying is still
>> > present and so Rx is disabled again and the dance restarts.
>> >
>> > There are a few scenarios in which it can occur, but the two easiest to
>> > reproduce are:
>> >
>> >    1. Send pings larger than the non-1500 MTU, and they occur after
>> >    ~300 seconds.
>> >    2. Set a non-1500 MTU, reset the modem without sending anything, and
>> >    they occur immediately afterwards.
>> >
>> >
>> > Still trying to figure it out,
>> >
>> > --
>> > Paul
>> >
>> > On Fri, 24 Jan 2020 at 17:26, Paul Gildea <gildeap at tcd.ie> wrote:
>> >
>> >> Hi guys, have done some more testing with this. Using ubuntu 16.04 I
>> >> installed libqmi (1.16) to test that also, on a fresh system (a laptop).
>> >> Set the MTU of the private network to 1430 and pinged with large packets
>> >> (2000) and after a couple of hundred pings this behaviour repeated
>> >> itself.
>> >> When I thought that the system was fine yesterday, it turns out it just
>> >> took more pings for it to fall over. At 1500 MTU and 300 pings the same
>> >> thing occurred with the private network.
>> >> Massive amounts of input errors, modem becomes unusable until reset.
>> >>
>> >> I can see pings arriving at the back end, it's when they are returning
>> >> to the modem that there appears to be an issue.
>> >> This is happening with SW and Telit with varying modems and multiple
>> >> networks.
>> >> It seems to me to be a qmi_wwan driver issue, what do you think?
>> >>
>> >> Regards
>> >>
>> >> --
>> >> Paul
>> >>
>> >> On Thu, 23 Jan 2020 at 17:15, Paul Gildea <gildeap at tcd.ie> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We are seeing MTU issues with the ATT network and were just wondering
>> >>> if you had heard of anything like this happening before? ATT push an
>> >>> MTU of 1430, we apply that to our linux interface and everything works
>> >>> fine. However, when we ping with a large packet, the pings fail
>> >>> instead of fragmenting correctly, and we also see input errors on the
>> >>> linux interface.
>> >>>
>> >>> In some scenarios the input errors never stop (getting thousands of
>> >>> them every second) after we stop pinging and the modem needs to be
>> >>> rebooted to pass any traffic again. Changed MTU values to a lot of
>> >>> different things and this always occurs.
>> >>>
>> >>> On a private network here (MTU 1404) and Vodafone (MTU 1500) doing the
>> >>> same thing with the network pushed MTU values causes no issues and
>> >>> everything works fine. I tested moving away from 1500 on Vodafone to
>> >>> lower values and the issues reoccurred, same as with ATT.
>> >>>
>> >>> Libqmi 1.24 and I have seen this behaviour with all modems tested so
>> >>> far: MC7455, EM7565, EM7511 and Telit LM960A18.
>> >>>
>> >>> Regards,
>> >>>
>> >>> --
>> >>> Paul
>> >>>
>> >>
>>

