MTU issues.

Daniele Palmas dnlplm at gmail.com
Thu Feb 6 12:15:08 UTC 2020


Hi Bjørn and Paul,

On Thu, 6 Feb 2020 at 12:36, Paul Gildea <gildeap at tcd.ie> wrote:
>
> Hi,
>
> Thanks for that Bjørn, helped me understand! Dug deeper into this and found the issue and a solution.
> Packets are being rejected in the ring buffer used by the xHCI controller.
>
> Enabled traces for usbnet.c and xhci-ring.c file in debugfs as follows:
>
> mount -t debugfs none /sys/kernel/debug
> echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
> echo -n 'file xhci-ring.c +p' > /sys/kernel/debug/dynamic_debug/control
>
> Saw a lot of errors indicating a URB with len=0 when len=1430 was expected. This was preceded by a URB of length 1024 bytes:
>
> xhci_hcd 0000:00:14.0: Babble error for slot 10 ep 12 on endpoint
> xhci_hcd 0000:00:14.0: Giveback URB ffff8802193be540, len = 1024, expected = 1430, status = -75
> xhci_hcd 0000:00:14.0: Transfer error for slot 10 ep 12 on endpoint
> /repeated
>
> Error -75 is EOVERFLOW, which makes sense as per include/uapi/asm-generic/errno.h
>
> #define EOVERFLOW 75 /* Value too large for defined data type */
>
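
For reference, the status in the xhci_hcd trace can be decoded mechanically. The sketch below is illustrative only: the regular expression is an assumption fitted to the single "Giveback URB" line quoted above, not a stable kernel log format, and the errno names come from Python's errno module, so the number-to-name mapping is the one of the platform the script runs on (Linux is assumed here).

```python
import errno
import re

# Pattern fitted to the one "Giveback URB" line quoted above (an assumption,
# not a documented log format).
GIVEBACK = re.compile(
    r"Giveback URB (?P<urb>[0-9a-f]+), len = (?P<len>\d+), "
    r"expected = (?P<expected>\d+), status = (?P<status>-?\d+)"
)


def decode_giveback(line):
    """Return the fields of a 'Giveback URB' line, with the status named."""
    match = GIVEBACK.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "urb": match.group("urb"),
        "len": int(match.group("len")),
        "expected": int(match.group("expected")),
        "status": status,
        # Kernel status codes are negated errno values.
        "name": errno.errorcode.get(-status, "unknown"),
    }


line = ("xhci_hcd 0000:00:14.0: Giveback URB ffff8802193be540, "
        "len = 1024, expected = 1430, status = -75")
print(decode_giveback(line))
```

The same helper can be run over a whole dmesg capture to count how many URBs were given back short and with which status.
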
> "Babble error": occurs when a packet larger than the MTU arrives in Linux from the modem and is discarded with an -EOVERFLOW error.
> This is seen on both USB 3.0 and USB 2.0 buses. It happens because the MRU (Maximum Receive Unit) is not a separate entity from the MTU (Maximum Transmission Unit), yet received packets can be larger than those transmitted.
>
> "Endless input error": following the babble error, we see an endless supply of zero-length URBs, which are rejected with -EPROTO (increasing the rx input error counter each time).
> This is only seen on USB 3.0. These continue to arrive ad infinitum until the modem is shut down.
>
> There appears to be a bug in the core USB handling code in Linux that doesn't deal well with network MTUs smaller than 1500 bytes. By default the dev->hard_mtu (the "real" MTU) is in lockstep with dev->rx_urb_size (essentially an MRU), and it's the latter that is causing trouble. This has nothing to do with the modems; the issue can be reproduced by getting a USB-Ethernet dongle, setting the MTU to 1430, and pinging with size greater than 1406.
>
> Will submit the below patch that solves the issue, if that is acceptable?
>
>
> +diff -Naur linux-4.14.73/drivers/net/usb/qmi_wwan.c linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c
> +--- linux-4.14.73/drivers/net/usb/qmi_wwan.c 2018-09-29 11:06:07.000000000 +0100
> ++++ linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c 2020-01-31 18:05:07.709008785 +0000
> +@@ -740,6 +740,14 @@
> + }
> + dev->net->netdev_ops = &qmi_wwan_netdev_ops;
> + dev->net->sysfs_groups[0] = &qmi_wwan_sysfs_attr_group;
> ++
> ++ /* LTE Networks don't always respect their own MTU on receive side;
> ++ * e.g. AT&T pushes 1430 MTU but still allows 1500 byte packets from
> ++ * far-end network. Make receive buffer large enough to accommodate
> ++ * them, and add four bytes so MTU does not equal MRU on network
> ++ * with 1500 MTU otherwise usbnet_change_mtu() will change both.
> ++ * This is a sufficient max receive buffer as over 1500 MTU,
> ++ * USB driver issues are not seen.
> ++ */
> ++ dev->rx_urb_size = ETH_DATA_LEN + 4;
> + err:
> + return status;
> + }
>
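
The lockstep between hard_mtu and rx_urb_size, and the reason the patch adds four bytes, can be illustrated with a toy model. This is a sketch under stated assumptions, not kernel code: UsbnetModel is a hypothetical class, hard_header_len is set to 0 for simplicity, and change_mtu() assumes that usbnet_change_mtu() only moves rx_urb_size while it still equals the previous hard_mtu, which is what the patch comment implies. Any other checks the real function performs are left out.

```python
from dataclasses import dataclass

ETH_DATA_LEN = 1500  # same value as the kernel constant used in the patch


@dataclass
class UsbnetModel:
    """Toy stand-in for the few struct usbnet fields discussed here."""
    hard_header_len: int = 0  # link-layer header bytes; 0 assumed for illustration
    mtu: int = ETH_DATA_LEN
    hard_mtu: int = 0
    rx_urb_size: int = 0

    def __post_init__(self):
        # Default state: the receive buffer size equals hard_mtu (lockstep).
        self.hard_mtu = self.mtu + self.hard_header_len
        self.rx_urb_size = self.hard_mtu

    def change_mtu(self, new_mtu):
        """Simplified model of how an MTU change affects the two sizes."""
        old_hard_mtu = self.hard_mtu
        self.mtu = new_mtu
        self.hard_mtu = new_mtu + self.hard_header_len
        # The receive size only follows while it still equals the old hard_mtu.
        if self.rx_urb_size == old_hard_mtu:
            self.rx_urb_size = self.hard_mtu

    def fits(self, packet_len):
        """True if a received packet of packet_len bytes fits the rx buffer."""
        return packet_len <= self.rx_urb_size


stock = UsbnetModel()
stock.change_mtu(1430)
print(stock.rx_urb_size, stock.fits(1500))      # 1430 False

patched = UsbnetModel()
patched.rx_urb_size = ETH_DATA_LEN + 4          # what the patch sets
patched.change_mtu(1430)
print(patched.rx_urb_size, patched.fits(1500))  # 1504 True
```

In this model a receive size of exactly ETH_DATA_LEN would still equal hard_mtu on a 1500 MTU network and would shrink with the next MTU change, which is the case the extra four bytes are meant to avoid.
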

Could it make sense to make rx_urb_size configurable from userspace
(e.g. through a sysfs file)?

This would also be useful when changing the downlink maximum packet size
with QMI_WDA_SET_DATA_FORMAT, and it is required to get maximum
throughput from high-category modems.

Regards,
Daniele

>
>
> Regards,
>
> --
> Paul
>
> On Mon, 27 Jan 2020 at 18:04, Bjørn Mork <bjorn at mork.no> wrote:
>>
>> I cannot explain the issues you are having, but I can try to explain the
>> reasons behind the RX_KILL stuff since I happened to write that.
>>
>> One of my modems would sometimes end up in a state where it flooded the
>> host with 0 length frames.  This happened at a high enough rate to bring
>> the host to a complete halt, since usbnet was desperately trying to
>> allocate new max sized skbs at the same rate...  So I figured it was
>> better to let the host have a break now and then when the incoming error
>> rate was above some arbitrary threshold.  This allowed the host to e.g.
>> reset the modem to bring it out of the faulty state.
>>
>> The underlying issue was of course a modem firmware bug, and there
>> wasn't much we could do about that. I wonder if that might be the case
>> for you too?  What do the frames triggering these rx errors look like?
>> Maybe you could snoop on the USB bus to see if there is something
>> obviously wrong?
>>
>>
>>
>> Bjørn
>>
>> Paul Gildea <gildeap at tcd.ie> writes:
>>
>> > Hi.
>> >
>> > I was mistaken about this ever happening with a 1500 MTU; it does not.
>> > Looking into this further it seems to be an issue with the usbnet driver
>> > that qmi_wwan is leveraging.
>> > Turning on debugfs and enabling debugging for usbnet.c shows the errors:
>> >
>> > mount -t debugfs none /sys/kernel/debug
>> > echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
>> >
>> > Looking at the log, you can see that the rxqlen counter rapidly increases
>> > until the queue is full and a throttle occurs. This throttle error then
>> > repeats forever (quite rapidly) until the modem is reset, even if the
>> > ping is stopped:
>> >
>> > Sep 23 10:16:16 T6 kernel: [  715.615456] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
>> > Sep 23 10:16:16 T6 kernel: [  715.743152] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
>> > Sep 23 10:16:16 T6 kernel: [  715.743307] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
>> > Sep 23 10:16:16 T6 kernel: [  715.743320] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
>> > Sep 23 10:16:16 T6 kernel: [  715.743331] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
>> > Sep 23 10:16:16 T6 kernel: [  715.744947] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
>> > Sep 23 10:16:16 T6 kernel: [  715.871077] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
>> > Sep 23 10:16:16 T6 kernel: [  715.871115] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
>> > Sep 23 10:16:16 T6 kernel: [  715.871128] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
>> > Sep 23 10:16:16 T6 kernel: [  715.871138] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
>> > Sep 23 10:16:16 T6 kernel: [  715.874448] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
>> > Sep 23 10:16:17 T6 kernel: [  716.007133] qmi_wwan 2-2:1.8 wwan0: rxqlen 280 --> 290
>> > Sep 23 10:16:17 T6 kernel: [  716.007172] qmi_wwan 2-2:1.8 wwan0: rxqlen 290 --> 300
>> > Sep 23 10:16:17 T6 kernel: [  716.007185] qmi_wwan 2-2:1.8 wwan0: rxqlen 300 --> 310
>> > Sep 23 10:16:17 T6 kernel: [  716.007197] qmi_wwan 2-2:1.8 wwan0: rxqlen 310 --> 318
>> >
>> >
>> > This happens on USB3 and not USB2 (unless I pick a small MTU like 500).
>> > With USB2 I generally don't get any of the above errors and can then go
>> > back to pinging with regular sized packets no problem.
>> > The above error, -71, refers to:
>> >
>> > #define EPROTO 71 /* Protocol error */
>> >
>> > The rxqlen message comes from usbnet_bh(), which is described as "tasklet
>> > (work deferred from completions, in_irq) or timer" in the source code.
>> > It is too large to paste, but it looks like the cleanup function that
>> > releases the socket buffers (skb) and USB Request Blocks (urb). An
>> > interesting snippet of code from this function, after releasing the buffers:
>> >
>> >       /* restart RX again after disabling due to high error rate */
>> >       clear_bit(EVENT_RX_KILL, &dev->flags);
>> >
>> > Buffers are coming in with -EPROTO at a high rate so the usbnet driver
>> > disables RX. The buffers get released, which explains the packet loss, and
>> > Rx is restarted. Obviously whatever issue was underlying is still present
>> > and so Rx is disabled again and the dance restarts.
>> >
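
The disable/restart cycle described above can be illustrated with a small model. It is only a sketch: RxErrorThrottle is a hypothetical class, and the window of 30 completions and the limit of 20 errors are assumptions chosen for illustration (the real threshold is described earlier in this thread only as arbitrary). Resetting the counters on restart is likewise a simplification.

```python
class RxErrorThrottle:
    """Toy model of an error-rate check that pauses RX, then restarts it."""

    def __init__(self, window=30, max_errors=20):
        self.window = window          # completions per counting window (assumed)
        self.max_errors = max_errors  # errors tolerated per window (assumed)
        self.pkt_cnt = 0
        self.pkt_err = 0
        self.rx_killed = False
        self.kills = 0

    def complete(self, status):
        """Account for one URB completion; status < 0 means an error."""
        self.pkt_cnt += 1
        if self.pkt_cnt > self.window:
            # Window over: start counting afresh.
            self.pkt_cnt = 0
            self.pkt_err = 0
            return
        if status < 0:
            self.pkt_err += 1
        if self.pkt_err > self.max_errors and not self.rx_killed:
            self.rx_killed = True
            self.kills += 1

    def bottom_half(self):
        """Deferred cleanup: buffers released, RX restarted with fresh counters."""
        self.rx_killed = False
        self.pkt_cnt = 0
        self.pkt_err = 0


throttle = RxErrorThrottle()
for _ in range(210):            # a steady stream of -EPROTO (-71) completions
    throttle.complete(-71)
    if throttle.rx_killed:
        throttle.bottom_half()
print(throttle.kills)
```

As long as every completion carries an error, the model disables RX again shortly after each restart, which matches the repeating pattern in the log; a stream of successful completions never trips it.
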
>> > There are a few scenarios in which it can occur, but the two easiest to
>> > reproduce are:
>> >
>> >    1. Send pings larger than the non-1500 MTU; the errors occur after
>> >    ~300 seconds.
>> >    2. Set a non-1500 MTU and reset the modem without sending anything; the
>> >    errors occur immediately afterwards.
>> >
>> >
>> > Still trying to figure it out,
>> >
>> > --
>> > Paul
>> >
>> > On Fri, 24 Jan 2020 at 17:26, Paul Gildea <gildeap at tcd.ie> wrote:
>> >
>> >> Hi guys, have done some more testing with this. Using Ubuntu 16.04 I
>> >> installed libqmi (1.16) to test that also, on a fresh system (a laptop).
>> >> Set the MTU of the private network to 1430 and pinged with large packets
>> >> (2000) and after a couple of hundred pings this behaviour repeated itself.
>> >> When I thought that the system was fine yesterday, it turns out it just
>> >> took more pings for it to fall over. At 1500 MTU and 300 pings the same
>> >> thing occurred with the private network.
>> >> Massive amounts of input errors, modem becomes unusable until reset.
>> >>
>> >> I can see pings arriving at the back end, it's when they are returning to
>> >> the modem that there appears to be an issue.
>> >> This is happening with SW and Telit with varying modems and multiple
>> >> networks.
>> >> It seems to me to be a qmi_wwan driver issue, what do you think?
>> >>
>> >> Regards
>> >>
>> >> --
>> >> Paul
>> >>
>> >> On Thu, 23 Jan 2020 at 17:15, Paul Gildea <gildeap at tcd.ie> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> We are seeing MTU issues with the ATT network and were wondering if
>> >>> you had heard of anything like this happening before. ATT push an MTU of
>> >>> 1430; we apply that to our Linux interface and everything works fine.
>> >>> However, when we ping with a large packet, the pings fail instead
>> >>> of fragmenting correctly, and we also see input errors on the Linux interface.
>> >>>
>> >>> In some scenarios the input errors never stop (getting thousands of them
>> >>> every second) after we stop pinging and the modem needs to be rebooted to
>> >>> pass any traffic again. Changed MTU values to a lot of different things and
>> >>> this always occurs.
>> >>>
>> >>> On a private network here (MTU 1404) and Vodafone (MTU 1500) doing the
>> >>> same thing with the network-pushed MTU values causes no issues and
>> >>> everything works fine. I tested moving away from 1500 on Vodafone to lower
>> >>> values and the issues reoccurred, same as with ATT.
>> >>>
>> >>> This is with libqmi 1.24, and I have seen this behaviour with all modems tested so far:
>> >>> MC7455, EM7565, EM7511 and Telit LM960A18.
>> >>>
>> >>> Regards,
>> >>>
>> >>> --
>> >>> Paul
>> >>>
>> >>
>> > _______________________________________________
>> > libqmi-devel mailing list
>> > libqmi-devel at lists.freedesktop.org
>> > https://lists.freedesktop.org/mailman/listinfo/libqmi-devel
>

