MTU issues.
Paul Gildea
gildeap at tcd.ie
Mon Feb 10 11:30:08 UTC 2020
Hi Bjørn,
I think that initially a new change_mtu function in place of
usbnet_change_mtu() makes sense: fix the MRU at 1500 and change the MTU as
desired. Without the added complication of making the code work with
usbnet_change_mtu(), there is no need for the "+4" hack.
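
To illustrate the difference, here is a minimal userspace sketch in C. It is
not kernel code and not a proposed patch. It models only the coupling
described in this thread, where rx_urb_size follows hard_mtu whenever the two
were equal. The struct model_dev, the model_* function names and the
zero-length link header are hypothetical and for illustration only; the real
fields live in struct usbnet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the few usbnet fields that matter here. */
struct model_dev {
    int mtu;             /* network-layer MTU (net->mtu in the kernel) */
    int hard_header_len; /* link-layer header size; assumed 0 in this model */
    int hard_mtu;        /* mtu + hard_header_len */
    int rx_urb_size;     /* receive buffer size, effectively the MRU */
};

/* Start with MTU and MRU in lockstep, as described in the thread. */
struct model_dev model_init(int mtu)
{
    struct model_dev dev;

    dev.mtu = mtu;
    dev.hard_header_len = 0;
    dev.hard_mtu = mtu + dev.hard_header_len;
    dev.rx_urb_size = dev.hard_mtu;
    return dev;
}

/* Coupled behaviour: if rx_urb_size was equal to the old hard_mtu,
 * it is dragged along to the new hard_mtu. */
int model_coupled_change_mtu(struct model_dev *dev, int new_mtu)
{
    assert(dev != NULL);

    int old_hard_mtu = dev->hard_mtu;

    dev->mtu = new_mtu;
    dev->hard_mtu = new_mtu + dev->hard_header_len;
    if (dev->rx_urb_size == old_hard_mtu)
        dev->rx_urb_size = dev->hard_mtu;
    return 0;
}

/* Decoupled behaviour: only the transmit side changes,
 * rx_urb_size is deliberately left untouched. */
int model_decoupled_change_mtu(struct model_dev *dev, int new_mtu)
{
    assert(dev != NULL);

    dev->mtu = new_mtu;
    dev->hard_mtu = new_mtu + dev->hard_header_len;
    return 0;
}

/* Would a received frame of frame_len bytes fit in the receive buffer? */
int model_frame_fits(const struct model_dev *dev, int frame_len)
{
    assert(dev != NULL);
    return frame_len <= dev->rx_urb_size;
}
```

Starting from a 1500 MTU and lowering it to 1430, the coupled variant shrinks
rx_urb_size to 1430, so a 1500 byte frame from the network no longer fits. The
decoupled variant leaves rx_urb_size at 1500, which is the behaviour a
separate change_mtu function would aim for.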
Regards,
--
Paul
On Fri, 7 Feb 2020 at 08:52, Bjørn Mork <bjorn at mork.no> wrote:
> Great finding!
>
> This all makes sense when you explain it like that. The proposed patch
> looks fine to me, preferably with most of your explanation as commit
> message.
>
> But maybe we should consider decoupling mru and mtu with a new
> _change_mtu() function instead of the '+ 4' hack? We don't have to use
> usbnet_change_mtu(). It was just convenient as long as it did what we
> wanted. When it doesn't, then we might as well replace it.
>
> This would also allow us to solve the issue Daniele points out wrt
> QMI_WDA_SET_DATA_FORMAT.
>
> Don't let these proposals stall a bugfix though. We can always use your
> proposed patch as a quickfix now, and then DTRT later when we figure out
> what that is.
>
>
>
> Bjørn
>
> Paul Gildea <gildeap at tcd.ie> writes:
>
> > Hi,
> >
> > Thanks for that Bjørn, helped me understand! Dug deeper into this and found
> > the issue and a solution.
> > Packets are being rejected in the ring buffer used by the xHCI controller.
> >
> > Enabled traces for usbnet.c and xhci-ring.c file in debugfs as follows:
> >
> > mount -t debugfs none /sys/kernel/debug
> > echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
> > echo -n 'file xhci-ring.c +p' > /sys/kernel/debug/dynamic_debug/control
> >
> > Saw a lot of errors indicating URB with len=0 when it was expecting
> > len=1430. This was preceded by an URB of len 1024 bytes:
> >
> > xhci_hcd 0000:00:14.0: Babble error for slot 10 ep 12 on endpoint
> > xhci_hcd 0000:00:14.0: Giveback URB ffff8802193be540, len = 1024, expected = 1430, status = -75
> > xhci_hcd 0000:00:14.0: Transfer error for slot 10 ep 12 on endpoint
> > /repeated
> >
> > Error -75 is EOVERFLOW, which makes sense as per
> > include/uapi/asm-generic/errno.h:
> >
> > #define EOVERFLOW 75 /* Value too large for defined data type */
> >
> > "Babble error": Occurs when a packet larger than the MTU arrives in Linux
> > from the modem; it is discarded with an -EOVERFLOW error.
> > This is seen on USB3.0 and USB2.0 buses. This is essentially because the
> > MRU (Maximum Receive Unit) is not a separate entity from the MTU (Maximum
> > Transmission Unit), and the received packets can be larger than those
> > transmitted.
> >
> > "Endless input error": Following the babble error we see an endless
> > supply of zero-length URBs which are rejected with -EPROTO (increasing the
> > rx input error counter each time).
> > This is only seen on USB3.0. These continue to come ad infinitum until the
> > modem is shut down.
> >
> > There appears to be a bug in the core USB handling code in Linux that
> > doesn't deal well with network MTUs smaller than 1500 bytes. By default the
> > dev->hard_mtu (the "real" MTU) is in lockstep with dev->rx_urb_size
> > (essentially an MRU), and it's the latter that is causing trouble. This has
> > nothing to do with the modems; the issue can be reproduced by getting a
> > USB-Ethernet dongle, setting the MTU to 1430, and pinging with size greater
> > than 1406.
> >
> > Will submit the below patch that solves the issue, if that is acceptable?
> >
> >
> > +diff -Naur linux-4.14.73/drivers/net/usb/qmi_wwan.c linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c
> > +--- linux-4.14.73/drivers/net/usb/qmi_wwan.c 2018-09-29 11:06:07.000000000 +0100
> > ++++ linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c 2020-01-31 18:05:07.709008785 +0000
> > +@@ -740,6 +740,14 @@
> > + }
> > + dev->net->netdev_ops = &qmi_wwan_netdev_ops;
> > + dev->net->sysfs_groups[0] = &qmi_wwan_sysfs_attr_group;
> > ++
> > ++ /* LTE Networks don't always respect their own MTU on receive side;
> > ++ * e.g. AT&T pushes 1430 MTU but still allows 1500 byte packets from
> > ++ * far-end network. Make receive buffer large enough to accommodate
> > ++ * them, and add four bytes so MTU does not equal MRU on network
> > ++ * with 1500 MTU otherwise usbnet_change_mtu() will change both.
> > ++ * This is a sufficient max receive buffer as over 1500 MTU,
> > ++ * USB driver issues are not seen.
> > ++ */
> > ++ dev->rx_urb_size = ETH_DATA_LEN + 4;
> > + err:
> > + return status;
> > + }
> >
> >
> >
> > Regards,
> >
> > --
> > Paul
> >
> > On Mon, 27 Jan 2020 at 18:04, Bjørn Mork <bjorn at mork.no> wrote:
> >
> >> I cannot explain the issues you are having, but I can try to explain the
> >> reasons behind the RX_KILL stuff since I happened to write that.
> >>
> >> One of my modems would sometimes end up in a state where it flooded the
> >> host with 0 length frames. This happened at a high enough rate to bring
> >> the host to a complete halt, since usbnet was desperately trying to
> >> allocate new max sized skbs at the same rate... So I figured it was
> >> better to let the host have a break now and then when the incoming error
> >> rate was above some arbitrary threshold. This allowed the host to e.g.
> >> reset the modem to bring it out of the faulty state.
> >>
> >> The underlying issue was of course a modem firmware bug, and there
> >> wasn't much we could do about that. I wonder if that might be the case
> >> for you too? What do the frames triggering these rx errors look like?
> >> Maybe you could snoop on the USB bus to see if there is something
> >> obviously wrong?
> >>
> >>
> >>
> >> Bjørn
> >>
> >> Paul Gildea <gildeap at tcd.ie> writes:
> >>
> >> > Hi.
> >> >
> >> > I was mistaken about this ever happening with a 1500 MTU, it does not.
> >> > Looking into this further it seems to be an issue with the usbnet driver
> >> > that qmi_wwan is leveraging.
> >> > Turning on debugfs and enabling debugging for usbnet.c shows the errors:
> >> >
> >> > mount -t debugfs none /sys/kernel/debug
> >> > echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control
> >> >
> >> > Looking at the log you can see that the rxqlen counter rapidly increases
> >> > until it is full and a throttle occurs, this throttle error then repeats
> >> > forever (quite rapidly) until the modem is reset, even if the ping is
> >> > stopped:
> >> >
> >> > Sep 23 10:16:16 T6 kernel: [ 715.615456] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> >> > Sep 23 10:16:16 T6 kernel: [ 715.743152] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
> >> > Sep 23 10:16:16 T6 kernel: [ 715.743307] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
> >> > Sep 23 10:16:16 T6 kernel: [ 715.743320] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
> >> > Sep 23 10:16:16 T6 kernel: [ 715.743331] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
> >> > Sep 23 10:16:16 T6 kernel: [ 715.744947] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> >> > Sep 23 10:16:16 T6 kernel: [ 715.871077] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291
> >> > Sep 23 10:16:16 T6 kernel: [ 715.871115] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301
> >> > Sep 23 10:16:16 T6 kernel: [ 715.871128] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311
> >> > Sep 23 10:16:16 T6 kernel: [ 715.871138] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318
> >> > Sep 23 10:16:16 T6 kernel: [ 715.874448] qmi_wwan 2-2:1.8 wwan0: rx throttle -71
> >> > Sep 23 10:16:17 T6 kernel: [ 716.007133] qmi_wwan 2-2:1.8 wwan0: rxqlen 280 --> 290
> >> > Sep 23 10:16:17 T6 kernel: [ 716.007172] qmi_wwan 2-2:1.8 wwan0: rxqlen 290 --> 300
> >> > Sep 23 10:16:17 T6 kernel: [ 716.007185] qmi_wwan 2-2:1.8 wwan0: rxqlen 300 --> 310
> >> > Sep 23 10:16:17 T6 kernel: [ 716.007197] qmi_wwan 2-2:1.8 wwan0: rxqlen 310 --> 318
> >> >
> >> >
> >> > This happens on USB3 and not USB2 (unless I pick a small MTU like 500).
> >> > With USB2 I generally don't get any of the above errors and can then go
> >> > back to pinging with regular sized packets no problem.
> >> > The above error, 71, refers to:
> >> >
> >> > #define EPROTO 71 /* Protocol error */
> >> >
> >> > The rxqlen error is coming from usbnet_bh(), which is described as
> >> > "tasklet (work deferred from completions, in_irq) or timer" in the
> >> > source code.
> >> > This is too large to paste, but looks like the cleanup function to release
> >> > the Socket Buffers (skb) and USB Request Blocks (urb). Interesting snippet
> >> > of code from this function, after releasing the buffers:
> >> >
> >> > /* restart RX again after disabling due to high error rate */
> >> > clear_bit(EVENT_RX_KILL, &dev->flags);
> >> >
> >> > Buffers are coming in with -EPROTO at a high rate so the usbnet driver
> >> > disables RX. The buffers get released, which explains the packet loss, and
> >> > Rx is restarted. Obviously whatever issue was underlying is still present
> >> > and so Rx is disabled again and the dance restarts.
> >> >
> >> > There are a few scenarios in which it can occur, but the two easiest to
> >> > reproduce are:
> >> >
> >> > 1. Send pings larger than the non-1500 MTU, and the errors occur after
> >> > ~300 seconds.
> >> > 2. Set a non-1500 MTU, reset the modem without sending anything, and
> >> > the errors occur immediately afterwards.
> >> >
> >> >
> >> > Still trying to figure it out,
> >> >
> >> > --
> >> > Paul
> >> >
> >> > On Fri, 24 Jan 2020 at 17:26, Paul Gildea <gildeap at tcd.ie> wrote:
> >> >
> >> >> Hi guys, have done some more testing with this. Using ubuntu 16.04 I
> >> >> installed libqmi (1.16) to test that also, on a fresh system (a laptop).
> >> >> Set the MTU of the private network to 1430 and pinged with large packets
> >> >> (2000) and after a couple of hundred pings this behaviour repeated itself.
> >> >> When I thought that the system was fine yesterday, it turns out it just
> >> >> took more pings for it to fall over. At 1500 MTU and 300 pings the same
> >> >> thing occurred with the private network.
> >> >> Massive amounts of input errors, modem becomes unusable until reset.
> >> >>
> >> >> I can see pings arriving at the back end, it's when they are returning to
> >> >> the modem that there appears to be an issue.
> >> >> This is happening with SW and Telit with varying modems and multiple
> >> >> networks.
> >> >> It seems to me to be a qmi_wwan driver issue, what do you think?
> >> >>
> >> >> Regards
> >> >>
> >> >> --
> >> >> Paul
> >> >>
> >> >> On Thu, 23 Jan 2020 at 17:15, Paul Gildea <gildeap at tcd.ie> wrote:
> >> >>
> >> >>> Hi,
> >> >>>
> >> >>> We are seeing MTU issues with the ATT network and were just wondering if
> >> >>> you had heard of anything like this happening before? ATT push an MTU of
> >> >>> 1430, we apply that to our linux interface and everything works fine.
> >> >>> However, when we try to ping with a large packet, the pings fail instead
> >> >>> of fragmenting correctly, and we also see input errors on the linux
> >> >>> interface.
> >> >>>
> >> >>> In some scenarios the input errors never stop (getting thousands of them
> >> >>> every second) after we stop pinging and the modem needs to be rebooted to
> >> >>> pass any traffic again. Changed MTU values to a lot of different things and
> >> >>> this always occurs.
> >> >>>
> >> >>> On a private network here (MTU 1404) and Vodafone (MTU 1500) doing the
> >> >>> same thing with the network pushed MTU values causes no issues and
> >> >>> everything works fine. I tested moving away from 1500 on Vodafone to lower
> >> >>> values and once again the issues reoccurred, same as with ATT.
> >> >>>
> >> >>> Using libqmi 1.24, I have seen this behaviour with all modems tested so far:
> >> >>> MC7455, EM7565, EM7511 and Telit LM960A18.
> >> >>>
> >> >>> Regards,
> >> >>>
> >> >>> --
> >> >>> Paul
> >> >>>
> >> >>
> >> > _______________________________________________
> >> > libqmi-devel mailing list
> >> > libqmi-devel at lists.freedesktop.org
> >> > https://lists.freedesktop.org/mailman/listinfo/libqmi-devel
> >>
>