<div dir="ltr"><div>Hi Bjørn,</div><div><br></div><div>I think a new change_mtu function in place of usbnet_change_mtu() makes sense as a first step:</div><div>fix the MRU at 1500 and let the MTU change as desired. Without the added complication of having </div><div>to make the code work with usbnet_change_mtu(), there would be no need for the "+4" hack. </div><div><br></div><div><br></div><div>Regards,</div><div><br></div><div>--</div><div>Paul</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 7 Feb 2020 at 08:52, Bjørn Mork <<a href="mailto:bjorn@mork.no">bjorn@mork.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Great finding!<br>
<br>
This all makes sense when you explain it like that. The proposed patch<br>
looks fine to me, preferably with most of your explanation as commit<br>
message.<br>
<br>
But maybe we should consider decoupling mru and mtu with a new<br>
_change_mtu() function instead of the '+ 4' hack? We don't have to use<br>
usbnet_change_mtu(). It was just convenient as long as it did what we<br>
wanted. When it doesn't, then we might as well replace it.<br>
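<br>
A rough userspace sketch of what such a decoupled function could look like, assuming the receive size stays pinned at 1500 bytes of payload and only the transmit-side fields follow the MTU. The struct, the function names and the accepted MTU range are hypothetical and for illustration only; this is not qmi_wwan code.<br>

```c
/* Userspace model only: these types and names are hypothetical stand-ins,
 * not the kernel's struct usbnet or any real qmi_wwan function. */
#define MODEL_ETH_DATA_LEN 1500  /* same value as the kernel's ETH_DATA_LEN */
#define MODEL_EINVAL 22          /* numeric value of EINVAL on Linux */

struct model_dev {
	int mtu;             /* transmit-side MTU as seen by the stack */
	int hard_header_len; /* link-layer header bytes, 0 for raw IP */
	int hard_mtu;        /* mtu + hard_header_len */
	int rx_urb_size;     /* receive buffer size, acting as the MRU */
};

/* Bring the model device up with the MRU pinned at 1500 bytes of payload. */
void model_init(struct model_dev *dev, int hard_header_len)
{
	dev->hard_header_len = hard_header_len;
	dev->mtu = MODEL_ETH_DATA_LEN;
	dev->hard_mtu = dev->mtu + hard_header_len;
	dev->rx_urb_size = MODEL_ETH_DATA_LEN + hard_header_len;
}

/* Decoupled MTU change: only the transmit-side fields move.
 * rx_urb_size is deliberately left untouched.
 * The accepted range (1..1500) is an assumption of this sketch. */
int model_change_mtu(struct model_dev *dev, int new_mtu)
{
	if (new_mtu <= 0 || new_mtu > MODEL_ETH_DATA_LEN)
		return -MODEL_EINVAL;
	dev->mtu = new_mtu;
	dev->hard_mtu = new_mtu + dev->hard_header_len;
	return 0;
}
```

In this model, setting the MTU to 1430 leaves rx_urb_size at 1500, so a full-size packet from the far end still fits in the receive buffer.<br>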
<br>
This would also allow us to solve the issue Daniele points out wrt<br>
QMI_WDA_SET_DATA_FORMAT.<br>
<br>
Don't let these proposals stall a bugfix though. We can always use your<br>
proposed patch as a quickfix now, and then DTRT later when we figure out<br>
what that is.<br>
<br>
<br>
<br>
Bjørn<br>
<br>
Paul Gildea <<a href="mailto:gildeap@tcd.ie" target="_blank">gildeap@tcd.ie</a>> writes:<br>
<br>
> Hi,<br>
><br>
> Thanks for that Bjørn, helped me understand! Dug deeper into this and found<br>
> the issue and a solution.<br>
> Packets are being rejected in the ring buffer used by the xHCI controller.<br>
><br>
> Enabled traces for usbnet.c and xhci-ring.c file in debugfs as follows:<br>
><br>
> mount -t debugfs none /sys/kernel/debug<br>
> echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control<br>
> echo -n 'file xhci-ring.c +p' > /sys/kernel/debug/dynamic_debug/control<br>
><br>
> Saw a lot of errors indicating URB with len=0 when it was expecting<br>
> len=1430. This was preceded by an URB of len 1024 bytes:<br>
><br>
> xhci_hcd 0000:00:14.0: Babble error for slot 10 ep 12 on endpoint<br>
> xhci_hcd 0000:00:14.0: Giveback URB ffff8802193be540, len = 1024, expected = 1430, status = -75<br>
> xhci_hcd 0000:00:14.0: Transfer error for slot 10 ep 12 on endpoint<br>
> /repeated<br>
><br>
> Error -75 is EOVERFLOW, which makes sense as per<br>
> *include/uapi/asm-generic/errno.h*:<br>
><br>
> #define EOVERFLOW 75 /* Value too large for defined data type */<br>
><br>
> *"Babble error"*: Occurs when a packet larger than the MTU arrives in Linux<br>
> from the modem and is discarded with an -EOVERFLOW error.<br>
> This is seen on USB3.0 and USB2.0 buses. This is essentially because the<br>
> MRU (Maximum Receive Unit) is not a separate entity from the MTU (Maximum<br>
> Transmission Unit) and the received packets can be larger than those transmitted.<br>
><br>
> *"Endless input error"*: Following the babble error we see an endless<br>
> supply of zero-length URBs which are rejected with -EPROTO (increasing the<br>
> rx input error counter each time).<br>
> This is only seen on USB3.0. These continue to come ad infinitum until the<br>
> modem is shut down.<br>
><br>
> There appears to be a bug in the core USB handling code in Linux that<br>
> doesn't deal well with network MTUs smaller than 1500 bytes. By default the<br>
> dev->hard_mtu (the "real" MTU) is in lockstep with dev->rx_urb_size<br>
> (essentially an MRU), and it's the latter that is causing trouble. This has<br>
> nothing to do with the modems; the issue can be reproduced by getting a<br>
> USB-Ethernet dongle, setting the MTU to 1430, and pinging with size greater<br>
> than 1406.<br>
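<br>
A minimal userspace model of the lockstep described above, assuming (as the comment in the patch below implies) that usbnet_change_mtu() resizes rx_urb_size only while it still equals the old hard_mtu. The struct and function names are hypothetical; this is a sketch of the coupling, not the kernel code.<br>

```c
/* Userspace model only: hypothetical stand-in for the two usbnet fields
 * discussed above, not the kernel's struct usbnet. */
struct lockstep_dev {
	int hard_header_len; /* link-layer header bytes (14 for Ethernet) */
	int hard_mtu;        /* MTU plus link-layer header */
	int rx_urb_size;     /* receive buffer size, effectively the MRU */
};

/* Assumed coupling rule: the receive buffer follows the MTU only while
 * it is still equal to the old hard_mtu. */
void lockstep_change_mtu(struct lockstep_dev *dev, int new_mtu)
{
	int old_hard_mtu = dev->hard_mtu;

	dev->hard_mtu = new_mtu + dev->hard_header_len;
	if (dev->rx_urb_size == old_hard_mtu)
		dev->rx_urb_size = dev->hard_mtu;
}

/* Nonzero if a received frame of frame_len bytes does not fit in the
 * receive buffer, the situation that shows up as -EOVERFLOW above. */
int lockstep_rx_overflows(const struct lockstep_dev *dev, int frame_len)
{
	return frame_len > dev->rx_urb_size;
}
```

Starting from hard_mtu = rx_urb_size = 1514 (an Ethernet dongle at MTU 1500) and dropping the MTU to 1430 shrinks the buffer to 1444 bytes in this model, so a full 1514-byte frame no longer fits. Starting instead with hard_mtu = 1500 and rx_urb_size = 1504 leaves the buffer alone.<br>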
><br>
> Will submit the below patch that solves the issue, if that is acceptable?<br>
><br>
><br>
> +diff -Naur linux-4.14.73/drivers/net/usb/qmi_wwan.c linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c<br>
> +--- linux-4.14.73/drivers/net/usb/qmi_wwan.c 2018-09-29 11:06:07.000000000 +0100<br>
> ++++ linux-4.14.73-rx_size_fix/drivers/net/usb/qmi_wwan.c 2020-01-31 18:05:07.709008785 +0000<br>
> +@@ -740,6 +740,14 @@<br>
> + }<br>
> + dev->net->netdev_ops = &qmi_wwan_netdev_ops;<br>
> + dev->net->sysfs_groups[0] = &qmi_wwan_sysfs_attr_group;<br>
> ++<br>
> ++ /* LTE Networks don't always respect their own MTU on receive side;<br>
> ++ * e.g. AT&T pushes 1430 MTU but still allows 1500 byte packets from<br>
> ++ * far-end network. Make receive buffer large enough to accommodate<br>
> ++ * them, and add four bytes so MTU does not equal MRU on network<br>
> ++ * with 1500 MTU otherwise usbnet_change_mtu() will change both.<br>
> ++ * This is a sufficient max receive buffer as over 1500 MTU,<br>
> ++ * USB driver issues are not seen.<br>
> ++ */<br>
> ++ dev->rx_urb_size = ETH_DATA_LEN + 4;<br>
> + err:<br>
> + return status;<br>
> + }<br>
><br>
><br>
><br>
> Regards,<br>
><br>
> --<br>
> Paul<br>
><br>
> On Mon, 27 Jan 2020 at 18:04, Bjørn Mork <<a href="mailto:bjorn@mork.no" target="_blank">bjorn@mork.no</a>> wrote:<br>
><br>
>> I cannot explain the issues you are having, but I can try to explain the<br>
>> reasons behind the RX_KILL stuff since I happened to write that.<br>
>><br>
>> One of my modems would sometimes end up in a state where it flooded the<br>
>> host with 0 length frames. This happened at a high enough rate to bring<br>
>> the host to a complete halt, since usbnet was desperately trying to<br>
>> allocate new max sized skbs at the same rate... So I figured it was<br>
>> better to let the host have a break now and then when the incoming error<br>
>> rate was above some arbitrary threshold. This allowed the host to e.g.<br>
>> reset the modem to bring it out of the faulty state.<br>
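<br>
The idea can be sketched as a small userspace model: count receive completions in a window, and pause RX once too many of them have failed. The names and both thresholds below are hypothetical, picked only to illustrate the mechanism, and should not be read as the values used in usbnet.c.<br>

```c
/* Userspace model only: names and thresholds are hypothetical, chosen to
 * illustrate the idea, and are not taken from usbnet.c. */
#define MODEL_WINDOW   30  /* completions per measurement window (assumed) */
#define MODEL_MAX_ERRS 20  /* errors tolerated within one window (assumed) */

struct rx_stats {
	int pkt_cnt;  /* completions seen in the current window */
	int pkt_err;  /* failed completions in the current window */
	int rx_kill;  /* nonzero while RX resubmission is paused */
};

/* Account for one completed receive; is_error is nonzero for a failed one. */
void rx_account(struct rx_stats *s, int is_error)
{
	if (++s->pkt_cnt > MODEL_WINDOW) {
		/* Window over: start counting afresh. */
		s->pkt_cnt = 0;
		s->pkt_err = 0;
		return;
	}
	if (is_error)
		s->pkt_err++;
	if (s->pkt_err > MODEL_MAX_ERRS)
		s->rx_kill = 1;
}

/* Deferred work releases the buffers and lets RX start again. */
void rx_bottom_half(struct rx_stats *s)
{
	s->rx_kill = 0;
}
```

If the device keeps producing errors after the restart, the pause is triggered again almost immediately, which gives the host a breather without fixing the underlying fault.<br>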
>><br>
>> The underlying issue was of course a modem firmware bug, and there<br>
>> wasn't much we could do about that. I wonder if that might be the case<br>
>> for you too? What do the frames triggering these rx errors look like?<br>
>> Maybe you could snoop on the USB bus to see if there is something<br>
>> obviously wrong?<br>
>><br>
>><br>
>><br>
>> Bjørn<br>
>><br>
>> Paul Gildea <<a href="mailto:gildeap@tcd.ie" target="_blank">gildeap@tcd.ie</a>> writes:<br>
>><br>
>> > Hi.<br>
>> ><br>
>> > I was mistaken about this ever happening with a 1500 MTU; it does not.<br>
>> > Looking into this further it seems to be an issue with the usbnet driver<br>
>> > that qmi_wwan is leveraging.<br>
>> > Turning on debugfs and enabling debugging for usbnet.c shows the errors:<br>
>> ><br>
>> > mount -t debugfs none /sys/kernel/debug<br>
>> > echo -n 'file usbnet.c +p' > /sys/kernel/debug/dynamic_debug/control<br>
>> ><br>
>> > Looking at the log you can see that the rxqlen counter rapidly increases<br>
>> > until it is full and a throttle occurs. This throttle error then repeats<br>
>> > forever (quite rapidly) until the modem is reset, even if the ping is<br>
>> > stopped:<br>
>> ><br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.615456] qmi_wwan 2-2:1.8 wwan0: rx throttle -71<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.743152] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.743307] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.743320] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.743331] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.744947] qmi_wwan 2-2:1.8 wwan0: rx throttle -71<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.871077] qmi_wwan 2-2:1.8 wwan0: rxqlen 281 --> 291<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.871115] qmi_wwan 2-2:1.8 wwan0: rxqlen 291 --> 301<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.871128] qmi_wwan 2-2:1.8 wwan0: rxqlen 301 --> 311<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.871138] qmi_wwan 2-2:1.8 wwan0: rxqlen 311 --> 318<br>
>> > Sep 23 10:16:16 T6 kernel: [ 715.874448] qmi_wwan 2-2:1.8 wwan0: rx throttle -71<br>
>> > Sep 23 10:16:17 T6 kernel: [ 716.007133] qmi_wwan 2-2:1.8 wwan0: rxqlen 280 --> 290<br>
>> > Sep 23 10:16:17 T6 kernel: [ 716.007172] qmi_wwan 2-2:1.8 wwan0: rxqlen 290 --> 300<br>
>> > Sep 23 10:16:17 T6 kernel: [ 716.007185] qmi_wwan 2-2:1.8 wwan0: rxqlen 300 --> 310<br>
>> > Sep 23 10:16:17 T6 kernel: [ 716.007197] qmi_wwan 2-2:1.8 wwan0: rxqlen 310 --> 318<br>
>> ><br>
>> ><br>
>> > This happens on USB3 and not USB2 (unless I pick a small MTU like 500).<br>
>> > With USB2 I generally don't get any of the above errors and can then go<br>
>> > back to pinging with regular sized packets no problem.<br>
>> > The error above, -71, refers to:<br>
>> ><br>
>> > #define EPROTO 71 /* Protocol error */<br>
>> ><br>
>> > The rxqlen error is coming from usbnet_bh(), which is described as<br>
>> > *“tasklet (work deferred from completions, in_irq) or timer”* in the<br>
>> > source code. This is too large to paste, but looks like the cleanup<br>
>> > function to release the Socket Buffers (skb) and USB Request Blocks<br>
>> > (urb). Interesting snippet of code from this function, after releasing<br>
>> > the buffers:<br>
>> ><br>
>> > /* restart RX again after disabling due to high error rate */<br>
>> > clear_bit(EVENT_RX_KILL, &dev->flags);<br>
>> ><br>
>> > Buffers are coming in with -EPROTO at a high rate so the usbnet driver<br>
>> > disables RX. The buffers get released, which explains the packet loss, and<br>
>> > Rx is restarted. Obviously whatever issue was underlying is still present<br>
>> > and so Rx is disabled again and the dance restarts.<br>
>> ><br>
>> > There are a few scenarios in which it can occur, but the two easiest to<br>
>> > reproduce are:<br>
>> ><br>
>> > 1. Send pings larger than the non-1500 MTU, and the errors occur after<br>
>> > ~300 seconds.<br>
>> > 2. Set a non-1500 MTU, reset the modem without sending anything, and<br>
>> > the errors occur immediately afterwards.<br>
>> ><br>
>> ><br>
>> > Still trying to figure it out,<br>
>> ><br>
>> > --<br>
>> > Paul<br>
>> ><br>
>> > On Fri, 24 Jan 2020 at 17:26, Paul Gildea <<a href="mailto:gildeap@tcd.ie" target="_blank">gildeap@tcd.ie</a>> wrote:<br>
>> ><br>
>> >> Hi guys, have done some more testing with this. Using Ubuntu 16.04 I<br>
>> >> installed libqmi (1.16) to test that also, on a fresh system (a laptop).<br>
>> >> Set the MTU of the private network to 1430 and pinged with large packets<br>
>> >> (2000) and after a couple of hundred pings this behaviour repeated itself.<br>
>> >> When I thought that the system was fine yesterday, it turns out it just<br>
>> >> took more pings for it to fall over. At 1500 MTU and 300 pings the same<br>
>> >> thing occurred with the private network.<br>
>> >> Massive amounts of input errors, modem becomes unusable until reset.<br>
>> >><br>
>> >> I can see pings arriving at the back end; it's when they are returning to<br>
>> >> the modem that there appears to be an issue.<br>
>> >> This is happening with SW and Telit with varying modems and multiple<br>
>> >> networks.<br>
>> >> It seems to me to be a qmi_wwan driver issue, what do you think?<br>
>> >><br>
>> >> Regards<br>
>> >><br>
>> >> --<br>
>> >> Paul<br>
>> >><br>
>> >> On Thu, 23 Jan 2020 at 17:15, Paul Gildea <<a href="mailto:gildeap@tcd.ie" target="_blank">gildeap@tcd.ie</a>> wrote:<br>
>> >><br>
>> >>> Hi,<br>
>> >>><br>
>> >>> We are seeing MTU issues with the ATT network and were just wondering if<br>
>> >>> you had heard of anything like this happening before? ATT push an MTU of<br>
>> >>> 1430, we apply that to our Linux interface and everything works fine.<br>
>> >>> However, when we try to ping with a large packet, the pings fail instead<br>
>> >>> of fragmenting correctly, and we also see input errors on the Linux<br>
>> >>> interface.<br>
>> >>><br>
>> >>> In some scenarios the input errors never stop (getting thousands of them<br>
>> >>> every second) after we stop pinging, and the modem needs to be rebooted to<br>
>> >>> pass any traffic again. Changed the MTU to a lot of different values and<br>
>> >>> this always occurs.<br>
>> >>><br>
>> >>> On a private network here (MTU 1404) and Vodafone (MTU 1500), doing the<br>
>> >>> same thing with the network-pushed MTU values causes no issues and<br>
>> >>> everything works fine. I tested moving away from 1500 on Vodafone to lower<br>
>> >>> values and the issues reoccurred, same as with ATT.<br>
>> >>><br>
>> >>> Using libqmi 1.24, I have seen this behaviour with all modems tested so far:<br>
>> >>> MC7455, EM7565, EM7511 and Telit LM960A18.<br>
>> >>><br>
>> >>> Regards,<br>
>> >>><br>
>> >>> --<br>
>> >>> Paul<br>
>> >>><br>
>> >><br>
>> > _______________________________________________<br>
>> > libqmi-devel mailing list<br>
>> > <a href="mailto:libqmi-devel@lists.freedesktop.org" target="_blank">libqmi-devel@lists.freedesktop.org</a><br>
>> > <a href="https://lists.freedesktop.org/mailman/listinfo/libqmi-devel" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/libqmi-devel</a><br>
>><br>
</blockquote></div></div>