[PATCH 09/15] net: hbl_en: add habanalabs Ethernet driver

Sun Jun 23 06:22:31 UTC 2024

On 6/19/24 19:13, Andrew Lunn wrote:
> On Wed, Jun 19, 2024 at 07:16:20AM +0000, Omer Shpigelman wrote:
>> On 6/18/24 17:19, Andrew Lunn wrote:
>>>>>> +static u32 hbl_en_get_mtu(struct hbl_aux_dev *aux_dev, u32 port_idx)
>>>>>> +{
>>>>>> +     struct hbl_en_port *port = HBL_EN_PORT(aux_dev, port_idx);
>>>>>> +     struct net_device *ndev = port->ndev;
>>>>>> +     u32 mtu;
>>>>>> +
>>>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>>>> +             netdev_err(ndev, "port is in reset, can't get MTU\n");
>>>>>> +             return 0;
>>>>>> +     }
>>>>>> +
>>>>>> +     mtu = ndev->mtu;
>>>>>
>>>>> I think you need a better error message. All this does is access
>>>>> ndev->mtu. What does it matter if the port is in reset? You don't
>>>>> access it.
>>>>>
>>>>
>>>> This function is called from the CN driver to get the current MTU in order
>>>> to configure it to the HW, for exmaple when configuring an IB QP. The MTU
>>>> value might be changed by user while we execute this function.
>>>
>>> Change of MTU will happen while holding RTNL. Why not simply hold RTNL
>>> while programming the hardware? That is the normal pattern for MAC
>>> drivers.
>>>
>>
>> I can hold the RTNL lock while configuring the HW but it seems like a big
>> overhead. Configuring the HW might take some time due to QP draining or
>> cache invalidation.
> 
> How often does the MTU change? Once, maybe twice on boot, and never
> again? MTU change is not hot path. For slow path code, KISS is much
> better, so it is likely to be correct. 
> 

Yeah, it's not a hot path so I guess we can just return the MTU value
regardless of a parallel reset flow.

>> To me it seems unnecessary but if that's the common way then I'll change
>> it.
>>  
>>>>>> +static int hbl_en_change_mtu(struct net_device *netdev, int new_mtu)
>>>>>> +{
>>>>>> +     struct hbl_en_port *port = hbl_netdev_priv(netdev);
>>>>>> +     int rc = 0;
>>>>>> +
>>>>>> +     if (atomic_cmpxchg(&port->in_reset, 0, 1)) {
>>>>>> +             netdev_err(netdev, "port is in reset, can't change MTU\n");
>>>>>> +             return -EBUSY;
>>>>>> +     }
>>>>>> +
>>>>>> +     if (netif_running(port->ndev)) {
>>>>>> +             hbl_en_port_close(port);
>>>>>> +
>>>>>> +             /* Sleep in order to let obsolete events to be dropped before re-opening the port */
>>>>>> +             msleep(20);
>>>>>> +
>>>>>> +             netdev->mtu = new_mtu;
>>>>>> +
>>>>>> +             rc = hbl_en_port_open(port);
>>>>>> +             if (rc)
>>>>>> +                     netdev_err(netdev, "Failed to reinit port for MTU change, rc %d\n", rc);
>>>>>
>>>>> Does that mean the port is FUBAR?
>>>>>
>>>>> Most operations like this are expected to roll back to the previous
>>>>> working configuration on failure. So if changing the MTU requires new
>>>>> buffers in your ring, you should first allocate the new buffers, then
>>>>> free the old buffers, so that if allocation fails, you still have
>>>>> buffers, and the device can continue operating.
>>>>>
>>>>
>>>> A failure in opening a port is a fatal error. It shouldn't happen. This is
>>>> not something we wish to recover from.
>>>
>>> What could cause open to fail? Is memory allocated?
>>>
>>
>> Memory is allocated but it is freed in case of a failure.
>> Port opening can fail due to other reasons as well like some HW timeout
>> while configuring the ETH QP.
> 
> If the hardware timeout because the hardware is dead, there is nothing
> you can do about it. Its dead.
> 

In our case the HW might timeout without being dead. Our ETH and RDMA QPs
are being configured in the same path in the HW so it is possible that a
timeout for ETH QP configuration will occur due to many parallel RDMA QPs
configurations and so a simple ETH QP configuration retry will solve it.

> But what about when the system is under memory pressure? You say it
> allocates memory. What happens if those allocations fail. Does
> changing the MTU take me from a working system to a dead system? It is
> good practice to not kill a working system under situations like
> memory pressure. You try to first allocate the memory you need to
> handle the new MTU, and only if successful do you free existing memory
> you no longer need. That means if you cannot allocate the needed
> memory, you still have the old memory, you can keep the old MTU and
> return -ENOMEM, and the system keeps running.
> 

That's a good optimization for these kind of on-the-fly configurations but
as you wrote before, changing an MTU value is not a hot path so out of
cost-benefit considerations we didn't find it mandatory to optimize this
flow.
But let me check this option for the next patch set version.

>> I didn't check that prior to my submit. Regarding this "no new module
>> parameters allowed" rule, is that documented anywhere?
> 
> Lots of emails that fly passed on the mailing list. Maybe once every
> couple of months when a vendor tries to mainline a new driver without
> reading the mailing list for a few months to know how mainline
> actually works. I _guess_ Davem has been pushing back on module
> parameters for 10 years? Maybe more.
> 
>

Ok, I'll just drop it in the next patch set version.

>> if not, is that the
>> common practice? not to try to do something that was not done recently?
>> how "recently" is defined?
>> I just want to clarify this because it's hard to handle these submissions
>> when we write some code based on existing examples but then we are
>> rejected because "we don't do that here anymore".
>> I want to avoid future cases of this mismatch.
> 
> My suggestion would be to spend 30 minutes every day reading patches
> and review comment on the mailing list. Avoid making the same mistakes
> others make, especially newbies to mainline, and see what others are
> doing in the same niche as this device. 30 minutes might seem like a
> lot, but how much time did you waste implementing polling mode, now
> you are going to throw it away?
>

I get your point but still it will be good if it would be documented
somewhere IMHO.

>>>>>> +                     ethtool_link_ksettings_add_link_mode(cmd, lp_advertising, Autoneg);
>>>>>
>>>>> That looks odd. Care to explain?
>>>>>
>>>>
>>>> The HW of all of our ports supports autoneg.
>>>> But in addition, the ports are divided to two groups:
>>>> internal: ports which are connected to other Gaudi2 ports in the same server.
>>>> external: ports which are connected to an external switch.
>>>> Only internal ports use autoneg.
>>>> The ports mask which sets each port as internal/external is fetched from
>>>> the FW on device load.
>>>
>>> That is not what i meant. lc_advertising should indicate the link
>>> modes the peer is advertising. If this was a copper link, it typically
>>> would contain 10BaseT-Half, 10BaseT-Full, 100BaseT-Half,
>>> 100BaseT-Full, 1000BaseT-Half. Setting the Autoneg bit is pointless,
>>> since the peer must be advertising in order that lp_advertising has a
>>> value!
>>>
>>
>> Sorry, but I don't get this. The problem is the setting of the Autoneg bit
>> in lp_advertising? is that redundant? I see that other vendors set it too
>> in case that Autoneg was completed.
> 
> 
> $ ethtool eth0
> Settings for eth0:
> 	Supported ports: [ TP	 MII ]
> 	Supported link modes:   10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	                        1000baseT/Full
> 
> This is `supported`. The hardware can do these link modes.
> 
> 	Supported pause frame use: Symmetric Receive-only
> 	Supports auto-negotiation: Yes
> 
> It also support symmetric pause, and can do autoneg.
> 
> 	Supported FEC modes: Not reported
> 	Advertised link modes:  10baseT/Half 10baseT/Full
> 	                        100baseT/Half 100baseT/Full
> 	                        1000baseT/Full
> 	Advertised pause frame use: Symmetric Receive-only
> 	Advertised auto-negotiation: Yes
> 	Advertised FEC modes: Not reported
> 
> This is `advertising`, and is what this device is advertising to the
> link partner. By default you copy supported into advertising, but the
> user can use ethtool -s advertise N, where N is a list of link modes,
> to change what is advertised to the link partner.
> 
> 	Link partner advertised link modes:  10baseT/Half 10baseT/Full
> 	                                     100baseT/Half 100baseT/Full
> 	                                     1000baseT/Full
> 	Link partner advertised pause frame use: Symmetric
> 	Link partner advertised auto-negotiation: Yes
> 	Link partner advertised FEC modes: Not reported
> 
> This is `lp_advertising`, what the link partner is advertising to this
> device. Once you have this, you mask lp_advertising with advertising,
> and generally pick the link mode with the highest bandwidth:
> 
> 	Speed: 1000Mb/s
> 	Duplex: Full
> 
> So autoneg resolved to 1000baseT/Full
> 
> 	Andrew

I'm familiar with this logic but I don't understand your point. The point
you are making is that setting this Autoneg bit in lp_advertising is
pointless? I see other vendors setting it too in case that autoneg was
completed.
Is that redundant also in their case? because it looks to me that in this
case we followed the same logic and conventions other vendors followed.