[systemd-devel] Request for Feedback on Design Issue with Systemd and "Consistent Network Device Naming"

Simon Foley simon at simonfoley.net
Wed Apr 21 09:13:41 UTC 2021


Hi all,

     I wonder if you can help. I'm trying to find a contact in systemd 
dev who has been involved in the "Consistent Network Device Naming" 
initiative.

As an HPC compute architect I was surprised, when testing, to come 
across some changes in RHEL8 that seem to originate from systemd work.

While I applaud the initiative, I think there has been some fundamental 
oversight of real-world use cases for network device management.

Rather than create a more *consistent* OS environment for applications, 
the implementation will, in the real world, make the environment 
fundamentally more confusing and divergent for users. More importantly, 
for commercial businesses there will be a dollar impact on managing the 
changes in the data center, and it will push people to invalidate 
commercial support by disabling the feature via the kernel bootstrap 
argument net.ifnames=0.
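
For reference, disabling the feature typically means something along 
these lines (a sketch; the grub.cfg path shown is the BIOS default on 
RHEL and differs on EFI installs):

    # /etc/default/grub -- append the argument to the existing kernel command line
    GRUB_CMDLINE_LINUX="... net.ifnames=0"

    # regenerate the bootloader configuration
    grub2-mkconfig -o /boot/grub2/grub.cfg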


### PROBLEM ###

The issue is the deprecation of support for the HWADDR= argument in the 
ifcfg files (RHEL; other distros are available).

This feature is used in the real world to move device names around 
physical NIC cards and ports *in order to create a more consistent 
environment* for application users on multi-homed servers. In HPC, one 
of the challenges we face is that our server farms are depreciated over 
3-5 years, and during that time capacity expansions mean we don't have 
100% consistent hardware, especially when it comes to the NIC 
implementation. Dedicated on-board NICs, discrete PCIe NIC cards, and 
server FlexLOM/riser cards, and their firmware, are constantly changing 
with each version iteration. This means that the systemd project can 
never control server hardware manufacturers':

1. PCIe implementations and lane allocations to specific slots on the 
motherboard.
2. Decisions on the number of "on-board" chipset NICs (typically 1GbE 
RJ45, though I'm sure we will soon see 10GbE SFP+ becoming the norm).
3. Default "FlexLOM/riser" cards (which can vary from 2 to 4 x 1GbE 
ports or 1 to 2 x 10GbE SFP+ ports).
4. The number of ports (RJ45 and SFP+) NIC manufacturers put on their 
cards by default as their models iterate.
5. Firmware changes on NIC cards that can affect the order in which 
ports are initialized on the PCIe bus for each CPU.
6. The OEM relationships server manufacturers have with NIC makers, 
where on-board and FlexLOM NIC chipsets (Broadcom, Realtek, Intel, etc.) 
change regularly with each base revision.
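
To make the dependence on slot, path and firmware concrete, this is the 
kind of per-revision inspection we end up doing (a sketch; "eno1" is a 
placeholder interface name, and the exact properties printed depend on 
the hardware):

    # show the candidate "predictable" names udev derives for one port
    # (prints the ID_NET_NAME_ONBOARD / _SLOT / _PATH / _MAC properties)
    udevadm test-builtin net_id /sys/class/net/eno1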

Now in HPC one of the biggest challenges we face is to maximize 
performance on the increasing number of compute cores we get per socket, 
to maximize efficiency, and to lower latency.
In order to do this, a common approach (see the attached diagrams for 
use cases) is to separate data flows into an ingress and egress paradigm.
We can then use multi-homed servers with discrete, high-performance PCIe 
NICs exploiting full-bandwidth 16-lane links going directly into a 
processor.
Dual-socket servers then allow us to split the compute data flows into 
reader and writer threads and dedicate a processor, DDR RAM banks, and a 
NIC card to each thread type.
Typically the sweet spot is a dual-socket white-box server where HPC 
designers in the OS space target interfaces for functional roles:

Processor 0 ->  PCIe Slot 1 (full 16 lanes) => Ingress Threads.
Processor 1 ->  PCIe Slot 4 (full 16 lanes) => Egress Threads.

Now, because of all the issues listed (1-6), we can *never* guarantee 
which interface device name Linux will allocate to these key NIC ports.
And yet we want to create a consistent environment in which the 
application team knows which processor and interface they need to pin 
their processes to.
They need to know this in order to minimize NUMA memory latency and 
irrelevant NIC interrupts.
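
For illustration, this is the sort of check and pinning we rely on today 
(a sketch; the interface name, node number, and "egress_writer" binary 
are placeholders):

    # which NUMA node the NIC hangs off (-1 means no affinity reported)
    cat /sys/class/net/enp65s0f0/device/numa_node

    # bind a writer process's CPUs and memory to that node
    numactl --cpunodebind=1 --membind=1 ./egress_writer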

The way HPC architects try to help sysadmins and application teams with 
this is through post-build modifications.
Here we can use the HWADDR= variable in the ifcfg-[device name] files to 
move a *specific* device name to these targeted NIC cards and ports.
This way application teams can always associate a *specific* device name 
with a specific functional purpose (Feed, Backbone, Access) and know 
where to tie their reader and writer threads.
We can also standardize that a given interface is always the "default 
route" interface for a specific server blueprint.

It would appear that in RHEL8, due to systemd, HWADDR= is no longer 
supported, and we have lost this fundamentally important feature.
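
The kind of post-build mapping we are losing looked something like this 
(a sketch; the device name and MAC address are placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-feed0
    DEVICE=feed0
    HWADDR=aa:bb:cc:dd:ee:01
    TYPE=Ethernet
    BOOTPROTO=none
    ONBOOT=yes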

### REQUIREMENT ###

Sysadmins and HPC designers need a supported way to swap / move 
kernel-allocated device names around the physical NIC cards and ports to 
create consistent compute environments.
The HWADDR= solution was a rather brutal but effective way of achieving 
this, but it would appear that it is no longer supported under systemd.
A better solution would be support for users to define unique device 
names for NIC card interfaces so they can be more explicit in their 
naming conventions,


e.g.

Ethernet:       enofeed1.0, enofeed1.1, enoback1.0, enoback1.1, 
enoaccess1.0, enoaccess1.1
Infiniband:     ibofeed1.0, ibofeed1.1, iboback1.0, iboback1.1, 
iboaccess1.0, iboaccess1.1
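
For what it's worth, a per-port mapping of that kind might be 
expressible via systemd's .link files (systemd.link(5)); whether that 
mechanism is intended to cover this use case is part of what I'd like to 
understand. A sketch, with a placeholder MAC address and file name:

    # /etc/systemd/network/70-enofeed1.link
    [Match]
    MACAddress=aa:bb:cc:dd:ee:01

    [Link]
    Name=enofeed1.0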

### THE FUTURE ###
The industry is moving compute *closer to the network*: NIC cards now 
integrate FPGAs, DDR memory banks, GPUs, and many-core processors on the 
PCB attached to the PCIe slot. The Linux kernel needs to enable 
sysadmins and HPC architects to create consistent compute environments 
across heterogeneous server hardware.

Who can I discuss these design issues with in the systemd space?

Yours Sincerely
Axel
-------------- next part --------------
Attachments (non-text, scrubbed by the list archive):
ConsistantNaming1.png (PNG, 119872 bytes): <https://lists.freedesktop.org/archives/systemd-devel/attachments/20210421/1f9538a0/attachment-0005.png>
ConsistantNaming2.png (PNG, 100862 bytes): <https://lists.freedesktop.org/archives/systemd-devel/attachments/20210421/1f9538a0/attachment-0004.png>
ConsistantNaming3.png (PNG, 105149 bytes): <https://lists.freedesktop.org/archives/systemd-devel/attachments/20210421/1f9538a0/attachment-0003.png>

