[systemd-devel] Resolving systemd naming problems on multi-port PCI cards
Jordan Hargrave
jharg93 at gmail.com
Thu Apr 7 23:42:37 UTC 2016
On Thu, Apr 7, 2016 at 11:48 AM, Kay Sievers <kay at vrfy.org> wrote:
> On Thu, Apr 7, 2016 at 6:08 PM, Jordan Hargrave <jharg93 at gmail.com> wrote:
>> The current systemd naming scheme for network cards has a problem
>> correctly naming multi-port NIC devices in a PCI slot.
>>
>> Systemd currently generates names of the form:
>>
>> enpAsBfCdD
>>
>> pA = PCI bus number
>> sB = PCI device number (confusingly called 'SLOT')
>
> Geographical addressing uses sometimes slot sometimes device. The
> kernel uses "slot"
> https://github.com/torvalds/linux/blob/master/arch/x86/pci/early.c
>
>> fC = PCI function number
>> [dD = NIC device port (sysfs dev_port)]
>>
>> eg. enp5s0f0 for a NIC at 05:00.0, dev_port = 0
>>
>> These names already aren't necessarily persistent if PCI bus topology
>> changes (Bus number changes due to adding cards across reboot, etc).
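A minimal Python sketch of how the enpAsBfCdD form above is put together, for illustration only. The helper names `parse_bdf` and `path_name` are made up here and are not systemd functions. It always emits the fC part (as in the enp5s0f0 example) and appends dD only for a non-zero dev_port, which is a simplification.

```python
def parse_bdf(address):
    """Split an lspci-style 'bb:dd.f' address (hex fields) into integers."""
    bus, rest = address.split(":")
    device, function = rest.split(".")
    return int(bus, 16), int(device, 16), int(function, 16)


def path_name(bus, device, function, dev_port=0):
    """Build an enpAsBfCdD-style name; bus and device are printed in decimal."""
    name = f"enp{bus}s{device}f{function}"
    if dev_port:  # the dD suffix is optional and only added for non-zero ports
        name += f"d{dev_port}"
    return name


print(path_name(*parse_bdf("05:00.0")))  # NIC at 05:00.0, dev_port = 0
print(path_name(*parse_bdf("44:00.1")))  # bus 0x44 shows up as decimal 68
```

Note that the bus number is hex in lspci output but decimal in the generated name.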
>
> Sure, geographical addressing is not expected to cover hardware
> reconfiguration or firmwares which just do "random" renumbering at
> reboot time.
>
>> --or--
>> ensBfCdD
>>
>> sB = _SUN slot
>> fC = PCI function number
>> [dD = NIC device port (sysfs dev_port)]
>>
>> eg. ens2f0d1 for a single-port NIC at 0?:00.0 in PCI slot 2, dev_port = 1
>>
>> The problem is the 2nd naming scheme cannot handle multi-port NICs.
>> Multi-port NICs often have one or more bridges before the PCI slot
>> number itself.
>>
>> eg. for my quad-port Intel NIC in PCI slot 2 the devices are actually:
>> 44:00.0
>> 44:00.1
>> 45:00.0
>> 45:00.1
>>
>> Using the 2nd naming scheme, the names generated are:
>> ens2f0
>> ens2f1
>> ens2f0
>> ens2f1
>>
>> Oops. Problem. There is a name collision. So depending on who gets
>> initialized first I'll see either:
>>
>> ens2f0
>> ens2f1
>> enp69s0f0
>> enp69s0f1
>>
>> or
>> enp68s0f0
>> enp68s0f1
>> ens2f0
>> ens2f1
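The collision can be reproduced with a few lines of Python. This is only a sketch of the ensBfCdD rule as described above; `slot_name` is a hypothetical helper, and slot 2 plus the four port addresses are taken from the quad-port example.

```python
from collections import Counter


def slot_name(slot, function, dev_port=0):
    """Build an ensBfCdD-style name from the slot number and PCI function."""
    name = f"ens{slot}f{function}"
    if dev_port:
        name += f"d{dev_port}"
    return name


# The four ports of the quad-port NIC in PCI slot 2.
ports = ["44:00.0", "44:00.1", "45:00.0", "45:00.1"]

# The bus number is not part of the name, so only the function survives.
names = [slot_name(2, int(addr.split(".")[1], 16)) for addr in ports]
collisions = {name: count for name, count in Counter(names).items() if count > 1}

print(names)
print(collisions)
```

Because the bus number behind the bridge is dropped, the two pairs of functions end up with identical names.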
>
> How does /sys/bus/pci/slots/ look in that case?
>
There are three entries:
/sys/bus/pci/slots/PCI1 : address = 0000:41:00.0
/sys/bus/pci/slots/PCI2 : address = 0000:42:00.0
/sys/bus/pci/slots/PCI3 : address = 0000:04:00.0
Normally systemd won't discover "PCI2" for my multi-port card, because it
only compares the NIC's own PCI address against the address file of each
slot under /sys/bus/pci/slots/. So it checks 0000:44:00.0, 0000:44:00.1,
etc., and none of those match a slot address. A single-port NIC sitting
directly in a PCI slot would match.
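Roughly, the lookup behaves like the Python sketch below. This is not the systemd code; `read_slots`, `strip_function` and `find_slot` are names invented for illustration, and comparing on domain:bus:device while ignoring the function number is an assumption of this sketch.

```python
from pathlib import Path


def read_slots(root="/sys/bus/pci/slots"):
    """Return {slot name: address string} from sysfs; empty if unavailable."""
    slots = {}
    base = Path(root)
    if base.is_dir():
        for entry in base.iterdir():
            addr_file = entry / "address"
            if addr_file.is_file():
                slots[entry.name] = addr_file.read_text().strip()
    return slots


def strip_function(address):
    """Drop a trailing '.f' function number, if there is one."""
    return address.rsplit(".", 1)[0] if "." in address else address


def find_slot(device_address, slots):
    """Return the slot whose address matches the device itself, else None."""
    want = strip_function(device_address)
    for name, address in slots.items():
        if strip_function(address) == want:
            return name
    return None


# Slot addresses as listed above for my system.
slots = {"PCI1": "0000:41:00.0", "PCI2": "0000:42:00.0", "PCI3": "0000:04:00.0"}

print(find_slot("0000:04:00.0", slots))  # device sits directly in a slot
print(find_slot("0000:44:00.0", slots))  # behind bridges: no direct match
```

On a real system `read_slots()` could supply the dictionary instead of the hard-coded values.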
Here's the device tree of the devices that all live under 0000:42:00.0:
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0,PCI2
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0,PCI2
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0/0000:44:00.0,PCI2 (NIC Port 1)
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:02.0/0000:44:00.1,PCI2 (NIC Port 2)
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0/0000:45:00.0,PCI2 (NIC Port 3)
/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0/0000:43:04.0/0000:45:00.1,PCI2 (NIC Port 4)
I changed systemd to also search the parent devices for a match, but
that causes the naming conflict: all 4 ports now match PCI2, and two
pairs of them have the same device and function numbers.
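The parent search amounts to something like the following Python sketch. It is not the actual patch; `find_slot_in_ancestors` is a hypothetical name, and it assumes the slot address strings compare exactly against sysfs path components, as in the listing above.

```python
def find_slot_in_ancestors(sysfs_path, slots):
    """Walk from the device up toward the root; return the first matching slot."""
    by_address = {address: name for name, address in slots.items()}
    for component in reversed(sysfs_path.strip("/").split("/")):
        if component in by_address:
            return by_address[component]
    return None


slots = {"PCI1": "0000:41:00.0", "PCI2": "0000:42:00.0", "PCI3": "0000:04:00.0"}

base = "/sys/devices/pci0000:40/0000:40:03.0/0000:42:00.0"
ports = [
    base + "/0000:43:02.0/0000:44:00.0",
    base + "/0000:43:02.0/0000:44:00.1",
    base + "/0000:43:04.0/0000:45:00.0",
    base + "/0000:43:04.0/0000:45:00.1",
]

results = []
for path in ports:
    slot = find_slot_in_ancestors(path, slots)
    dev_fn = path.rsplit(":", 1)[1]  # device.function of the NIC, e.g. "00.1"
    results.append((slot, dev_fn))
    print(slot, dev_fn)
```

All four ports resolve to the same slot, but only two distinct (slot, device.function) pairs come out, which is where the duplicate names come from.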
> When is the PCI hotplug driver loaded? Before or after the network card driver?
>
Slot files are created at PCI device enumeration, so before the network
driver loads.
>> There is a way to fix this by combining the two naming schemes, with a
>> bit of a hack.
>>
>> enpAsBfCdD
>>
>> pA = PCI bus # (no change)
>> sB = _SUN slot # (no change)
>> fC = This is what changes. Instead of C = function number (0..7) it is
>> the combined Device:Function index, device * 8 + function (device 0..31)
>> dD = Device port (no change)
>>
>> On my system this generates new names:
>> enp4s0 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f0
>> enp4s0d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f0d1
>> enp4s0f1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f1
>> enp4s0f1d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f1d1
>> enp4s0f2 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f2
>> enp4s0f2d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f2d1
>> enp4s0f3 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f3
>> enp4s0f3d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f3d1
>> enp4s0f4 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f4
>> enp4s0f4d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f4d1
>> enp4s0f5 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f5
>> enp4s0f5d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f5d1
>> enp4s0f6 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f6
>> enp4s0f6d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f6d1
>> enp4s0f7 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f7
>> enp4s0f7d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f7d1
>> enp4s1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f8 (Device 1:0 => Function 8)
>> enp4s1d1 at /sys/devices/pci0000:00/0000:00:03.0 SLOT 3 => enp3s4f8d1 (Device 1:0 => Function 8)
>>
>> enp68s0f0 at /sys/devices/pci0000:40/0000:40:03.0 SLOT 2 => enp68s2f0
>> enp68s0f1 at /sys/devices/pci0000:40/0000:40:03.0 SLOT 2 => enp68s2f1
>> enp69s0f0 at /sys/devices/pci0000:40/0000:40:03.0 SLOT 2 => enp69s2f0
>> enp69s0f1 at /sys/devices/pci0000:40/0000:40:03.0 SLOT 2 => enp69s2f1
>>
>> This way it is always able to determine the physical PCI slot the device is in.
>>
>> This scheme still does have a limitation... the names may not be
>> persistent if PCI topology changes due to the PCI bus number still
>> being part of the name.
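As a rough Python sketch of the combined scheme: the function field becomes device * 8 + function, which is what "Device 1:0 => Function 8" works out to. `combined_name` is a hypothetical helper for illustration, not part of the patch.

```python
def combined_name(bus, slot, device, function, dev_port=0):
    """Name with decimal bus, _SUN slot, and a merged device:function index."""
    index = device * 8 + function  # device 1, function 0 -> 8
    name = f"enp{bus}s{slot}f{index}"
    if dev_port:
        name += f"d{dev_port}"
    return name


# The quad-port NIC in slot 2: (bus, device, function) for each port.
quad_port = [(0x44, 0, 0), (0x44, 0, 1), (0x45, 0, 0), (0x45, 0, 1)]
quad_names = [combined_name(bus, 2, dev, fn) for bus, dev, fn in quad_port]
print(quad_names)
```

Keeping the bus number in the name is what makes the four ports unique again, at the cost of the persistence limitation mentioned above.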
>
> I don't think the two should be mixed. The point of the hotplug slots
> was to be independent of the geography.
>
> If what you describe can't be fixed, the slot numbering scheme should
> just be turned off by default.
There needs to be a slot numbering scheme that works. Systemd's naming
is fine for a laptop or desktop, but for data centers with thousands of
servers, all of which may have slots filled with multi-port NICs (a fully
populated PowerEdge R930 can have 32 NICs, and that's before enabling
SR-IOV), admins need to know which NIC is in which slot and port. The
fact that systemd uses decimal numbers for the PCI bus (instead of hex
like lspci) makes identifying which port is on which NIC nearly
impossible.
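To illustrate the decimal/hex mismatch, here is a small Python sketch that maps an existing enpAsBfC name back to the address lspci would show. `name_to_lspci` is a made-up helper, and it only handles the current scheme where sB is the PCI device number.

```python
import re

# enp<bus>s<device>[f<function>][d<dev_port>], all numbers in decimal
NAME_RE = re.compile(r"^enp(\d+)s(\d+)(?:f(\d+))?(?:d(\d+))?$")


def name_to_lspci(name):
    """Convert an enpAsBfC name to an lspci-style bb:dd.f address (hex)."""
    match = NAME_RE.match(name)
    if match is None:
        return None  # not a path-based name, e.g. a slot-based ens* name
    bus = int(match.group(1))
    device = int(match.group(2))
    function = int(match.group(3) or 0)  # a missing fC is treated as function 0
    return f"{bus:02x}:{device:02x}.{function:x}"


for name in ("enp68s0f0", "enp69s0f1", "enp5s0f0"):
    print(name, "->", name_to_lspci(name))
```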
>
> Kay