[systemd-devel] nftables support for nspawn/networkd

Florian Westphal fw at strlen.de
Fri Jun 19 14:14:48 UTC 2020


Hello.

I have been working on an nftables backend as an alternative (or
replacement?) for the libiptc one.

A PoC is here:
https://git.breakpoint.cc/cgit/fw/systemd.git/log/?h=nft_08

Diffstat:

 basic/linux/netfilter/nf_tables.h        | 1801 +++++++++++++++++++++++++++++++
 basic/linux/netfilter/nfnetlink.h        |   81 +
 libsystemd/meson.build                   |    1 
 libsystemd/sd-netlink/netlink-internal.h |    1 
 libsystemd/sd-netlink/netlink-socket.c   |   26 
 libsystemd/sd-netlink/netlink-types.c    |  234 ++++
 libsystemd/sd-netlink/netlink-types.h    |   16 
 libsystemd/sd-netlink/nfnl-message.c     |  309 +++++
 libsystemd/sd-netlink/sd-netlink.c       |   25 
 network/networkd-address.c               |    4 
 nspawn/nspawn-expose-ports.c             |    6 
 shared/firewall-util-nft.c               |  746 ++++++++++++
 shared/firewall-util.c                   |   23 
 shared/firewall-util.h                   |   22 
 shared/meson.build                       |    2 
 systemd/sd-netlink.h                     |   25 
 test/test-firewall-util.c                |   24 
 17 files changed, 3296 insertions(+), 50 deletions(-)

Most of this comes from the import of nf_tables.h (cached header of
kernel uapi) and the nfnetlink backend, i.e. this doesn't add an external
library dependency.

At this time, the prototype disables the existing libiptc backend and unconditionally
uses the nft one. I did this for simplicity.
This also means that the existing API (fw_add_...) is mostly the same.
I say *mostly* because that API exposes more functionality (on the iptables side)
than is actually used, such as in/output interface names where all callers
pass NULL.

To simplify the prototype I modified the API to drop the 'always NULL' arguments
to focus on what is actually used.

The idea is to create a static ruleset that is added once, either when the
first rule is added or by a new 'init NAT facility' function.

The prototype is complete enough to run the test-firewall-util.
The following ruleset will be created:

    table ip io.systemd.nat {
            set masq_saddr {
                    type ipv4_addr
            }
            map map_port_ipport {
                    type inet_proto . inet_service : ipv4_addr . inet_service
            }
            chain prerouting {
                    type nat hook prerouting priority filter + 1; policy accept;
                    fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
            }
            chain postrouting {
                    type nat hook postrouting priority filter + 1; policy accept;
                    ip saddr @masq_saddr masquerade
            }
    }

After that, subsequent fw_add_masquerade/fw_add_local_dnat calls only add or
delete an element in masq_saddr or a mapping in map_port_ipport, respectively.
The ruleset itself never changes.
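
The split between a fixed ruleset and changing elements can be modelled in
a few lines. The following is a toy Python model, not the prototype's code:
the class and method names are made up for illustration, and only the set
and map names are taken from the ruleset above. It mimics what the two rules
do on lookup, and shows that adding or removing an entry never touches the
rules themselves.

```python
from dataclasses import dataclass, field


@dataclass
class NatTable:
    """Toy model of 'table ip io.systemd.nat' (hypothetical, illustration only)."""

    # set masq_saddr { type ipv4_addr }
    masq_saddr: set = field(default_factory=set)
    # map map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service }
    map_port_ipport: dict = field(default_factory=dict)

    def add_masquerade(self, saddr):
        self.masq_saddr.add(saddr)

    def del_masquerade(self, saddr):
        self.masq_saddr.discard(saddr)

    def add_local_dnat(self, proto, dport, to_addr, to_port):
        # Overwrites silently; how the kernel treats duplicate keys
        # is not modelled here.
        self.map_port_ipport[(proto, dport)] = (to_addr, to_port)

    def del_local_dnat(self, proto, dport):
        self.map_port_ipport.pop((proto, dport), None)

    def prerouting(self, proto, dport, daddr_is_local):
        """Models: fib daddr type local dnat ... map @map_port_ipport."""
        if not daddr_is_local:
            return None
        return self.map_port_ipport.get((proto, dport))

    def postrouting(self, saddr):
        """Models: ip saddr @masq_saddr masquerade."""
        return saddr in self.masq_saddr
```

With this model, add_local_dnat("tcp", 4711, "1.2.3.4", 815) makes
prerouting("tcp", 4711, True) return ("1.2.3.4", 815), and deleting the
mapping makes the same lookup return None again.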

Running test-firewall-util with this backend gives the following output
in an 'nft monitor' running in parallel:

    $ nft monitor
    add table ip io.systemd.nat
    add chain ip io.systemd.nat prerouting { type nat hook prerouting priority filter + 1; policy accept; }
    add chain ip io.systemd.nat postrouting { type nat hook postrouting priority filter + 1; policy accept; }
    add set ip io.systemd.nat masq_saddr { type ipv4_addr; }
    add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
    add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
    add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
    add element ip io.systemd.nat masq_saddr { 10.1.2.0 }
    add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
    delete element ip io.systemd.nat masq_saddr { 10.1.2.0 }
    add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
    delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
    add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
    delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
    CTRL-C

So, good enough for a prototype and to send it out to get feedback.
It's still incomplete:

1. No output chain is added; one is needed to complete local dnat
   support (the sd_nfnl_message_new_dnat_rule_out function doesn't
   work yet).

2. No IPv6 support, but this is rather easy; the current nfnetlink
   backend should be complete enough for this.

3. No cleanup on restart, i.e. on startup the table should be
   deleted when it exists, rather than re-adding the pre/postrouting
   rules.

4. 'set masq_saddr' should use ranges, so we can do masquerade for
   e.g. 10.2.3.4-10.2.3.4 or 10.2.3.0/24 instead of only 10.2.3.4/32.
   This should not be too hard to add.

5. This currently replaces the libiptc backend.
   Alternatives are a compile-time or run-time switch.

6. No monitoring support.  Theoretically libsystemd could subscribe
   to the nftables netlink notification interface to e.g. learn when a
   user has flushed a set/removed a rule etc.
   I'm currently not sure this is needed due to the usual 'and what do
   we do now' problem.
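
For item 4, the interval that a prefix covers is simple to compute. This is
a minimal sketch using Python's standard ipaddress module; the function
names are hypothetical and nothing here is taken from the prototype. It only
illustrates how a prefix such as 10.2.3.0/24 maps to a first/last address
pair, which is what a range-based set element would describe.

```python
import ipaddress


def prefix_to_range(prefix):
    """Return (first, last) address covered by an IPv4 prefix, as strings."""
    # strict=False tolerates host bits, e.g. "10.2.3.4/24".
    net = ipaddress.ip_network(prefix, strict=False)
    return str(net.network_address), str(net.broadcast_address)


def in_range(addr, first, last):
    """True if addr lies within the inclusive range first..last."""
    a = int(ipaddress.ip_address(addr))
    lo = int(ipaddress.ip_address(first))
    hi = int(ipaddress.ip_address(last))
    return lo <= a <= hi
```

Here prefix_to_range("10.2.3.0/24") yields ("10.2.3.0", "10.2.3.255"), and
a /32 such as "10.2.3.4/32" yields the same address twice.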

Would this be deemed acceptable for merging into systemd once the
first four points are fixed/implemented?

As for retaining the libiptc backend -- I would propose to defer that
decision.  I would first test this on stock 4.14-ish kernels to see
what works and what is problematic.

If you want to retain the libiptc backend in any case: Do you have suggestions
on how to toggle this? Would a configure switch be enough?
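
To make the run-time variant concrete, one possible shape is "probe the
backends in a fixed order and use the first one that works". The following is
only a sketch of that idea in Python; the function, the backend names used as
keys, and the probe interface are all assumptions for illustration, not
existing systemd API.

```python
def select_backend(probes, order=("nftables", "iptables")):
    """Return the name of the first backend whose probe succeeds.

    probes maps a backend name to a zero-argument callable that returns
    True when that backend is usable (hypothetical interface).
    """
    for name in order:
        probe = probes.get(name)
        if probe is not None and probe():
            return name
    return None
```

A compile-time switch would correspond to leaving a backend out of the
probes mapping entirely.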

Thanks,
Florian

