[systemd-devel] kdbus vs. pipe based ipc performance

Stefan Westerfeld stefan at space.twc.de
Tue Apr 15 08:38:04 PDT 2014


   Hi!

On Tue, Mar 04, 2014 at 05:36:16AM +0100, Kay Sievers wrote:
> On Tue, Mar 4, 2014 at 5:00 AM, Kay Sievers <kay at vrfy.org> wrote:
> > On Mon, Mar 3, 2014 at 11:06 PM, Kay Sievers <kay at vrfy.org> wrote:
> >> On Mon, Mar 3, 2014 at 10:35 PM, Stefan Westerfeld <stefan at space.twc.de> wrote:
> >>> First of all: I'd really like to see kdbus being used as a general purpose IPC
> >>> layer; so that developers working on client-/server software will no longer
> >>> need to create their own homemade IPC by using primitives like sockets or
> >>> similar.
> >>>
> >>> Now kdbus is advertised as high performance IPC solution, and compared to the
> >>> traditional dbus approach, this may well be true. But are the numbers that
> >>>
> >>> $ test-bus-kernel-benchmark chart
> >>>
> >>> produces impressive? Or to put it in another way: will developers working on
> >>> client-/server software happily accept kdbus, because it performs as good as a
> >>> homemade IPC solution would? Or does kdbus add overhead to a degree that some
> >>> applications can't accept?
> >>>
> >>> To answer this, I wrote a program called "ibench" which passes messages between
> >>> a client and a server, but instead of using kdbus to do it, it uses traditional
> >>> pipes. To simulate main loop integration, it uses poll() in cases where a normal
> >>> client or server application would go into the main loop, and wait to be woken
> >>> up by filedescriptor activity.
> >>>
> >>> Now here are the results I obtained using
> >>>
> >>> - AMD Phenom(tm) 9850 Quad-Core Processor
> >>> - running Fedora 20 64-bit with systemd+kdbus from git
> >>> - system booted with kdbus and single kernel arguments
> >>>
> >>> ============================================================================
> >>> *** single cpu performance:                                      .
> >>>
> >>>    SIZE    COPY   MEMFD KDBUS-MAX  IBENCH  SPEEDUP
> >>>
> >>>       1   32580   16390     32580  192007  5.89
> >>>       2   40870   16960     40870  191730  4.69
> >>>       4   40750   16870     40750  190938  4.69
> >>>       8   40930   16950     40930  191234  4.67
> >>>      16   40290   17150     40290  192041  4.77
> >>>      32   40220   18050     40220  191963  4.77
> >>>      64   40280   16930     40280  192183  4.77
> >>>     128   40530   17440     40530  191649  4.73
> >>>     256   40610   17610     40610  190405  4.69
> >>>     512   40770   16690     40770  188671  4.63
> >>>    1024   40670   17840     40670  185819  4.57
> >>>    2048   40510   17780     40510  181050  4.47
> >>>    4096   39610   17330     39610  154303  3.90
> >>>    8192   38000   16540     38000  121710  3.20
> >>>   16384   35900   15050     35900   80921  2.25
> >>>   32768   31300   13020     31300   54062  1.73
> >>>   65536   24300    9940     24300   27574  1.13
> >>>  131072   16730    6820     16730   14886  0.89
> >>>  262144    4420    4080      4420    6888  1.56
> >>>  524288    1660    2040      2040    2781  1.36
> >>> 1048576     800     950       950    1231  1.30
> >>> 2097152     310     490       490     475  0.97
> >>> 4194304     150     240       240     227  0.95
> >>>
> >>> *** dual cpu performance:                                      .
> >>>
> >>>    SIZE    COPY   MEMFD KDBUS-MAX  IBENCH  SPEEDUP
> >>>
> >>>       1   31680   14000     31680  104664  3.30
> >>>       2   34960   14290     34960  104926  3.00
> >>>       4   34930   14050     34930  104659  3.00
> >>>       8   24610   13300     24610  104058  4.23
> >>>      16   33840   14740     33840  103800  3.07
> >>>      32   33880   14400     33880  103917  3.07
> >>>      64   34180   14220     34180  103349  3.02
> >>>     128   34540   14260     34540  102622  2.97
> >>>     256   37820   14240     37820  102076  2.70
> >>>     512   37570   14270     37570   99105  2.64
> >>>    1024   37570   14780     37570   96010  2.56
> >>>    2048   21640   13330     21640   89602  4.14
> >>>    4096   23430   13120     23430   73682  3.14
> >>>    8192   34350   12300     34350   59827  1.74
> >>>   16384   25180   10560     25180   43808  1.74
> >>>   32768   20210    9700     20210   21112  1.04
> >>>   65536   15440    7820     15440   10771  0.70
> >>>  131072   11630    5670     11630    5775  0.50
> >>>  262144    4080    3730      4080    3012  0.74
> >>>  524288    1830    2040      2040    1421  0.70
> >>> 1048576     810     950       950     631  0.66
> >>> 2097152     310     490       490     269  0.55
> >>> 4194304     150     240       240     133  0.55
> >>> ============================================================================
> >>>
> >>> I ran the tests twice - once using the same cpu for client and server (via cpu
> >>> affinity) and once using a different cpu for client and server.
> >>>
> >>> The SIZE, COPY and MEMFD column are produced by "test-bus-kernel-benchmark
> >>> chart", the KDBUS-MAX column is the maximum of the COPY and MEMFD column. So
> >>> this is the effective number of roundtrips that kdbus is able to do at that
> >>> SIZE. The IBENCH column is the effective number of roundtrips that ibench can
> >>> do at that SIZE.
> >>>
> >>> For many relevant cases, ibench outperforms kdbus (a lot). The SPEEDUP factor
> >>> indicates how much faster ibench is than kdbus. For small to medium array
> >>> sizes, ibench always wins (sometimes a lot). For instance passing a 4Kb array
> >>> from client to server and returning back, ibench is 3.90 times faster if client
> >>> and server live on the same cpu, and 3.14 times faster if client and server
> >>> live on different cpus.
> >>>
> >>> I'm bringing this up now because it would be sad if kdbus became part of the
> >>> kernel and universally available, but application developers would still build
> >>> their own protocols for performance reasons. And some things that may need to
> >>> be changed to make kdbus run as fast as ibench may be backward incompatible at
> >>> some level so it may be better to do it now than later on.
> >>>
> >>> The program "ibench" I wrote to provide a performance comparison for the
> >>> "test-bus-kernel-benchmark" program can be downloaded at
> >>>
> >>>   http://space.twc.de/~stefan/download/ibench.c
> >>>
> >>> As a final note, ibench also supports using a socketpair() for communication
> >>> between client and server via #define at top, but pipe() communication was
> >>> faster in my test setup.
> >>
> >> Pipes are not interesting for general purpose D-Bus IPC; with a pipe
> >> the memory can "move" from one client to the other, because it is no
> >> longer needed in the process that fills the pipe.

Well, since you fill a pipe using write() system calls, the memory is still
available in the calling process, so it doesn't really "move".
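
To make this concrete, here is a stripped-down sketch of the pattern ibench
uses (a simplified reconstruction for illustration, not the actual ibench.c):
two pipes, the child poll()s as a stand-in for main loop integration, then
read()s the request and echoes it back, and the buffer handed to write()
stays perfectly usable in the sender:

  /* Minimal pipe ping-pong sketch (not the real ibench.c). */
  #include <poll.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      int to_child[2], to_parent[2];
      char buf[4096];

      if (pipe(to_child) < 0 || pipe(to_parent) < 0)
          return 1;

      if (fork() == 0) {                        /* child: echo server */
          struct pollfd pfd = { .fd = to_child[0], .events = POLLIN };
          poll(&pfd, 1, -1);                    /* "main loop" wakeup */
          ssize_t n = read(to_child[0], buf, sizeof(buf));
          if (n > 0)
              write(to_parent[1], buf, n);      /* echo the payload back */
          return 0;
      }

      memset(buf, 'x', sizeof(buf));            /* payload, still owned by us */
      write(to_child[1], buf, sizeof(buf));     /* buf stays valid after this */

      struct pollfd pfd = { .fd = to_parent[0], .events = POLLIN };
      poll(&pfd, 1, -1);
      ssize_t n = read(to_parent[0], buf, sizeof(buf));
      printf("round trip of %zd bytes done\n", n);
      return 0;
  }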

> >> Pipes are a model out-of-focus for kdbus; using pipes where pipes are
> >> the appropriate IPC mechanism is just fine, there is no competition,
> >> and being 5 times slower than simple pipes is a very good number for
> >> kdbus.
> >>
> >> Kdbus is a low-level implementation for D-Bus, not much else, it will
> >> not try to cover all sorts of specialized IPC use cases.

No, but a pipe-based IPC system like ibench provides a good performance
baseline to compare kdbus against. It's probably acceptable for kdbus to be
somewhat slower, since it also provides a nicer API for developers, among
other benefits.

Certainly I am not proposing to replace every pipe on your system with kdbus.

> > There is also a benchmark in the kdbus repo:
> >   ./test/test-kdbus-benchmark
> >
> > It is probably better to compare that, as it does not include any of
> > the higher-level D-Bus overhead from the userspace library, it
> > operates on the raw kernel kdbus interface and is quite a lot faster
> > than the test in the systemd repo.
>
> Fixed 8k message sizes in all three tools, with a concurrent CPU setup
> produces on an Intel i7 2.90GHz:
>   ibench: 55.036 - 128.807 transactions/sec
>   test-kdbus-benchmark: 73.356 - 82.654 transactions/sec
>   test-bus-kernel-benchmark: 23.290 - 27.580 transactions/sec
>
> The test-kdbus-benchmark runs the full-featured kdbus, including
> reliability/integrity checks, header parsing, user accounting,
> priority queue handling, message/connection metadata handling.

Right. So what I measured is a combination of a userspace library for sending
dbus requests *and* the kernel module. It took me a while to figure out how to
do it, but I wrote a benchmark called "kdbench", based on test-kdbus-benchmark
to test the kernel module itself. It's available here:

  http://space.twc.de/~stefan/download/kdbench-0.1.tar.xz

The results look a lot better than what systemd's test-bus-kernel-benchmark
produces. Here they are (the test setup is the same one used to produce the
original table).

*** single cpu performance:

   SIZE    COPY   MEMFD KDBUS-MAX  IBENCH  SPEEDUP

      1  136586   54022    136586  185287  1.36
      2  137723   54039    137723  186406  1.35
      4  138347   54257    138347  184141  1.33
      8  137564   54012    137564  183644  1.33
     16  137424   54133    137424  183816  1.34
     32  137510   63808    137510  184159  1.34
     64  138155   54403    138155  182732  1.32
    128  137818   54174    137818  183198  1.33
    256  136872   54152    136872  182393  1.33
    512  137784   54266    137784  179457  1.30
   1024  136326   54101    136326  177112  1.30
   2048  126027   65934    126027  173540  1.38
   4096  122404   53199    122404  154953  1.27
   8192  118646   50730    118646  123660  1.04
  16384   98507   44760     98507   88211  0.90
  32768   72168   47317     72168   53251  0.74
  65536   43220   32331     43220   27061  0.63
 131072   25855   20306     25855   14760  0.57
 262144   11263   12051     12051    6669  0.55
 524288    4483    5622      5622    2986  0.53
1048576    1883    2719      2719    1247  0.46
2097152     597    1216      1216     475  0.39
4194304     291     460       460     231  0.50

*** dual cpu performance:

   SIZE    COPY   MEMFD KDBUS-MAX  IBENCH  SPEEDUP

      1   64775   33961     64775  101528  1.57
      2   47513   31938     47513  101656  2.14
      4   46265   31082     46265  100368  2.17
      8   49131   39762     49131  100166  2.04
     16   66666   45065     66666  100304  1.50
     32   66748   45002     66748  100856  1.51
     64   66329   34360     66329   99619  1.50
    128   47916   31799     47916  100186  2.09
    256   48949   44929     48949   99465  2.03
    512   65490   42430     65490   97803  1.49
   1024   47619   31568     47619   95415  2.00
   2048   47217   37775     47217   89143  1.89
   4096   62892   34186     62892   74852  1.19
   8192   50473   28641     50473   60856  1.21
  16384   47867   31493     47867   32303  0.67
  32768   36101   22638     36101   21008  0.58
  65536   22703   18689     22703   10820  0.48
 131072   16001   13902     16001    5829  0.36
 262144    9780    9486      9780    2986  0.31
 524288    4164    5141      5141    1404  0.27
1048576    1927    2687      2687     640  0.24
2097152     593    1226      1226     282  0.23
4194304     286     461       461     139  0.30

As you can see, the SPEEDUP column looks a lot better than above. Starting at
a payload size of 16384 bytes, kdbus even outperforms ibench's pipe() approach.

The changes in "kdbench" compared to test-kdbus-benchmark are:

* kdbench uses two processes (instead of just one)
* kdbench sends either a memfd or a vector (the old test sends both,
  which is slower)
* kdbench uses blocking synchronous method calls
* kdbench loops over payload sizes, memfd usage and cpu affinities,
  so it just needs to be run once (like ibench)
* kdbench reuses memfds, so it doesn't need to create a memfd per
  request (which is slower); see the sketch after this list
* kdbench reuses message payload memory, so it doesn't need to malloc()
  per request
* kdbench doesn't attach a full set of metadata to every message, it just
  requests a timestamp per message
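
To illustrate the reuse pattern (memfd and payload memory set up once, then
only refilled per request) without going into the kdbus ioctl details, here
is a rough sketch. It is not kdbench code: memfd_create() merely stands in
for the memfds that kdbus itself hands out via its ioctls, and the actual
message-send ioctl is omitted.

  /* Reuse pattern only; the kdbus ioctls themselves are left out.
   * memfd_create() needs a recent kernel/glibc and is only a stand-in
   * for the kdbus-provided memfds. */
  #define _GNU_SOURCE
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define PAYLOAD_SIZE 8192
  #define REQUESTS     100000

  int main(void)
  {
      /* created once, before the benchmark loop ... */
      int memfd = memfd_create("payload", 0);
      if (memfd < 0 || ftruncate(memfd, PAYLOAD_SIZE) < 0)
          return 1;

      char *payload = mmap(NULL, PAYLOAD_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, memfd, 0);
      if (payload == MAP_FAILED)
          return 1;

      for (int i = 0; i < REQUESTS; i++) {
          /* ... and only refilled per request: no per-request memfd,
           * no per-request malloc()/free() */
          memset(payload, 'x', PAYLOAD_SIZE);

          /* here the memfd (or a plain vector over "payload") would be
           * handed to the synchronous message-send ioctl and the reply
           * would be waited for */
      }

      munmap(payload, PAYLOAD_SIZE);
      close(memfd);
      return 0;
  }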

The results show that, properly used, the kdbus module can be quite fast.

Still, the kernel module's performance will not be the only thing that
influences application performance in the future. If I understand it
correctly, the systemd dbus library is not meant to be used outside of
systemd, so other developers will write kdbus-enabled libraries.

So in the end, it won't matter much whether the systemd dbus library (the one
that test-bus-kernel-benchmark uses) is fast; it could also keep the status of
a slow reference implementation.

But to figure out what performance applications can expect, the other library
implementations of dbus on top of kdbus would need to be measured. I'd
personally like to see a good and fast C++ library for using kdbus.

> Perf output is attached for all three tools, which show that
> test-bus-kernel-benchmark needs to do a lot of other things not
> directly related to the raw memory copy performance, and it should not
> be directly compared.

Right, I'll just briefly comment below.

>   2.05%  test-bus-kernel  libc-2.19.90.so            [.] _int_malloc
>   1.84%  test-bus-kernel  libc-2.19.90.so            [.] vfprintf
>   1.64%  test-bus-kernel  test-bus-kernel-benchmark  [.] bus_message_parse_fields
>   1.51%  test-bus-kernel  libc-2.19.90.so            [.] memset
>   1.40%  test-bus-kernel  libc-2.19.90.so            [.] _int_free
>   1.17%  test-bus-kernel  [kernel.kallsyms]          [k] copy_user_enhanced_fast_string
>   1.12%  test-bus-kernel  libc-2.19.90.so            [.] malloc_consolidate
>   1.11%  test-bus-kernel  [kernel.kallsyms]          [k] mutex_lock
>   1.05%  test-bus-kernel  test-bus-kernel-benchmark  [.] bus_kernel_make_message
>   1.04%  test-bus-kernel  [kernel.kallsyms]          [k] kfree
>   0.94%  test-bus-kernel  libc-2.19.90.so            [.] free
>   0.90%  test-bus-kernel  libc-2.19.90.so            [.] __GI___strcmp_ssse3
>   0.88%  test-bus-kernel  test-bus-kernel-benchmark  [.] message_extend_fields
>   0.83%  test-bus-kernel  [kdbus]                    [k] kdbus_handle_ioctl
>   0.83%  test-bus-kernel  libc-2.19.90.so            [.] malloc
>   0.79%  test-bus-kernel  [kernel.kallsyms]          [k] mutex_unlock
>   0.76%  test-bus-kernel  test-bus-kernel-benchmark  [.] BUS_MESSAGE_IS_GVARIANT
>   0.73%  test-bus-kernel  libc-2.19.90.so            [.] __libc_calloc
>   0.72%  test-bus-kernel  libc-2.19.90.so            [.] memchr
>   0.71%  test-bus-kernel  [kdbus]                    [k] kdbus_conn_kmsg_send
>   0.67%  test-bus-kernel  test-bus-kernel-benchmark  [.] buffer_peek
>   0.65%  test-bus-kernel  [kernel.kallsyms]          [k] update_cfs_shares
>   0.58%  test-bus-kernel  [kernel.kallsyms]          [k] system_call_after_swapgs
>   0.57%  test-bus-kernel  test-bus-kernel-benchmark  [.] service_name_is_valid
>   0.56%  test-bus-kernel  test-bus-kernel-benchmark  [.] build_struct_offsets
>   0.55%  test-bus-kernel  [kdbus]                    [k] kdbus_pool_copy

What can be seen here is that much of the overhead is not caused by the
kernel. A lot of time is spent in vfprintf(), malloc() and free(), but also
in other userspace functionality. As long as this has the status of a
non-optimized reference implementation, that is okay. But if applications
that send/receive lots of kdbus messages are to be built on top of it, the
code can probably be improved.
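
As a rough illustration of the kind of improvement I mean (this is a generic
pattern, not actual code from the systemd dbus library): instead of calling
malloc() and free() for every message, a library can keep one buffer per
connection and only grow it when a larger message than before shows up.

  /* Generic buffer-reuse pattern, not actual systemd library code. */
  #include <stdlib.h>
  #include <string.h>

  struct msg_buf {
      char   *data;
      size_t  allocated;
  };

  /* Grow the per-connection buffer only when a larger message than ever
   * before arrives; in the steady state no allocation happens at all. */
  static char *msg_buf_reserve(struct msg_buf *b, size_t need)
  {
      if (need > b->allocated) {
          char *p = realloc(b->data, need);
          if (!p)
              return NULL;
          b->data = p;
          b->allocated = need;
      }
      return b->data;
  }

  int main(void)
  {
      struct msg_buf buf = { 0 };

      for (int i = 0; i < 100000; i++) {
          char *p = msg_buf_reserve(&buf, 8192);  /* no malloc/free per message */
          if (!p)
              return 1;
          memset(p, 'x', 8192);                   /* fill in the serialized message */
      }

      free(buf.data);
      return 0;
  }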

>   3.95%  test-kdbus-benc  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
>   2.25%  test-kdbus-benc  [kernel.kallsyms]      [k] clear_page_c_e
>   2.14%  test-kdbus-benc  [kernel.kallsyms]      [k] _raw_spin_lock
>   1.78%  test-kdbus-benc  [kernel.kallsyms]      [k] kfree
>   1.65%  test-kdbus-benc  [kernel.kallsyms]      [k] mutex_lock
>   1.55%  test-kdbus-benc  [kernel.kallsyms]      [k] get_page_from_freelist
>   1.46%  test-kdbus-benc  [kernel.kallsyms]      [k] page_fault
>   1.40%  test-kdbus-benc  [kernel.kallsyms]      [k] mutex_unlock
>   1.33%  test-kdbus-benc  [kernel.kallsyms]      [k] memset
>   1.27%  test-kdbus-benc  [kernel.kallsyms]      [k] shmem_getpage_gfp
>   1.16%  test-kdbus-benc  [kernel.kallsyms]      [k] find_get_page
>   1.05%  test-kdbus-benc  [kernel.kallsyms]      [k] memcpy
>   1.03%  test-kdbus-benc  [kernel.kallsyms]      [k] set_page_dirty
>   1.00%  test-kdbus-benc  [kernel.kallsyms]      [k] system_call
>   0.94%  test-kdbus-benc  [kernel.kallsyms]      [k] system_call_after_swapgs
>   0.93%  test-kdbus-benc  [kernel.kallsyms]      [k] kmem_cache_alloc
>   0.93%  test-kdbus-benc  test-kdbus-benchmark   [.] timeval_diff
>   0.90%  test-kdbus-benc  [kernel.kallsyms]      [k] page_waitqueue
>   0.86%  test-kdbus-benc  libpthread-2.19.90.so  [.] __libc_close
>   0.83%  test-kdbus-benc  [kernel.kallsyms]      [k] __call_rcu.constprop.63
>   0.81%  test-kdbus-benc  [kdbus]                [k] kdbus_pool_copy
>   0.78%  test-kdbus-benc  [kernel.kallsyms]      [k] strlen
>   0.77%  test-kdbus-benc  [kernel.kallsyms]      [k] unlock_page
>   0.77%  test-kdbus-benc  [kdbus]                [k] kdbus_meta_append
>   0.76%  test-kdbus-benc  [kernel.kallsyms]      [k] find_lock_page
>   0.71%  test-kdbus-benc  test-kdbus-benchmark   [.] handle_echo_reply
>   0.71%  test-kdbus-benc  [kernel.kallsyms]      [k] __kmalloc
>   0.67%  test-kdbus-benc  [kernel.kallsyms]      [k] unmap_single_vma
>   0.67%  test-kdbus-benc  [kernel.kallsyms]      [k] flush_tlb_mm_range
>   0.65%  test-kdbus-benc  [kernel.kallsyms]      [k] __fget_light
>   0.63%  test-kdbus-benc  [kernel.kallsyms]      [k] fput
>   0.63%  test-kdbus-benc  [kdbus]                [k] kdbus_handle_ioctl

What you can see here is that most of the work happens in the kernel, which
is a good sign. However, the way the kernel module is used still creates
extra work. The list of changes I made above shows the points that could be
changed to get even better performance for this kind of test, for instance
memfd recycling.

>  16.09%  ibench  [kernel.kallsyms]      [k] copy_user_enhanced_fast_string
>   4.76%  ibench  ibench                 [.] main
>   2.85%  ibench  [kernel.kallsyms]      [k] pipe_read
>   2.81%  ibench  [kernel.kallsyms]      [k] _raw_spin_lock_irqsave
>   2.31%  ibench  [kernel.kallsyms]      [k] update_cfs_shares
>   2.19%  ibench  [kernel.kallsyms]      [k] native_write_msr_safe
>   2.03%  ibench  [kernel.kallsyms]      [k] mutex_unlock
>   2.02%  ibench  [kernel.kallsyms]      [k] resched_task
>   1.74%  ibench  [kernel.kallsyms]      [k] __schedule
>   1.67%  ibench  [kernel.kallsyms]      [k] mutex_lock
>   1.61%  ibench  [kernel.kallsyms]      [k] do_sys_poll
>   1.57%  ibench  [kernel.kallsyms]      [k] get_page_from_freelist
>   1.46%  ibench  [kernel.kallsyms]      [k] __fget_light
>   1.34%  ibench  [kernel.kallsyms]      [k] update_rq_clock.part.83
>   1.28%  ibench  [kernel.kallsyms]      [k] fsnotify
>   1.28%  ibench  [kernel.kallsyms]      [k] enqueue_entity
>   1.28%  ibench  [kernel.kallsyms]      [k] update_curr
>   1.25%  ibench  [kernel.kallsyms]      [k] system_call
>   1.25%  ibench  [kernel.kallsyms]      [k] __list_del_entry
>   1.22%  ibench  [kernel.kallsyms]      [k] _raw_spin_lock
>   1.22%  ibench  [kernel.kallsyms]      [k] system_call_after_swapgs
>   1.21%  ibench  [kernel.kallsyms]      [k] task_waking_fair
>   1.18%  ibench  [kernel.kallsyms]      [k] poll_schedule_timeout
>   1.17%  ibench  [kernel.kallsyms]      [k] _raw_spin_unlock_irqrestore
>   1.14%  ibench  [kernel.kallsyms]      [k] __alloc_pages_nodemask
>   1.06%  ibench  [kernel.kallsyms]      [k] enqueue_task_fair
>   0.95%  ibench  [kernel.kallsyms]      [k] pipe_write
>   0.88%  ibench  [kernel.kallsyms]      [k] do_sync_read
>   0.87%  ibench  [kernel.kallsyms]      [k] dequeue_entity

Well, this really is as good as you can get with pipes I suppose. Almost every
cpu cycle is spent in the kernel.

And here, finally, is kdbench fixed at a payload size of 8192. These numbers
are not directly comparable, because you probably ran on a different system,
but they are still okay for an approximate comparison.

# ========
# arch : x86_64
# cpudesc : AMD Phenom(tm) 9850 Quad-Core Processor
# ========
#
     6.03%  kdbench  [kernel.kallsyms]  [k] copy_user_generic_string
     4.95%  kdbench  libc-2.18.so       [.] memset
     3.47%  kdbench  [unknown]          [.] 0x0000000000401498
     2.52%  kdbench  [kernel.kallsyms]  [k] kfree
     2.34%  kdbench  [kernel.kallsyms]  [k] __schedule
     2.31%  kdbench  [kernel.kallsyms]  [k] shmem_getpage_gfp
     2.30%  kdbench  [kdbus]            [k] kdbus_pool_copy
     2.10%  kdbench  [kernel.kallsyms]  [k] memset
     1.83%  kdbench  libc-2.18.so       [.] __GI___ioctl
     1.60%  kdbench  [kernel.kallsyms]  [k] __set_page_dirty_no_writeback
     1.55%  kdbench  [kernel.kallsyms]  [k] _cond_resched
     1.50%  kdbench  [kernel.kallsyms]  [k] system_call
     1.50%  kdbench  [kdbus]            [k] kdbus_conn_kmsg_send
     1.45%  kdbench  [kernel.kallsyms]  [k] __kmalloc
     1.43%  kdbench  [kernel.kallsyms]  [k] kmem_cache_alloc_trace
     1.42%  kdbench  [kdbus]            [k] kdbus_conn_queue_alloc
     1.40%  kdbench  [kernel.kallsyms]  [k] __switch_to
     1.35%  kdbench  [kernel.kallsyms]  [k] find_lock_page
     1.33%  kdbench  [kernel.kallsyms]  [k] sysret_check
     1.24%  kdbench  [kdbus]            [k] kdbus_handle_ioctl
     1.23%  kdbench  [kdbus]            [k] kdbus_kmsg_new_from_user
     1.13%  kdbench  [kernel.kallsyms]  [k] mutex_lock
     1.02%  kdbench  [kernel.kallsyms]  [k] update_curr
     1.00%  kdbench  [kernel.kallsyms]  [k] unmap_single_vma
     1.00%  kdbench  [kernel.kallsyms]  [k] find_get_page
     0.97%  kdbench  [kernel.kallsyms]  [k] page_fault
     0.93%  kdbench  [kernel.kallsyms]  [k] ktime_get_ts
     0.90%  kdbench  [kernel.kallsyms]  [k] update_cfs_shares

Besides memset(), which is used to "generate" the data to send, this one
again spends all of its cpu time in the kernel. I'm not sure whether the
kernel kdbus module could be improved upon here.

   Cu... Stefan
-- 
Stefan Westerfeld, http://space.twc.de/~stefan

