[systemd-devel] kdbus vs. pipe based ipc performance
Stefan Westerfeld
stefan at space.twc.de
Tue Apr 15 08:38:04 PDT 2014
Hi!
On Tue, Mar 04, 2014 at 05:36:16AM +0100, Kay Sievers wrote:
> On Tue, Mar 4, 2014 at 5:00 AM, Kay Sievers <kay at vrfy.org> wrote:
> > On Mon, Mar 3, 2014 at 11:06 PM, Kay Sievers <kay at vrfy.org> wrote:
> >> On Mon, Mar 3, 2014 at 10:35 PM, Stefan Westerfeld <stefan at space.twc.de> wrote:
> >>> First of all: I'd really like to see kdbus being used as a general purpose IPC
> >>> layer; so that developers working on client-/server software will no longer
> >>> need to create their own homemade IPC by using primitives like sockets or
> >>> similar.
> >>>
> >>> Now kdbus is advertised as a high-performance IPC solution, and compared to the
> >>> traditional dbus approach this may well be true. But are the numbers that
> >>>
> >>> $ test-bus-kernel-benchmark chart
> >>>
> >>> produces impressive? Or to put it another way: will developers working on
> >>> client/server software happily accept kdbus because it performs as well as a
> >>> homemade IPC solution would? Or does kdbus add overhead to a degree that some
> >>> applications can't accept?
> >>>
> >>> To answer this, I wrote a program called "ibench" which passes messages between
> >>> a client and a server, but instead of using kdbus to do it, it uses traditional
> >>> pipes. To simulate main loop integration, it uses poll() in cases where a normal
> >>> client or server application would go into the main loop and wait to be woken
> >>> up by file descriptor activity.
> >>>
> >>> Now here are the results I obtained using
> >>>
> >>> - AMD Phenom(tm) 9850 Quad-Core Processor
> >>> - running Fedora 20 64-bit with systemd+kdbus from git
> >>> - system booted with the "kdbus" and "single" kernel command line arguments
> >>>
> >>> ============================================================================
> >>> *** single cpu performance: .
> >>>
> >>> SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
> >>>
> >>> 1 32580 16390 32580 192007 5.89
> >>> 2 40870 16960 40870 191730 4.69
> >>> 4 40750 16870 40750 190938 4.69
> >>> 8 40930 16950 40930 191234 4.67
> >>> 16 40290 17150 40290 192041 4.77
> >>> 32 40220 18050 40220 191963 4.77
> >>> 64 40280 16930 40280 192183 4.77
> >>> 128 40530 17440 40530 191649 4.73
> >>> 256 40610 17610 40610 190405 4.69
> >>> 512 40770 16690 40770 188671 4.63
> >>> 1024 40670 17840 40670 185819 4.57
> >>> 2048 40510 17780 40510 181050 4.47
> >>> 4096 39610 17330 39610 154303 3.90
> >>> 8192 38000 16540 38000 121710 3.20
> >>> 16384 35900 15050 35900 80921 2.25
> >>> 32768 31300 13020 31300 54062 1.73
> >>> 65536 24300 9940 24300 27574 1.13
> >>> 131072 16730 6820 16730 14886 0.89
> >>> 262144 4420 4080 4420 6888 1.56
> >>> 524288 1660 2040 2040 2781 1.36
> >>> 1048576 800 950 950 1231 1.30
> >>> 2097152 310 490 490 475 0.97
> >>> 4194304 150 240 240 227 0.95
> >>>
> >>> *** dual cpu performance: .
> >>>
> >>> SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
> >>>
> >>> 1 31680 14000 31680 104664 3.30
> >>> 2 34960 14290 34960 104926 3.00
> >>> 4 34930 14050 34930 104659 3.00
> >>> 8 24610 13300 24610 104058 4.23
> >>> 16 33840 14740 33840 103800 3.07
> >>> 32 33880 14400 33880 103917 3.07
> >>> 64 34180 14220 34180 103349 3.02
> >>> 128 34540 14260 34540 102622 2.97
> >>> 256 37820 14240 37820 102076 2.70
> >>> 512 37570 14270 37570 99105 2.64
> >>> 1024 37570 14780 37570 96010 2.56
> >>> 2048 21640 13330 21640 89602 4.14
> >>> 4096 23430 13120 23430 73682 3.14
> >>> 8192 34350 12300 34350 59827 1.74
> >>> 16384 25180 10560 25180 43808 1.74
> >>> 32768 20210 9700 20210 21112 1.04
> >>> 65536 15440 7820 15440 10771 0.70
> >>> 131072 11630 5670 11630 5775 0.50
> >>> 262144 4080 3730 4080 3012 0.74
> >>> 524288 1830 2040 2040 1421 0.70
> >>> 1048576 810 950 950 631 0.66
> >>> 2097152 310 490 490 269 0.55
> >>> 4194304 150 240 240 133 0.55
> >>> ============================================================================
> >>>
> >>> I ran the tests twice - once using the same cpu for client and server (via cpu
> >>> affinity) and once using a different cpu for client and server.
> >>>
> >>> The SIZE, COPY and MEMFD column are produced by "test-bus-kernel-benchmark
> >>> chart", the KDBUS-MAX column is the maximum of the COPY and MEMFD column. So
> >>> this is the effective number of roundtrips that kdbus is able to do at that
> >>> SIZE. The IBENCH column is the effective number of roundtrips that ibench can
> >>> do at that SIZE.
> >>>
> >>> For many relevant cases, ibench outperforms kdbus by a lot. The SPEEDUP factor
> >>> indicates how much faster ibench is than kdbus. For small to medium array
> >>> sizes, ibench always wins (sometimes by a lot). For instance, when passing a
> >>> 4 KiB array from client to server and back, ibench is 3.90 times faster if client
> >>> and server live on the same cpu, and 3.14 times faster if client and server
> >>> live on different cpus.
> >>>
> >>> I'm bringing this up now because it would be sad if kdbus became part of the
> >>> kernel and universally available, but application developers would still build
> >>> their own protocols for performance reasons. And some things that may need to
> >>> be changed to make kdbus run as fast as ibench may be backward incompatible at
> >>> some level, so it may be better to do this now rather than later on.
> >>>
> >>> The program "ibench" I wrote to provide a performance comparison for the
> >>> "test-bus-kernel-benchmark" program can be downloaded at
> >>>
> >>> http://space.twc.de/~stefan/download/ibench.c
> >>>
> >>> As a final note, ibench also supports using a socketpair() for communication
> >>> between client and server via a #define at the top, but pipe() communication was
> >>> faster in my test setup.
> >>
> >> Pipes are not interesting for general purpose D-Bus IPC; with a pipe
> >> the memory can "move" from one client to the other, because it is no
> >> longer needed in the process that fills the pipe.
Well, since you fill a pipe using write() system calls, the memory is still
available in the calling process, so it doesn't really "move".
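Just to illustrate that point with a tiny standalone example (this is plain
POSIX behaviour, not code from ibench):

/* write() copies the bytes into the pipe buffer; the caller's buffer
 * stays untouched and can be reused right away. */
#include <assert.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
  int fds[2];
  char in[6] = "hello", out[6];

  pipe (fds);
  write (fds[1], in, sizeof (in));      /* data is copied, not moved */
  assert (strcmp (in, "hello") == 0);   /* our copy is still here */

  in[0] = 'H';                          /* modifying it afterwards ... */
  read (fds[0], out, sizeof (out));
  assert (strcmp (out, "hello") == 0);  /* ... does not affect the reader */
  return 0;
}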
> >> Pipes are a model out-of-focus for kdbus; using pipes where pipes are
> >> the appropriate IPC mechanism is just fine, there is no competition,
> >> and being 5 times slower than simple pipes is a very good number for
> >> kdbus.
> >>
> >> Kdbus is a low-level implementation for D-Bus, not much else, it will
> >> not try to cover all sorts of specialized IPC use cases.
No, but a pipe based IPC system like ibench gives a good performance baseline
to compare kdbus against. It's probably acceptable if kdbus is somewhat slower,
because it also provides a nicer API for developers, and other benefits.
Certainly I am not proposing to replace every pipe on your system with kdbus.
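For reference, the round-trip pattern that ibench measures looks roughly like
this; a stripped-down sketch for illustration, not the actual ibench code (see
the download link above for that):

/* Pipe ping-pong between a client and a server process; poll() stands in
 * for the main loop wakeup. Short reads/writes are ignored to keep the
 * sketch small (fine for sizes up to PIPE_BUF). */
#include <poll.h>
#include <string.h>
#include <unistd.h>

#define MSG_SIZE 4096
#define ROUNDS   100000

/* block until fd becomes readable, like a main loop would */
static void wait_readable (int fd)
{
  struct pollfd p = { .fd = fd, .events = POLLIN };
  poll (&p, 1, -1);
}

int main (void)
{
  int c2s[2], s2c[2];   /* client -> server and server -> client pipes */
  char buf[MSG_SIZE];

  pipe (c2s);
  pipe (s2c);

  if (fork () == 0)     /* server: echo every request back */
    {
      close (c2s[1]);
      close (s2c[0]);
      for (;;)
        {
          wait_readable (c2s[0]);
          if (read (c2s[0], buf, MSG_SIZE) <= 0)
            return 0;   /* client exited, its write end is gone */
          write (s2c[1], buf, MSG_SIZE);
        }
    }

  close (c2s[0]);
  close (s2c[1]);
  memset (buf, 'x', MSG_SIZE);   /* client: send request, wait for reply */
  for (int i = 0; i < ROUNDS; i++)
    {
      write (c2s[1], buf, MSG_SIZE);
      wait_readable (s2c[0]);
      read (s2c[0], buf, MSG_SIZE);
    }
  return 0;
}

The client reuses the same buffer for every round trip, which is exactly the
kind of thing a homemade IPC solution would do.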
> > There is also a benchmark in the kdbus repo:
> > ./test/test-kdbus-benchmark
> >
> > It is probably better to compare that, as it does not include any of
> > the higher-level D-Bus overhead from the userspace library, it
> > operates on the raw kernel kdbus interface and is quite a lot faster
> > than the test in the systemd repo.
>
> Fixed 8k message sizes in all three tools, with a concurrent CPU setup
> produces on an Intel i7 2.90GHz:
> ibench: 55.036 - 128.807 transactions/sec
> test-kdbus-benchmark: 73.356 - 82.654 transactions/sec
> test-bus-kernel-benchmark: 23.290 - 27.580 transactions/sec
>
> The test-kdbus-benchmark runs the full-featured kdbus, including
> reliability/integrity checks, header parsing, user accounting,
> priority queue handling, message/connection metadata handling.
Right. So what I measured is a combination of a userspace library for sending
dbus requests *and* the kernel module. It took me a while to figure out how to
do it, but I wrote a benchmark called "kdbench", based on test-kdbus-benchmark
to test the kernel module itself. It's available here:
http://space.twc.de/~stefan/download/kdbench-0.1.tar.xz
The results look a lot better than what systemd's test-bus-kernel-benchmark
shows. Here they are (the test setup is the same one used to produce the
original table):
*** single cpu performance:
SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
1 136586 54022 136586 185287 1.36
2 137723 54039 137723 186406 1.35
4 138347 54257 138347 184141 1.33
8 137564 54012 137564 183644 1.33
16 137424 54133 137424 183816 1.34
32 137510 63808 137510 184159 1.34
64 138155 54403 138155 182732 1.32
128 137818 54174 137818 183198 1.33
256 136872 54152 136872 182393 1.33
512 137784 54266 137784 179457 1.30
1024 136326 54101 136326 177112 1.30
2048 126027 65934 126027 173540 1.38
4096 122404 53199 122404 154953 1.27
8192 118646 50730 118646 123660 1.04
16384 98507 44760 98507 88211 0.90
32768 72168 47317 72168 53251 0.74
65536 43220 32331 43220 27061 0.63
131072 25855 20306 25855 14760 0.57
262144 11263 12051 12051 6669 0.55
524288 4483 5622 5622 2986 0.53
1048576 1883 2719 2719 1247 0.46
2097152 597 1216 1216 475 0.39
4194304 291 460 460 231 0.50
*** dual cpu performance:
SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
1 64775 33961 64775 101528 1.57
2 47513 31938 47513 101656 2.14
4 46265 31082 46265 100368 2.17
8 49131 39762 49131 100166 2.04
16 66666 45065 66666 100304 1.50
32 66748 45002 66748 100856 1.51
64 66329 34360 66329 99619 1.50
128 47916 31799 47916 100186 2.09
256 48949 44929 48949 99465 2.03
512 65490 42430 65490 97803 1.49
1024 47619 31568 47619 95415 2.00
2048 47217 37775 47217 89143 1.89
4096 62892 34186 62892 74852 1.19
8192 50473 28641 50473 60856 1.21
16384 47867 31493 47867 32303 0.67
32768 36101 22638 36101 21008 0.58
65536 22703 18689 22703 10820 0.48
131072 16001 13902 16001 5829 0.36
262144 9780 9486 9780 2986 0.31
524288 4164 5141 5141 1404 0.27
1048576 1927 2687 2687 640 0.24
2097152 593 1226 1226 282 0.23
4194304 286 461 461 139 0.30
As you can see, the SPEEDUP column looks a lot better than above. Starting
at content sizes of 16384 bytes, kdbus even outperforms ibench's pipe() approach.
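To make the SPEEDUP column explicit: it is the IBENCH roundtrip count divided
by KDBUS-MAX. For example, for the 4096 byte row of the single cpu table above:

  SPEEDUP = IBENCH / KDBUS-MAX = 154953 / 122404 ~= 1.27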
The changes of "kdbench" compared to test-kdbus-benchmark are:
* kdbench uses two processes (instead of just one)
* kdbench sends either a memfd or a vector (the old test sends both,
which is slower)
* kdbench uses blocking synchronous method calls
* kdbench loops over payload sizes, memfd usage and cpu affinities,
so it just needs to be run once (like ibench; the affinity handling is
sketched after this list)
* kdbench reuses memfds, so it doesn't need to create a memfd per
request (which is slower)
* kdbench reuses message payload memory, so it doesn't need to malloc()
per request
* kdbench doesn't attach a full set of metadata to every message; it just
requests a timestamp per message
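The cpu affinity part is nothing special, just the usual sched_setaffinity()
calls; roughly like this (an illustrative sketch, not the exact kdbench code):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* pin the calling process to one cpu */
static void pin_to_cpu (int cpu)
{
  cpu_set_t set;

  CPU_ZERO (&set);
  CPU_SET (cpu, &set);
  sched_setaffinity (0, sizeof (set), &set);   /* 0 = calling process */
}

int main (void)
{
  int dual = 1;                   /* 0: "single cpu" run, 1: "dual cpu" run */

  if (fork () == 0)
    {
      pin_to_cpu (dual ? 1 : 0);  /* server process */
      /* ... answer requests ... */
      _exit (0);
    }

  pin_to_cpu (0);                 /* client process */
  /* ... run the benchmark loop over all payload sizes ... */
  return 0;
}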
The results show that, used properly, the kdbus module can be quite fast.
Still, the kernel module's performance will not be the only thing that
influences application performance in the future. In fact, if I understand it
correctly, the systemd dbus library is not meant to be used outside of systemd,
so other developers will write kdbus-enabled libraries.
So in the end, it won't matter much whether the systemd dbus library (the one
that test-bus-kernel-benchmark uses) is fast. Maybe it can simply keep the
status of a slow reference implementation.
But to figure out what performance applications can expect, the other library
implementations of dbus on top of kdbus would need to be measured. I'd
personally like to see a good & fast C++ library for using kdbus.
> Perf output is attached for all three tools, which show that
> test-bus-kernel-benchmark needs to do a lot of other things not
> directly related to the raw memory copy performance, and it should not
> be directly compared.
Right, I'll just briefly comment below.
> 2.05% test-bus-kernel libc-2.19.90.so [.] _int_malloc
> 1.84% test-bus-kernel libc-2.19.90.so [.] vfprintf
> 1.64% test-bus-kernel test-bus-kernel-benchmark [.] bus_message_parse_fields
> 1.51% test-bus-kernel libc-2.19.90.so [.] memset
> 1.40% test-bus-kernel libc-2.19.90.so [.] _int_free
> 1.17% test-bus-kernel [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 1.12% test-bus-kernel libc-2.19.90.so [.] malloc_consolidate
> 1.11% test-bus-kernel [kernel.kallsyms] [k] mutex_lock
> 1.05% test-bus-kernel test-bus-kernel-benchmark [.] bus_kernel_make_message
> 1.04% test-bus-kernel [kernel.kallsyms] [k] kfree
> 0.94% test-bus-kernel libc-2.19.90.so [.] free
> 0.90% test-bus-kernel libc-2.19.90.so [.] __GI___strcmp_ssse3
> 0.88% test-bus-kernel test-bus-kernel-benchmark [.] message_extend_fields
> 0.83% test-bus-kernel [kdbus] [k] kdbus_handle_ioctl
> 0.83% test-bus-kernel libc-2.19.90.so [.] malloc
> 0.79% test-bus-kernel [kernel.kallsyms] [k] mutex_unlock
> 0.76% test-bus-kernel test-bus-kernel-benchmark [.] BUS_MESSAGE_IS_GVARIANT
> 0.73% test-bus-kernel libc-2.19.90.so [.] __libc_calloc
> 0.72% test-bus-kernel libc-2.19.90.so [.] memchr
> 0.71% test-bus-kernel [kdbus] [k] kdbus_conn_kmsg_send
> 0.67% test-bus-kernel test-bus-kernel-benchmark [.] buffer_peek
> 0.65% test-bus-kernel [kernel.kallsyms] [k] update_cfs_shares
> 0.58% test-bus-kernel [kernel.kallsyms] [k] system_call_after_swapgs
> 0.57% test-bus-kernel test-bus-kernel-benchmark [.] service_name_is_valid
> 0.56% test-bus-kernel test-bus-kernel-benchmark [.] build_struct_offsets
> 0.55% test-bus-kernel [kdbus] [k] kdbus_pool_copy
What can be seen here is that many of the performance problems are not caused
by the kernel. A lot of time is spent in vfprintf(), malloc() and free(), but
also in other userspace functionality. As long as this has the status of a
non-optimized reference implementation, that is ok. But if applications that
send/receive lots of kdbus messages are to be implemented on top of it, the
code can probably be improved.
> 3.95% test-kdbus-benc [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 2.25% test-kdbus-benc [kernel.kallsyms] [k] clear_page_c_e
> 2.14% test-kdbus-benc [kernel.kallsyms] [k] _raw_spin_lock
> 1.78% test-kdbus-benc [kernel.kallsyms] [k] kfree
> 1.65% test-kdbus-benc [kernel.kallsyms] [k] mutex_lock
> 1.55% test-kdbus-benc [kernel.kallsyms] [k] get_page_from_freelist
> 1.46% test-kdbus-benc [kernel.kallsyms] [k] page_fault
> 1.40% test-kdbus-benc [kernel.kallsyms] [k] mutex_unlock
> 1.33% test-kdbus-benc [kernel.kallsyms] [k] memset
> 1.27% test-kdbus-benc [kernel.kallsyms] [k] shmem_getpage_gfp
> 1.16% test-kdbus-benc [kernel.kallsyms] [k] find_get_page
> 1.05% test-kdbus-benc [kernel.kallsyms] [k] memcpy
> 1.03% test-kdbus-benc [kernel.kallsyms] [k] set_page_dirty
> 1.00% test-kdbus-benc [kernel.kallsyms] [k] system_call
> 0.94% test-kdbus-benc [kernel.kallsyms] [k] system_call_after_swapgs
> 0.93% test-kdbus-benc [kernel.kallsyms] [k] kmem_cache_alloc
> 0.93% test-kdbus-benc test-kdbus-benchmark [.] timeval_diff
> 0.90% test-kdbus-benc [kernel.kallsyms] [k] page_waitqueue
> 0.86% test-kdbus-benc libpthread-2.19.90.so [.] __libc_close
> 0.83% test-kdbus-benc [kernel.kallsyms] [k] __call_rcu.constprop.63
> 0.81% test-kdbus-benc [kdbus] [k] kdbus_pool_copy
> 0.78% test-kdbus-benc [kernel.kallsyms] [k] strlen
> 0.77% test-kdbus-benc [kernel.kallsyms] [k] unlock_page
> 0.77% test-kdbus-benc [kdbus] [k] kdbus_meta_append
> 0.76% test-kdbus-benc [kernel.kallsyms] [k] find_lock_page
> 0.71% test-kdbus-benc test-kdbus-benchmark [.] handle_echo_reply
> 0.71% test-kdbus-benc [kernel.kallsyms] [k] __kmalloc
> 0.67% test-kdbus-benc [kernel.kallsyms] [k] unmap_single_vma
> 0.67% test-kdbus-benc [kernel.kallsyms] [k] flush_tlb_mm_range
> 0.65% test-kdbus-benc [kernel.kallsyms] [k] __fget_light
> 0.63% test-kdbus-benc [kernel.kallsyms] [k] fput
> 0.63% test-kdbus-benc [kdbus] [k] kdbus_handle_ioctl
What you can see here is that most of the work happens in the kernel, which is
a good sign. However, the way the kernel module is used creates extra work.
Look at the list of changes I made above and you'll see what could be changed
to get even better performance for this kind of test, memfd recycling for
instance.
> 16.09% ibench [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> 4.76% ibench ibench [.] main
> 2.85% ibench [kernel.kallsyms] [k] pipe_read
> 2.81% ibench [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 2.31% ibench [kernel.kallsyms] [k] update_cfs_shares
> 2.19% ibench [kernel.kallsyms] [k] native_write_msr_safe
> 2.03% ibench [kernel.kallsyms] [k] mutex_unlock
> 2.02% ibench [kernel.kallsyms] [k] resched_task
> 1.74% ibench [kernel.kallsyms] [k] __schedule
> 1.67% ibench [kernel.kallsyms] [k] mutex_lock
> 1.61% ibench [kernel.kallsyms] [k] do_sys_poll
> 1.57% ibench [kernel.kallsyms] [k] get_page_from_freelist
> 1.46% ibench [kernel.kallsyms] [k] __fget_light
> 1.34% ibench [kernel.kallsyms] [k] update_rq_clock.part.83
> 1.28% ibench [kernel.kallsyms] [k] fsnotify
> 1.28% ibench [kernel.kallsyms] [k] enqueue_entity
> 1.28% ibench [kernel.kallsyms] [k] update_curr
> 1.25% ibench [kernel.kallsyms] [k] system_call
> 1.25% ibench [kernel.kallsyms] [k] __list_del_entry
> 1.22% ibench [kernel.kallsyms] [k] _raw_spin_lock
> 1.22% ibench [kernel.kallsyms] [k] system_call_after_swapgs
> 1.21% ibench [kernel.kallsyms] [k] task_waking_fair
> 1.18% ibench [kernel.kallsyms] [k] poll_schedule_timeout
> 1.17% ibench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 1.14% ibench [kernel.kallsyms] [k] __alloc_pages_nodemask
> 1.06% ibench [kernel.kallsyms] [k] enqueue_task_fair
> 0.95% ibench [kernel.kallsyms] [k] pipe_write
> 0.88% ibench [kernel.kallsyms] [k] do_sync_read
> 0.87% ibench [kernel.kallsyms] [k] dequeue_entity
Well, this really is about as good as you can get with pipes, I suppose. Almost
every cpu cycle is spent in the kernel.
And here, finally, is kdbench fixed at a payload size of 8192. These numbers
are not directly comparable to yours, because you probably have a different
system, but they are still ok for an approximate comparison.
# ========
# arch : x86_64
# cpudesc : AMD Phenom(tm) 9850 Quad-Core Processor
# ========
#
6.03% kdbench [kernel.kallsyms] [k] copy_user_generic_string
4.95% kdbench libc-2.18.so [.] memset
3.47% kdbench [unknown] [.] 0x0000000000401498
2.52% kdbench [kernel.kallsyms] [k] kfree
2.34% kdbench [kernel.kallsyms] [k] __schedule
2.31% kdbench [kernel.kallsyms] [k] shmem_getpage_gfp
2.30% kdbench [kdbus] [k] kdbus_pool_copy
2.10% kdbench [kernel.kallsyms] [k] memset
1.83% kdbench libc-2.18.so [.] __GI___ioctl
1.60% kdbench [kernel.kallsyms] [k] __set_page_dirty_no_writeback
1.55% kdbench [kernel.kallsyms] [k] _cond_resched
1.50% kdbench [kernel.kallsyms] [k] system_call
1.50% kdbench [kdbus] [k] kdbus_conn_kmsg_send
1.45% kdbench [kernel.kallsyms] [k] __kmalloc
1.43% kdbench [kernel.kallsyms] [k] kmem_cache_alloc_trace
1.42% kdbench [kdbus] [k] kdbus_conn_queue_alloc
1.40% kdbench [kernel.kallsyms] [k] __switch_to
1.35% kdbench [kernel.kallsyms] [k] find_lock_page
1.33% kdbench [kernel.kallsyms] [k] sysret_check
1.24% kdbench [kdbus] [k] kdbus_handle_ioctl
1.23% kdbench [kdbus] [k] kdbus_kmsg_new_from_user
1.13% kdbench [kernel.kallsyms] [k] mutex_lock
1.02% kdbench [kernel.kallsyms] [k] update_curr
1.00% kdbench [kernel.kallsyms] [k] unmap_single_vma
1.00% kdbench [kernel.kallsyms] [k] find_get_page
0.97% kdbench [kernel.kallsyms] [k] page_fault
0.93% kdbench [kernel.kallsyms] [k] ktime_get_ts
0.90% kdbench [kernel.kallsyms] [k] update_cfs_shares
Besides memset(), which is used to "generate" the data to send, this one again
spends all its cpu time in the kernel. I'm not sure whether the kernel kdbus
module could be improved upon here.
Cu... Stefan
--
Stefan Westerfeld, http://space.twc.de/~stefan