[systemd-devel] kdbus vs. pipe based ipc performance
Kay Sievers
kay at vrfy.org
Mon Mar 3 20:36:16 PST 2014
On Tue, Mar 4, 2014 at 5:00 AM, Kay Sievers <kay at vrfy.org> wrote:
> On Mon, Mar 3, 2014 at 11:06 PM, Kay Sievers <kay at vrfy.org> wrote:
>> On Mon, Mar 3, 2014 at 10:35 PM, Stefan Westerfeld <stefan at space.twc.de> wrote:
>>> First of all: I'd really like to see kdbus used as a general-purpose IPC
>>> layer, so that developers working on client/server software no longer
>>> need to build their own homemade IPC from primitives like sockets.
>>>
>>> Now kdbus is advertised as a high-performance IPC solution, and compared to
>>> the traditional dbus approach this may well be true. But are the numbers that
>>>
>>> $ test-bus-kernel-benchmark chart
>>>
>>> produces impressive? To put it another way: will developers working on
>>> client/server software happily accept kdbus because it performs as well as a
>>> homemade IPC solution would? Or does kdbus add overhead to a degree that some
>>> applications can't accept?
>>>
>>> To answer this, I wrote a program called "ibench" which passes messages between
>>> a client and a server, but instead of using kdbus it uses traditional pipes.
>>> To simulate main loop integration, it uses poll() wherever a normal client or
>>> server application would go into its main loop and wait to be woken up by
>>> file descriptor activity.
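>>>
>>> In essence the transport is a plain write()/poll()/read() ping-pong over two
>>> pipes, one per direction. A minimal sketch of the client side of that pattern
>>> (illustrative only, not the actual ibench code; error handling omitted, and
>>> the server side simply reads the whole request and writes it back unchanged):
>>>
>>>   #include <poll.h>
>>>   #include <unistd.h>
>>>
>>>   /* one request/reply roundtrip, client side (sketch, not ibench itself) */
>>>   static void roundtrip(int to_server, int from_server, char *buf, size_t size)
>>>   {
>>>           struct pollfd pfd = { .fd = from_server, .events = POLLIN };
>>>           size_t done = 0;
>>>
>>>           write(to_server, buf, size);          /* send the request */
>>>
>>>           while (done < size) {                 /* collect the echoed reply */
>>>                   poll(&pfd, 1, -1);            /* main-loop style wakeup */
>>>                   ssize_t n = read(from_server, buf + done, size - done);
>>>                   if (n <= 0)
>>>                           break;
>>>                   done += n;
>>>           }
>>>   }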
>>>
>>> Now here are the results I obtained using
>>>
>>> - AMD Phenom(tm) 9850 Quad-Core Processor
>>> - running Fedora 20 64-bit with systemd+kdbus from git
>>> - system booted with the "kdbus" and "single" kernel command line arguments
>>>
>>> ============================================================================
>>> *** single cpu performance: .
>>>
>>> SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
>>>
>>> 1 32580 16390 32580 192007 5.89
>>> 2 40870 16960 40870 191730 4.69
>>> 4 40750 16870 40750 190938 4.69
>>> 8 40930 16950 40930 191234 4.67
>>> 16 40290 17150 40290 192041 4.77
>>> 32 40220 18050 40220 191963 4.77
>>> 64 40280 16930 40280 192183 4.77
>>> 128 40530 17440 40530 191649 4.73
>>> 256 40610 17610 40610 190405 4.69
>>> 512 40770 16690 40770 188671 4.63
>>> 1024 40670 17840 40670 185819 4.57
>>> 2048 40510 17780 40510 181050 4.47
>>> 4096 39610 17330 39610 154303 3.90
>>> 8192 38000 16540 38000 121710 3.20
>>> 16384 35900 15050 35900 80921 2.25
>>> 32768 31300 13020 31300 54062 1.73
>>> 65536 24300 9940 24300 27574 1.13
>>> 131072 16730 6820 16730 14886 0.89
>>> 262144 4420 4080 4420 6888 1.56
>>> 524288 1660 2040 2040 2781 1.36
>>> 1048576 800 950 950 1231 1.30
>>> 2097152 310 490 490 475 0.97
>>> 4194304 150 240 240 227 0.95
>>>
>>> *** dual cpu performance: .
>>>
>>> SIZE COPY MEMFD KDBUS-MAX IBENCH SPEEDUP
>>>
>>> 1 31680 14000 31680 104664 3.30
>>> 2 34960 14290 34960 104926 3.00
>>> 4 34930 14050 34930 104659 3.00
>>> 8 24610 13300 24610 104058 4.23
>>> 16 33840 14740 33840 103800 3.07
>>> 32 33880 14400 33880 103917 3.07
>>> 64 34180 14220 34180 103349 3.02
>>> 128 34540 14260 34540 102622 2.97
>>> 256 37820 14240 37820 102076 2.70
>>> 512 37570 14270 37570 99105 2.64
>>> 1024 37570 14780 37570 96010 2.56
>>> 2048 21640 13330 21640 89602 4.14
>>> 4096 23430 13120 23430 73682 3.14
>>> 8192 34350 12300 34350 59827 1.74
>>> 16384 25180 10560 25180 43808 1.74
>>> 32768 20210 9700 20210 21112 1.04
>>> 65536 15440 7820 15440 10771 0.70
>>> 131072 11630 5670 11630 5775 0.50
>>> 262144 4080 3730 4080 3012 0.74
>>> 524288 1830 2040 2040 1421 0.70
>>> 1048576 810 950 950 631 0.66
>>> 2097152 310 490 490 269 0.55
>>> 4194304 150 240 240 133 0.55
>>> ============================================================================
>>>
>>> I ran the tests twice - once using the same cpu for client and server (via cpu
>>> affinity) and once using a different cpu for client and server.
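>>>
>>> (The pinning itself only needs a one-cpu affinity mask set up before the
>>> benchmark loop starts, e.g. with taskset or directly via sched_setaffinity();
>>> a sketch of the latter, with an illustrative helper name:)
>>>
>>>   #define _GNU_SOURCE
>>>   #include <sched.h>
>>>
>>>   /* pin the calling process to a single cpu; client and server get either
>>>    * the same cpu number or two different ones (illustrative sketch) */
>>>   static void pin_to_cpu(int cpu)
>>>   {
>>>           cpu_set_t set;
>>>
>>>           CPU_ZERO(&set);
>>>           CPU_SET(cpu, &set);
>>>           sched_setaffinity(0, sizeof(set), &set);
>>>   }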
>>>
>>> The SIZE, COPY and MEMFD columns are produced by "test-bus-kernel-benchmark
>>> chart"; the KDBUS-MAX column is the maximum of the COPY and MEMFD columns, so
>>> it is the effective number of roundtrips that kdbus is able to do at that
>>> SIZE. The IBENCH column is the effective number of roundtrips that ibench can
>>> do at that SIZE.
>>>
>>> For many relevant cases, ibench outperforms kdbus by a wide margin. The SPEEDUP
>>> factor indicates how much faster ibench is than kdbus. For small to medium
>>> array sizes, ibench always wins, sometimes by a lot. For instance, passing a
>>> 4 KiB array from client to server and back, ibench is 3.90 times faster if
>>> client and server live on the same cpu, and 3.14 times faster if they live on
>>> different cpus.
>>>
>>> I'm bringing this up now because it would be sad if kdbus became part of the
>>> kernel and universally available, but application developers still built their
>>> own protocols for performance reasons. Some of the changes needed to make kdbus
>>> run as fast as ibench may be backward incompatible at some level, so it may be
>>> better to make them now rather than later.
>>>
>>> The program "ibench" I wrote to provide a performance comparison with the
>>> "test-bus-kernel-benchmark" program can be downloaded at
>>>
>>> http://space.twc.de/~stefan/download/ibench.c
>>>
>>> As a final note, ibench also supports using a socketpair() for the communication
>>> between client and server, selected via a #define at the top of the file, but
>>> pipe() communication was faster in my test setup.
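>>>
>>> The switch is just a compile-time choice of how the file descriptors of each
>>> channel are created, along these lines (the macro and function names here are
>>> illustrative, not necessarily what ibench.c uses):
>>>
>>>   #include <unistd.h>
>>>   #include <sys/socket.h>
>>>
>>>   /* create one channel, used in a single direction (illustrative sketch) */
>>>   static int make_channel(int fds[2])
>>>   {
>>>   #ifdef USE_SOCKETPAIR
>>>           return socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
>>>   #else
>>>           return pipe(fds);
>>>   #endif
>>>   }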
>>
>> Pipes are not interesting for general-purpose D-Bus IPC; with a pipe
>> the memory can "move" from one process to the other, because it is no
>> longer needed in the process that fills the pipe.
>>
>> Pipes are a model that is out of scope for kdbus. Using pipes where pipes
>> are the appropriate IPC mechanism is just fine; there is no competition,
>> and being 5 times slower than simple pipes is a very good number for kdbus.
>>
>> Kdbus is a low-level implementation of D-Bus, not much else; it will
>> not try to cover all sorts of specialized IPC use cases.
>
> There is also a benchmark in the kdbus repo:
> ./test/test-kdbus-benchmark
>
> It is probably better to compare against that, as it does not include any
> of the higher-level D-Bus overhead from the userspace library; it operates
> on the raw kernel kdbus interface and is quite a lot faster than the test
> in the systemd repo.

With a fixed 8k message size in all three tools and client and server running
concurrently on separate CPUs, an Intel i7 2.90 GHz produces:

  ibench:                     55,036 - 128,807 transactions/sec
  test-kdbus-benchmark:       73,356 -  82,654 transactions/sec
  test-bus-kernel-benchmark:  23,290 -  27,580 transactions/sec

test-kdbus-benchmark runs the full-featured kdbus code path, including
reliability/integrity checks, header parsing, user accounting, priority
queue handling, and message/connection metadata handling.

Perf output for all three tools is attached; it shows that
test-bus-kernel-benchmark has to do a lot of work that is not directly
related to raw memory copy performance, so it should not be compared
directly.

Kay
-------------- next part --------------
2.05% test-bus-kernel libc-2.19.90.so [.] _int_malloc
1.84% test-bus-kernel libc-2.19.90.so [.] vfprintf
1.64% test-bus-kernel test-bus-kernel-benchmark [.] bus_message_parse_fields
1.51% test-bus-kernel libc-2.19.90.so [.] memset
1.40% test-bus-kernel libc-2.19.90.so [.] _int_free
1.17% test-bus-kernel [kernel.kallsyms] [k] copy_user_enhanced_fast_string
1.12% test-bus-kernel libc-2.19.90.so [.] malloc_consolidate
1.11% test-bus-kernel [kernel.kallsyms] [k] mutex_lock
1.05% test-bus-kernel test-bus-kernel-benchmark [.] bus_kernel_make_message
1.04% test-bus-kernel [kernel.kallsyms] [k] kfree
0.94% test-bus-kernel libc-2.19.90.so [.] free
0.90% test-bus-kernel libc-2.19.90.so [.] __GI___strcmp_ssse3
0.88% test-bus-kernel test-bus-kernel-benchmark [.] message_extend_fields
0.83% test-bus-kernel [kdbus] [k] kdbus_handle_ioctl
0.83% test-bus-kernel libc-2.19.90.so [.] malloc
0.79% test-bus-kernel [kernel.kallsyms] [k] mutex_unlock
0.76% test-bus-kernel test-bus-kernel-benchmark [.] BUS_MESSAGE_IS_GVARIANT
0.73% test-bus-kernel libc-2.19.90.so [.] __libc_calloc
0.72% test-bus-kernel libc-2.19.90.so [.] memchr
0.71% test-bus-kernel [kdbus] [k] kdbus_conn_kmsg_send
0.67% test-bus-kernel test-bus-kernel-benchmark [.] buffer_peek
0.65% test-bus-kernel [kernel.kallsyms] [k] update_cfs_shares
0.58% test-bus-kernel [kernel.kallsyms] [k] system_call_after_swapgs
0.57% test-bus-kernel test-bus-kernel-benchmark [.] service_name_is_valid
0.56% test-bus-kernel test-bus-kernel-benchmark [.] build_struct_offsets
0.55% test-bus-kernel [kdbus] [k] kdbus_pool_copy
3.95% test-kdbus-benc [kernel.kallsyms] [k] copy_user_enhanced_fast_string
2.25% test-kdbus-benc [kernel.kallsyms] [k] clear_page_c_e
2.14% test-kdbus-benc [kernel.kallsyms] [k] _raw_spin_lock
1.78% test-kdbus-benc [kernel.kallsyms] [k] kfree
1.65% test-kdbus-benc [kernel.kallsyms] [k] mutex_lock
1.55% test-kdbus-benc [kernel.kallsyms] [k] get_page_from_freelist
1.46% test-kdbus-benc [kernel.kallsyms] [k] page_fault
1.40% test-kdbus-benc [kernel.kallsyms] [k] mutex_unlock
1.33% test-kdbus-benc [kernel.kallsyms] [k] memset
1.27% test-kdbus-benc [kernel.kallsyms] [k] shmem_getpage_gfp
1.16% test-kdbus-benc [kernel.kallsyms] [k] find_get_page
1.05% test-kdbus-benc [kernel.kallsyms] [k] memcpy
1.03% test-kdbus-benc [kernel.kallsyms] [k] set_page_dirty
1.00% test-kdbus-benc [kernel.kallsyms] [k] system_call
0.94% test-kdbus-benc [kernel.kallsyms] [k] system_call_after_swapgs
0.93% test-kdbus-benc [kernel.kallsyms] [k] kmem_cache_alloc
0.93% test-kdbus-benc test-kdbus-benchmark [.] timeval_diff
0.90% test-kdbus-benc [kernel.kallsyms] [k] page_waitqueue
0.86% test-kdbus-benc libpthread-2.19.90.so [.] __libc_close
0.83% test-kdbus-benc [kernel.kallsyms] [k] __call_rcu.constprop.63
0.81% test-kdbus-benc [kdbus] [k] kdbus_pool_copy
0.78% test-kdbus-benc [kernel.kallsyms] [k] strlen
0.77% test-kdbus-benc [kernel.kallsyms] [k] unlock_page
0.77% test-kdbus-benc [kdbus] [k] kdbus_meta_append
0.76% test-kdbus-benc [kernel.kallsyms] [k] find_lock_page
0.71% test-kdbus-benc test-kdbus-benchmark [.] handle_echo_reply
0.71% test-kdbus-benc [kernel.kallsyms] [k] __kmalloc
0.67% test-kdbus-benc [kernel.kallsyms] [k] unmap_single_vma
0.67% test-kdbus-benc [kernel.kallsyms] [k] flush_tlb_mm_range
0.65% test-kdbus-benc [kernel.kallsyms] [k] __fget_light
0.63% test-kdbus-benc [kernel.kallsyms] [k] fput
0.63% test-kdbus-benc [kdbus] [k] kdbus_handle_ioctl
16.09% ibench [kernel.kallsyms] [k] copy_user_enhanced_fast_string
4.76% ibench ibench [.] main
2.85% ibench [kernel.kallsyms] [k] pipe_read
2.81% ibench [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.31% ibench [kernel.kallsyms] [k] update_cfs_shares
2.19% ibench [kernel.kallsyms] [k] native_write_msr_safe
2.03% ibench [kernel.kallsyms] [k] mutex_unlock
2.02% ibench [kernel.kallsyms] [k] resched_task
1.74% ibench [kernel.kallsyms] [k] __schedule
1.67% ibench [kernel.kallsyms] [k] mutex_lock
1.61% ibench [kernel.kallsyms] [k] do_sys_poll
1.57% ibench [kernel.kallsyms] [k] get_page_from_freelist
1.46% ibench [kernel.kallsyms] [k] __fget_light
1.34% ibench [kernel.kallsyms] [k] update_rq_clock.part.83
1.28% ibench [kernel.kallsyms] [k] fsnotify
1.28% ibench [kernel.kallsyms] [k] enqueue_entity
1.28% ibench [kernel.kallsyms] [k] update_curr
1.25% ibench [kernel.kallsyms] [k] system_call
1.25% ibench [kernel.kallsyms] [k] __list_del_entry
1.22% ibench [kernel.kallsyms] [k] _raw_spin_lock
1.22% ibench [kernel.kallsyms] [k] system_call_after_swapgs
1.21% ibench [kernel.kallsyms] [k] task_waking_fair
1.18% ibench [kernel.kallsyms] [k] poll_schedule_timeout
1.17% ibench [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.14% ibench [kernel.kallsyms] [k] __alloc_pages_nodemask
1.06% ibench [kernel.kallsyms] [k] enqueue_task_fair
0.95% ibench [kernel.kallsyms] [k] pipe_write
0.88% ibench [kernel.kallsyms] [k] do_sync_read
0.87% ibench [kernel.kallsyms] [k] dequeue_entity