[Nice] Measuring NICE STUN performance

Mon Oct 8 00:01:06 PDT 2007

	Hello,

On Friday 05 October 2007 15:23:40, ext Christian Dickmann wrote:
> I am currently trying to measure the performance of the STUN Daemon.
> Initially I tried to use a modified stunbdc to generate a lot of Binding
> Requests per second. However, the client CPU consumption was much higher
> than server CPU consumption, so I would need a lot of clients to get the
> STUN Server to a significant amount of CPU load.

Clients are not expected to generate a large number of transactions. Also,
proper client-side handling requires some book-keeping that the server side
does not need, and it needs unpredictable transaction IDs to protect against
injection of spoofed STUN responses.

So, sending/receiving STUN packets should actually be pretty fast, but
initializing and destroying a STUN client transaction is pretty slow (it
involves hashing, memory allocation, and quite a few system calls...).
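
To illustrate the transaction ID point, here is a minimal sketch of an
unpredictable ID, in contrast to a sequence number. It is not taken from the
library: the name fill_transaction_id and the choice of /dev/urandom as the
randomness source are assumptions made for this example. The ID length is left
to the caller, since it depends on the STUN revision in use (16 bytes in
RFC 3489, 12 bytes after a fixed magic cookie in RFC 5389).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of any STUN library): fill id[0..len)
 * with bytes read from /dev/urandom, so that an off-path attacker cannot
 * guess the ID and inject a matching spoofed response.
 * Returns 0 on success, -1 on failure. */
int fill_transaction_id(unsigned char *id, size_t len)
{
    assert(id != NULL);

    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;

    size_t got = fread(id, 1, len, f);
    fclose(f);
    return (got == len) ? 0 : -1;
}
```

Opening and reading the device costs a few system calls per transaction, which
is exactly the kind of client-side overhead a hardcoded request avoids.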

> Thus, I decided to write a very very simple STUN client with a Binding
> Request hardcoded (and initially no attributes at all). I then just
> modify the transaction ID in place (and not randomly, more like a
> sequence number) and send the message as often as I need. With a forked
> process I recv the responses but do nothing else than counting them to
> make sure all requests have been answered. And in fact, this client has
> very very little CPU consumption so it seems to be perfect.

This is not safe. The client UDP socket receive queue might overflow, in which
case responses are dropped silently and the count of answered requests comes
out too low.
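
One possible mitigation, sketched below, is to ask for a larger receive buffer
on the counting socket. This only makes an overflow less likely, it does not
rule it out, so the sent and received counts still have to be compared. The
helper name grow_rcvbuf is made up for this example. SO_RCVBUF itself is a
standard socket option, but the kernel is free to clamp or adjust the
requested size.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: request a receive buffer of `want` bytes on
 * socket fd and return the size the kernel reports afterwards, or -1
 * on error. The reported size may differ from the request (the kernel
 * may clamp it to a system-wide limit or round it). */
int grow_rcvbuf(int fd, int want)
{
    assert(want > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof want) != 0)
        return -1;

    int got = 0;
    socklen_t len = sizeof got;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) != 0)
        return -1;
    return got;
}
```

For instance, grow_rcvbuf(fd, 1 << 20) asks for about 1 MiB. Comparing the
return value with the request shows whether the kernel honoured it.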

> However, I got a very strange problem. The CPU consumption of the STUN
> server is not at all stable, even though I confirmed that requests
> arrive at a fairly stable rate (by using tcpdump or even by counting
> received packets in the STUN server dgram_process() routine).

I have not tried to stress test the server code, but from personal experience,
tcpdump copes very badly with high packet rates. Its receive queue tends to
overflow quite fast (especially, but not only, if DNS lookups are not disabled
with -n).

> The CPU
> consumption of the STUN server is not just a little bit unstable, but
> drops to 0% for seconds, even though I am sending 20,000 Binding
> Requests per second. Then, for a very short period of time, the CPU load
> jumps to 50 or 80% or so just to drop back to 0%.

This is really weird. Re-reading the STUN binding discovery server code, the
fast path really only does:

loop:
recvmsg()
parse request...
format response...
sendmsg()
goto loop

without any memory allocation, extra system call or anything else.
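
For reference, one iteration of a loop of that shape could look like the sketch
below. This is a simplified illustration, not the actual server code: the
function name serve_one is made up, and the "response" is just the 20-byte
header with the type switched to Binding Response (0x0101) and no attributes,
whereas a real server would add the mapped address attribute.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <netinet/in.h>

enum { STUN_HDR_LEN = 20 };   /* fixed STUN header size */

/* Hypothetical, simplified fast path: handle one datagram on fd.
 * Returns 0 if a response was sent, -1 otherwise. */
int serve_one(int fd)
{
    unsigned char buf[1500];              /* stack buffer, no malloc() */
    struct sockaddr_storage peer;
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
    struct msghdr msg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof msg);
    msg.msg_name = &peer;
    msg.msg_namelen = sizeof peer;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < STUN_HDR_LEN)
        return -1;                        /* error or truncated header */

    /* "parse request": accept only a Binding Request (type 0x0001) */
    if (buf[0] != 0x00 || buf[1] != 0x01)
        return -1;

    /* "format response": Binding Response, attribute length 0,
     * transaction ID left untouched */
    buf[0] = 0x01;
    buf[1] = 0x01;
    buf[2] = 0x00;
    buf[3] = 0x00;

    /* msg_name/msg_namelen now hold the sender address, so the same
     * msghdr can be reused to reply */
    iov.iov_len = STUN_HDR_LEN;
    return (sendmsg(fd, &msg, 0) == STUN_HDR_LEN) ? 0 : -1;
}
```

Calling serve_one(fd) in an endless loop on a bound UDP socket gives the
structure described above: two system calls per request and no allocation.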

Random ideas:
- Is the CPU time spent in user or kernel land? The kernel UDP receive path
is not that well optimized...
- Do recvmsg()/sendmsg() return errors?
- Link-layer problems (I tend to favor loopback testing to avoid these).

-- 
Rémi Denis-Courmont