[PATCH 00/22] Micro optimizations for often used code paths

Wed Dec 29 11:27:12 PST 2010

The following patches aim to reduce useless work done by the X server in
frequently executed code paths. I also fixed a few unrelated bugs that I
spotted in the code while doing the optimizations.

The main optimization is registering block handlers only when there is work for
them. It is very common for block handlers to be called even when there is no
work to do. On ARM, doing nothing in these block handlers takes 20-40us when
the data isn't in the CPU caches (a very common case on ARM).
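
As a rough illustration of the idea only (not the code in the patches), here is
a minimal sketch in which a module registers its block handler when work is
queued and removes it again once the work is done. The handler table and the
names register_handler, remove_handler, run_block_handlers and WorkQueue are
all hypothetical stand-ins, not X server API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler table; a stand-in for the server's own bookkeeping. */
typedef void (*BlockHandlerFn)(void *data);

typedef struct {
    BlockHandlerFn fn;
    void *data;
} HandlerSlot;

#define MAX_HANDLERS 8
static HandlerSlot handlers[MAX_HANDLERS];
static size_t num_handlers;

/* Append a handler; returns 1 on success, 0 if the table is full. */
int register_handler(BlockHandlerFn fn, void *data)
{
    if (num_handlers == MAX_HANDLERS)
        return 0;
    handlers[num_handlers].fn = fn;
    handlers[num_handlers].data = data;
    num_handlers++;
    return 1;
}

/* Remove the first matching handler, keeping the others in order. */
void remove_handler(BlockHandlerFn fn, void *data)
{
    for (size_t i = 0; i < num_handlers; i++) {
        if (handlers[i].fn == fn && handlers[i].data == data) {
            for (size_t j = i + 1; j < num_handlers; j++)
                handlers[j - 1] = handlers[j];
            num_handlers--;
            return;
        }
    }
}

/* Called once per main-loop iteration before blocking. Walks backwards so
 * that a handler may remove itself without disturbing unvisited entries. */
void run_block_handlers(void)
{
    for (size_t i = num_handlers; i-- > 0; )
        handlers[i].fn(handlers[i].data);
}

/* Hypothetical module that only needs a block handler while work is queued. */
typedef struct {
    int pending;        /* work items waiting to be flushed */
    int registered;     /* non-zero while our handler is installed */
    int flushed;        /* total work items processed so far */
    int handler_calls;  /* how often the handler actually ran */
} WorkQueue;

void work_queue_block_handler(void *data)
{
    WorkQueue *q = data;

    q->handler_calls++;
    q->flushed += q->pending;
    q->pending = 0;

    /* Nothing left to do: stop paying for this handler on every iteration. */
    remove_handler(work_queue_block_handler, q);
    q->registered = 0;
}

void work_queue_add(WorkQueue *q)
{
    assert(q != NULL);
    q->pending++;
    if (!q->registered)
        q->registered = register_handler(work_queue_block_handler, q);
}
```

With this shape, an idle main loop calls run_block_handlers() with an empty
table, so the module's handler code and data are not touched at all.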

The next optimization is using NoopDDA everywhere a handler that does nothing
needs to be registered. This eliminates about 800ns (on ARM) for each
registered handler that does nothing.

The last change caches the result of LocalClient, which is called for each
DRI2 request. Caching eliminates about 9us for each DRI2 call. The largest win
comes from avoiding the malloc/free done for _XSERVTransGetPeerAddr.
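
The caching idea can be sketched roughly as follows. This is not the patch
itself: Client, slow_is_local and client_is_local are hypothetical names, the
"slow" check is an arbitrary placeholder for the real peer-address lookup, and
the sketch assumes the answer cannot change during the lifetime of a
connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Tri-state cache: zero-initialized clients start out as "unknown". */
enum { LOCAL_UNKNOWN = 0, LOCAL_NO, LOCAL_YES };

/* Hypothetical per-client record; not the server's ClientRec. */
typedef struct {
    int fd;
    int local_state;  /* cached answer, LOCAL_UNKNOWN until first query */
} Client;

static int slow_lookups;  /* counts how often the expensive path runs */

/* Placeholder for the expensive check (peer address lookup plus
 * malloc/free in the real server). The rule used here is arbitrary. */
bool slow_is_local(const Client *c)
{
    slow_lookups++;
    return c->fd % 2 == 0;
}

/* Cached wrapper: the expensive check runs at most once per client. */
bool client_is_local(Client *c)
{
    assert(c != NULL);
    if (c->local_state == LOCAL_UNKNOWN)
        c->local_state = slow_is_local(c) ? LOCAL_YES : LOCAL_NO;
    return c->local_state == LOCAL_YES;
}
```

A tri-state value is used rather than a plain boolean so that "not yet
computed" can be told apart from "not local".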

On-demand registered block handlers make two assumptions about how the X server
handles BlockHandlers:
1. BlockHandlers that are registered on demand are always registered after init.
This makes it safe to remove an on-demand handler later on, because all
on-demand handlers implement the unwrap/call/wrap sequence correctly.
2. CloseScreen is only called before the screen structure is freed.
The CloseScreen handler is traditionally used to remove BlockHandlers, but that
assumes CloseScreen is called in the same order in which the function pointer
wrapping happened. IMO it is better to trust that the function pointer is never
called after CloseScreen. That makes unwrapping the handlers in CloseScreen
pointless, which solves the problem that on-demand wrapped handlers are in
random order.
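
To show why the unwrap/call/wrap sequence makes later removal safe, here is a
simplified model of a wrapped function pointer. The Screen and Layer types and
the layer_a_*/layer_b_* functions are hypothetical and only mimic the wrapping
pattern; they are not the server's ScreenRec or any real extension. Layer A
wraps on demand and drops out by simply not re-wrapping; layer B, wrapped on
top of A, re-reads the pointer after the call and so picks up the change.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Screen Screen;
typedef void (*BlockHandlerProc)(Screen *screen);

/* Hypothetical screen with a single wrappable function pointer. */
struct Screen {
    BlockHandlerProc BlockHandler;
    int base_calls;
};

typedef struct {
    BlockHandlerProc saved;  /* the handler this layer wrapped */
    int wrapped;             /* non-zero while this layer is in the chain */
    int pending;             /* outstanding work for this layer */
    int calls;               /* how often this layer's handler ran */
} Layer;

static Layer layer_a;  /* wraps on demand */
static Layer layer_b;  /* wraps permanently, on top of A */

void base_block_handler(Screen *s)
{
    s->base_calls++;
}

void layer_a_block_handler(Screen *s)
{
    s->BlockHandler = layer_a.saved;       /* unwrap */
    layer_a.calls++;
    layer_a.pending = 0;                   /* do the queued work */
    s->BlockHandler(s);                    /* call down the chain */
    if (layer_a.pending > 0) {             /* wrap again only if needed */
        layer_a.saved = s->BlockHandler;
        s->BlockHandler = layer_a_block_handler;
    } else {
        layer_a.wrapped = 0;               /* stay out of the chain */
    }
}

/* Wrap the current handler the first time work is queued. */
void layer_a_queue_work(Screen *s)
{
    assert(s != NULL && s->BlockHandler != NULL);
    layer_a.pending++;
    if (!layer_a.wrapped) {
        layer_a.saved = s->BlockHandler;
        s->BlockHandler = layer_a_block_handler;
        layer_a.wrapped = 1;
    }
}

void layer_b_block_handler(Screen *s)
{
    s->BlockHandler = layer_b.saved;       /* unwrap */
    layer_b.calls++;
    s->BlockHandler(s);                    /* call down the chain */
    layer_b.saved = s->BlockHandler;       /* re-read: a lower layer may
                                            * have removed itself */
    s->BlockHandler = layer_b_block_handler;  /* wrap */
}

void layer_b_install(Screen *s)
{
    assert(s != NULL && s->BlockHandler != NULL);
    layer_b.saved = s->BlockHandler;
    s->BlockHandler = layer_b_block_handler;
    layer_b.wrapped = 1;
}
```

In this model, after one pass through B -> A -> base, layer B's saved pointer
refers directly to the base handler, so A costs nothing until it queues work
again. A layer that kept a stale saved pointer instead of re-reading it would
break this scheme.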

Then x11perf results, showing the difference in the "real world" on ARM:

x11perf -prop
Without patches

Sync time adjustment is 0.1290 msecs.

  60000 reps @   0.0983 msec ( 10200.0/sec): GetProperty
  60000 reps @   0.0981 msec ( 10200.0/sec): GetProperty
  60000 reps @   0.0982 msec ( 10200.0/sec): GetProperty
  60000 reps @   0.0982 msec ( 10200.0/sec): GetProperty
  60000 reps @   0.0981 msec ( 10200.0/sec): GetProperty
 300000 trep @   0.0982 msec ( 10200.0/sec): GetProperty

With patches

Sync time adjustment is 0.1232 msecs.

  60000 reps @   0.0903 msec ( 11100.0/sec): GetProperty
  60000 reps @   0.0903 msec ( 11100.0/sec): GetProperty
  60000 reps @   0.0904 msec ( 11100.0/sec): GetProperty
  60000 reps @   0.0903 msec ( 11100.0/sec): GetProperty
  60000 reps @   0.0904 msec ( 11100.0/sec): GetProperty
 300000 trep @   0.0904 msec ( 11100.0/sec): GetProperty

The gain here (about 8% per request) doesn't look as large as detailed
profiling traces from real-world applications would suggest. But it is clearly
visible in the profiles that execution time is highly dependent on the cache.
A round trip of 90-100us is a lot less than what real-world profiles show,
where WaitForSomething alone takes 100-200us.