[Xcb] profiling and performance

Wed May 3 09:58:56 PDT 2006

Hey,

>
> > for the evas bench, with xlib, I get 133 fps, and 131 fps with xcb.
>
> Those are so close to equal, that may just be measurement noise. :-)

Not really. Those are mean values for each engine. The Xlib engine is
always a bit above 133 fps, and the XCB one is always a bit above
131 fps, so it's not "noise". On my computer (it will vary from one
computer to another), I have always seen a difference of 2 fps.

>
> Donnie has covered the most important point: even if you can get XCB's
> top CPU-consumers to take no time at all, you'll get at best a 0.6%
> performance improvement. That's less than 1 fps for your benchmark.

That would bring the difference between Xlib and XCB down to about 1 fps,
which would be better :)
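
A quick check of that arithmetic, as a sketch only. The function name
max_fps_gain is hypothetical, and it assumes the frame rate scales
inversely with the time spent per frame:

```c
#include <assert.h>

/* Best-case frame-rate gain if a fraction of the run time disappeared.
 * Assumption: fps is inversely proportional to the time per frame, so
 * removing a fraction f of the time turns fps into fps / (1 - f). */
double max_fps_gain(double fps, double removable_fraction)
{
    assert(removable_fraction >= 0.0 && removable_fraction < 1.0);
    return fps / (1.0 - removable_fraction) - fps;
}
```

Calling max_fps_gain(131.0, 0.006) gives a gain below 1 fps, which
matches the "less than 1 fps" estimate quoted above.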

> > there are other xcb functions that take more time :
> >
> > write_block
> > _xcb_in_read
> > read_packet
>
> First observation: you're running an unoptimized XCB. Stop that. :-)

Ha, right. I compile XCB with -g and no optimisation flags. Turning
optimisation on may change things a bit.

> Maybe your performance differences would go away if you let the compiler
> inline functions like write_block and read_packet, and do all its other
> optimizations.

On my computer I usually use these flags:

-march=athlon-xp -O3 -ffast-math -pipe -funroll-loops
-fomit-frame-pointer -msse -mfpmath=sse,387

Do you find them reasonable?

> XCBSendRequest will be your most significant XCB function, as I
> expected, once write_block is inlined into it. If you can get debugging
> symbols for your libc -- for Debian, `apt-get install libc6-dbg` :-) --
> I expect you'll see memcpy in there too, although it'll be interesting
> to see where exactly memcpy shows up. Debugging symbols for your X
> server would be nice too, but unlikely to tell us anything about XCB.

I have an old Mandrake with Xorg 6.8. I don't think I have the debug
packages.

> I speculate that the _xcb_in_read and read_packet costs here are due to
> raster's opposition to threads. ;-) If you'd just put your event loop in
> a separate thread and let the OS block it until there was something to
> read, you wouldn't have to be polling for events all the time. Inlining
> and other optimizations will make a big difference here, but in the end
> I think syscalls will be the limiting factor for your event polling.

OK. Then I need to think a LOT about how to move the loop into a thread,
as I don't know at all how to do that :)

That can be a hard part for later, as ecore is not threaded at all, and
it is not thread-safe either.
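
To make the difference between polling and blocking concrete, here is a
minimal sketch using plain POSIX poll(). Nothing in it is XCB or ecore
API: wait_readable and wait_readable_demo are hypothetical names, and a
pipe stands in for the X connection's file descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd has data to read, 0 if not (timeout, or only a
 * hangup/error condition was reported), -1 if poll() itself failed.
 *   timeout_ms = 0  : return immediately, as a polling main loop does
 *   timeout_ms = -1 : sleep in the kernel until something arrives, as
 *                     a dedicated event thread could do */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p;
    int n;

    assert(fd >= 0);
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    n = poll(&p, 1, timeout_ms);
    if (n < 0)
        return -1;
    return (n > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Self-check on a pipe standing in for the X connection.
 * Returns 0 if both modes behave as described, -1 otherwise. */
int wait_readable_demo(void)
{
    int fds[2];
    int ok;

    if (pipe(fds) != 0)
        return -1;
    ok = wait_readable(fds[0], 0) == 0;         /* nothing written yet */
    ok = ok && write(fds[1], "x", 1) == 1;
    ok = ok && wait_readable(fds[0], -1) == 1;  /* data pending: no sleep */
    close(fds[0]);
    close(fds[1]);
    return ok ? 0 : -1;
}
```

With a zero timeout, every check costs a syscall even when nothing has
arrived. With an infinite timeout, the caller is only woken when there
is something to read, but it then has to live in its own thread so that
rendering is not stalled.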

> > it's conceivable :D How can I know that there are more requests than in
> > the xlib code ?
>
> After studying your profiling output, I don't really think this is the
> case. It does look like maybe you're walking the list of pixmap formats
> multiple times, though: if you're doing it once per frame that could
> have a small impact. The equivalent Xlib code may have cached something.
>
> It's interesting that you're calling both XCBPutImage and
> XCBShmPutImage. Is there a bug preventing you from always using shared
> memory?

Ha, indeed, I hadn't noticed that. I should look into it.

> The only reply you get with any interesting frequency is for
> GetInputFocus, presumably for XCBSync. The more XCBSync calls you can
> remove without changing the behavior of your program, the higher its
> throughput will be. Understanding when you can remove XCBSync calls is
> hard though.

Oulaaa, I don't know at all when they can be removed :D I know of one
call that I can remove. That's all.

Does the same apply to XCBFlush?

> To test the hypothesis that the XCB-using code is issuing more requests
> than the Xlib version, you could use Ethereal or xscope to log the
> requests and responses for both programs, and compare them. Or you could
> take the easy way :-) -- after every 25 frames, report the last sequence
> number sent. For Xlib, call NextRequest(dpy). For XCB, pick the cookie
> of the last request you've sent, and use cookie.sequence.
>
> These numbers won't be equal between the two programs, because Xlib
> issues extra requests automatically, and in both apps events will arrive
> randomly and cause random requests. But the numbers should increase at
> roughly the same rate in both applications.

OK, thank you.
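
A small sketch of the "easy way" described above, assuming the sequence
number is sampled once every 25 frames. The helper name
requests_per_frame is hypothetical. The two samples would come from
NextRequest(dpy) with Xlib, or from cookie.sequence of the last request
sent with XCB:

```c
#include <assert.h>
#include <stdint.h>

/* Average number of requests issued per frame between two samples
 * taken 'frames' frames apart. Unsigned 32-bit subtraction keeps the
 * difference correct even if the counter wrapped between the samples. */
double requests_per_frame(uint32_t prev_seq, uint32_t cur_seq,
                          uint32_t frames)
{
    assert(frames > 0);
    return (double)(uint32_t)(cur_seq - prev_seq) / (double)frames;
}
```

For example, samples of 1000 and 1250 taken 25 frames apart give 10
requests per frame. NextRequest(dpy) is the number of the next request
while cookie.sequence is the number of the last one sent, but that
constant offset cancels out in the difference, so the rates stay
comparable.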

> Anyway, thanks for doing this testing, and for giving me some numbers
> that make me happy. :-) Have you shared your results with raster yet? If
> they don't make him happy too I'll have to have a "talk" with him. ;-)

I haven't talked to raster about these tests yet. First step: compile an
optimized XCB, to see which XCB function is the most time-consuming.

After that, I'll try to compare Xlib and XCB. Then I'll post more numbers
:D

Thank you for all that information.

Vincent