exa i965 performance
cworth at cworth.org
Tue Oct 2 09:21:23 PDT 2007
[pulling the xorg list in on this reply]
On Sun, 30 Sep 2007 12:09:57 +0800, Zhenyu Wang wrote:
> Hi, Carl, How are you?:)
I'm good, thanks! And you?
> Nice to see your post about render_bench on i965 and I've noticed
> your work on "ring-torture" branch (yeah, nice name!),
Thanks. On that branch all the current state changes are made through
write commands in the ring, allowing the existing i830WaitSync calls
to be removed. But that's not enough to actually help performance yet,
(not to mention that it would kill performance to have to shove this
much data into the ring for every composite operation). So performance
with this branch is marginally better, but not significantly.
> it did something
> same as I pushed on xf86-video-intel "exa" branch, like kernel programs
> preload. I've also tried to remove some sync ops in my branch and didn't
> hurt rendering results.
Great. I'll go look at that branch.
> The big problem seems we're lacking state tracker for composite operations,
> and from spec some pipeline operations do hurt performance.
Definitely. There's a lot of redundant setup happening in
prepare_composite, (even on my ring-torture branch). And since the
current code is reusing a tiny vertex buffer for every composite
operation, it's having to flush the graphics state on every operation
to force the new buffer contents to be read. It's those (and similar)
flushes that are killing performance.
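As a rough illustration of the point about the tiny vertex buffer, here is a toy model in C. It is not driver code: the function name, the vertices-per-composite constant, and the buffer sizes are all made up for the sketch. It only counts how often a buffer would have to be flushed before its contents could be overwritten.

```c
#include <assert.h>

/* Hypothetical: each composite operation emits 3 vertices. */
#define VERTS_PER_COMPOSITE 3u

/*
 * Count the flushes needed to push n_ops composite operations through a
 * vertex buffer that holds capacity_verts vertices. The buffer can only
 * be overwritten after a flush, so a flush is counted whenever the next
 * operation would not fit in the remaining space.
 */
unsigned count_flushes(unsigned n_ops, unsigned capacity_verts)
{
    unsigned used = 0, flushes = 0;

    assert(capacity_verts >= VERTS_PER_COMPOSITE);
    for (unsigned i = 0; i < n_ops; i++) {
        if (used + VERTS_PER_COMPOSITE > capacity_verts) {
            flushes++;   /* old contents must be consumed first */
            used = 0;
        }
        used += VERTS_PER_COMPOSITE;
    }
    return flushes;
}
```

In this model a buffer sized for a single operation needs a flush between every pair of operations, while a buffer sized for 100 operations needs one flush per 100.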
> What I also
> tried was to use large vertex buffer, and submit multiple primitives
> in one time, but as expected rendering caused artifacts.
This is what will have to happen to get good performance I think. I
don't understand why you expect artifacts with this approach.
It occurs to me that one thing that might make sense is to store
multiple glyphs onto a single pixmap to reduce the amount of state
changes from one composite operation to the next. (If the same source
surface state can be used for a span, then the only thing that needs
to happen from one glyph to the next is new vertices---and if those
are just streamed through a buffer, then the hardware should be able
to crank through things at a great pace.)
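A minimal sketch of that idea, again as a toy model rather than anything from the driver: if each glyph is tagged with an identifier for the pixmap it lives on (a hypothetical `src_ids` array here), the source surface state only needs to be re-emitted when that identifier changes from one glyph to the next.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many times the source surface state would be emitted when
 * drawing n glyphs in order. src_ids[i] is a hypothetical identifier for
 * the pixmap that holds glyph i. State is emitted for the first glyph and
 * again whenever the source pixmap differs from the previous glyph's.
 */
unsigned count_source_changes(const int *src_ids, size_t n)
{
    unsigned changes = 0;

    assert(src_ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || src_ids[i] != src_ids[i - 1])
            changes++;
    }
    return changes;
}
```

With one pixmap per glyph, every glyph costs a state change; with all glyphs on one pixmap, only the first does, and everything after it is just new vertices.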
I plan to report soon on a "speed of light" test to see how fast the
hardware should be capable of going. I'll make assumptions like "no 3D
rendering getting in the way" and "all glyphs on a single pixmap" to
easily eliminate all the current overhead and see how fast things can
go.
For reference, here is where things stand before that test:
EXA: 100,000 glyphs/second
NoAccel: 300,000 glyphs/second
No compositing: 1,000,000 glyphs/second
The "no compositing" number was obtained by making both
i965_prepare_composite and i965_composite return immediately. This
shows us how much software overhead there is in just getting all the
glyph data down to the driver. So 1 million glyphs/second is as fast
as the operation would go with infinitely fast hardware. I'd like to
see how close to that we can get.
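To put those three numbers on a common scale, it can help to convert throughput into time per glyph. The helper names below are invented for this sketch; the only inputs are the glyphs/second figures quoted above.

```c
#include <assert.h>

/* Microseconds spent per glyph at a given throughput. */
double usec_per_glyph(double glyphs_per_second)
{
    assert(glyphs_per_second > 0.0);
    return 1e6 / glyphs_per_second;
}

/* Fraction of the software-only ceiling reached by a configuration. */
double fraction_of_ceiling(double glyphs_per_second, double ceiling)
{
    assert(ceiling > 0.0);
    return glyphs_per_second / ceiling;
}
```

By this arithmetic, 100,000 glyphs/second is 10 microseconds per glyph, 300,000 is about 3.3, and 1,000,000 is 1. So EXA currently reaches about a tenth of the "no compositing" ceiling.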
> I'm also looking forward ttm support, as we still lack UTS/DTS
> Actually we're also looking for good perf suite for testing, current we
> uses x11perf, cairo-perf, render_bench (I did run before on i965 and happy
> with the result...) and maybe you can give us info on how to use trender?
The easiest way to use TRender is with a little "bookmarklet" in your
Firefox toolbar. I've described that approach here:
That works great for one-off testing of a particular web page, (just
visit that web page and click the bookmark button). There's another
means of running TRender which involves saving entire web pages and
automated testing, for testing many different kinds of web pages, and
for doing comparisons between different machines. It does mean you
have to collect the web pages you want to measure, so it's a bit more
work to set up.
I hope that helps.