[Mesa-dev] Introducing OpenSWR: High performance software rasterizer

Thu Oct 22 14:17:10 PDT 2015

On 22/10/15 00:43, Rowley, Timothy O wrote:
>
>> On Oct 20, 2015, at 5:58 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
>>
>> Thanks for the explanations.  It's closer now, but still a bit of gap:
>>
>> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
>> SWR create screen!
>> This processor supports AVX2.
>> --> numThreads = 3
>> 1102 frames in 5.002 seconds = 220.312 FPS
>> 1133 frames in 5.001 seconds = 226.555 FPS
>> 1130 frames in 5.002 seconds = 225.91 FPS
>> ^C
>> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
>> 1456 frames in 5 seconds = 291.2 FPS
>> 1617 frames in 5.003 seconds = 323.206 FPS
>> 1571 frames in 5.002 seconds = 314.074 FPS
>
> A bit more of an apples to apples comparison might be single-threaded llvmpipe (LP_NUM_THREADS=1) and single-threaded swr (KNOB_SINGLE_THREADED=1).  Running gloss and glxgears (another favorite “benchmark” :) ) under these conditions shows swr running a bit slower, though a little closer than your numbers.

Indeed that seems a better comparison.

$ KNOB_SINGLE_THREADED=1 ./gloss
SWR create screen!
This processor supports AVX2.
733 frames in 5.003 seconds = 146.512 FPS
787 frames in 5.004 seconds = 157.274 FPS
793 frames in 5.005 seconds = 158.442 FPS
799 frames in 5.001 seconds = 159.768 FPS
787 frames in 5.005 seconds = 157.243 FPS
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=0 ./gloss
939 frames in 5.002 seconds = 187.725 FPS
1032 frames in 5.001 seconds = 206.359 FPS
1017 frames in 5.002 seconds = 203.319 FPS
1021 frames in 5 seconds = 204.2 FPS
1039 frames in 5.002 seconds = 207.717 FPS
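
For anyone who wants to put a single number on the gap between the two
runs above, here is a minimal sketch that averages the frame-rate lines
and prints the swr/llvmpipe ratio. The names (mean_fps, SWR_LOG,
LLVMPIPE_LOG) are hypothetical and not part of Mesa or the demos; the
script only assumes the "N frames in S seconds" line format shown above.

```python
import re

# Matches the demo's report lines, e.g.
# "733 frames in 5.003 seconds = 146.512 FPS".
FRAME_LINE = re.compile(r"(\d+) frames in ([\d.]+) seconds")


def mean_fps(log: str) -> float:
    """Return total frames divided by total seconds over all report lines."""
    samples = [(int(f), float(s)) for f, s in FRAME_LINE.findall(log)]
    if not samples:
        raise ValueError("no frame-rate lines found")
    frames = sum(f for f, _ in samples)
    seconds = sum(s for _, s in samples)
    return frames / seconds


# Output copied from the two single-threaded runs above.
SWR_LOG = """
733 frames in 5.003 seconds = 146.512 FPS
787 frames in 5.004 seconds = 157.274 FPS
793 frames in 5.005 seconds = 158.442 FPS
799 frames in 5.001 seconds = 159.768 FPS
787 frames in 5.005 seconds = 157.243 FPS
"""

LLVMPIPE_LOG = """
939 frames in 5.002 seconds = 187.725 FPS
1032 frames in 5.001 seconds = 206.359 FPS
1017 frames in 5.002 seconds = 203.319 FPS
1021 frames in 5 seconds = 204.2 FPS
1039 frames in 5.002 seconds = 207.717 FPS
"""

swr = mean_fps(SWR_LOG)
llvmpipe = mean_fps(LLVMPIPE_LOG)
print(f"swr:      {swr:.1f} FPS")
print(f"llvmpipe: {llvmpipe:.1f} FPS")
print(f"swr / llvmpipe = {swr / llvmpipe:.2f}")
```

The first sample of each run is included in the average even though it
looks like warm-up; dropping it would be a reasonable alternative.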

> Examining performance traces, we think swr’s concept of hot-tiles, the working memory representation of the render target, and the associated load/store functions contribute to most of the difference.  We might be able to optimize those conversions; additionally fast clear would help these demos.  For larger workloads this small per-frame cost doesn’t really affect the performance.

> These initial observations from you and others regarding performance have been interesting.  Our performance work has been with large workloads on high core count configurations, where while some of the decisions such as a dedicated core for the application/API might have cost performance a bit, the percentage is much less than on the dual and quad core processors.  We’ll look into some changes/tuning that will benefit both extremes, though we might have to end up conceding that llvmpipe will be faster at glxgears. :-)

I don't care for gears (it practically measures present/blit rate),
but gloss, despite being simple, is sensitive to texturing performance.

>> Final thoughts: I understand this project has its own history, but I echo what Roland said -- it would be nice to unify with llvmpipe at some point, in some way or fashion.  Our (VMware's) focus has been desktop composition, but there's no reason why a single SW renderer can't satisfy both ends of the spectrum, especially for JIT-enabled renderers, since they can emit at runtime the code most suited for the workload.
>
> We would be happy for someone to take some of the ideas from swr to speed up llvmpipe, but for now our development will continue on the swr core and driver.  We’re not planning on replacing llvmpipe - its intent of working on any architecture is admirable.  In the ideal world the solution would be something that combines the best traits of both rasterizers, but at this point the shortest path to having a performant solution for our customers is with swr.

Fair enough.

They do share a lot already: Mesa, the gallium state tracker, and gallivm.
If further development in openswr is planned, it might require jumping
through a few hoops, but I think it's worth figuring out what it would
take to get this merged into master so that, whenever there are
interface changes, openswr won't get the short end of the stick.

>> That said, it's really nice seeing Mesa and Gallium enabling this sort of experiments with SW rendering.
>
> Yes, we were quite happy with how fast we were able to get a new driver functioning with gallium.  The major thing slowing us was the documentation, which is not uniform in coverage.  There was a lot of reading other drivers’ source to figure out how things were supposed to work.

Yes, that's a fair comment.

Jose