[Mesa-dev] Introducing OpenSWR: High performance software rasterizer

Wed Oct 21 16:43:09 PDT 2015

> On Oct 20, 2015, at 5:58 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
> 
> Thanks for the explanations.  It's closer now, but still a bit of gap:
> 
> $ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
> SWR create screen!
> This processor supports AVX2.
> --> numThreads = 3
> 1102 frames in 5.002 seconds = 220.312 FPS
> 1133 frames in 5.001 seconds = 226.555 FPS
> 1130 frames in 5.002 seconds = 225.91 FPS
> ^C
> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
> 1456 frames in 5 seconds = 291.2 FPS
> 1617 frames in 5.003 seconds = 323.206 FPS
> 1571 frames in 5.002 seconds = 314.074 FPS

A more apples-to-apples comparison might be single-threaded llvmpipe (LP_NUM_THREADS=1) against single-threaded swr (KNOB_SINGLE_THREADED=1).  Running gloss and glxgears (another favorite “benchmark” :) ) under these conditions shows swr running a bit slower, though a little closer than in your numbers.  From examining performance traces, we think swr’s hot-tiles (the working memory representation of the render target) and the associated load/store functions account for most of the difference.  We might be able to optimize those conversions; a fast clear would also help these demos.  For larger workloads this small per-frame cost doesn’t really affect performance.
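For reference, here is a quick sketch that puts a number on the gap in the quoted runs.  It only averages the three FPS samples pasted above for each driver; those runs were multi-threaded (swr reported numThreads = 3, llvmpipe ran with LP_NUM_THREADS=2), so it says nothing about the single-threaded comparison suggested here.  The variable names are just for illustration.

```python
from statistics import mean

# FPS samples copied from the quoted runs above (three 5-second samples each).
swr_fps = [220.312, 226.555, 225.91]        # KNOB_MAX_THREADS_PER_CORE=0 ./gloss
llvmpipe_fps = [291.2, 323.206, 314.074]    # GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss

swr_mean = mean(swr_fps)
llvmpipe_mean = mean(llvmpipe_fps)

# Ratio > 1 means llvmpipe was faster on this demo.
ratio = llvmpipe_mean / swr_mean

print(f"swr mean:       {swr_mean:.1f} FPS")
print(f"llvmpipe mean:  {llvmpipe_mean:.1f} FPS")
print(f"llvmpipe / swr: {ratio:.2f}x")
```

With only three samples per driver this is a rough indicator, not a proper measurement.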

> One final question: you said that one thread is reserved for the API, but I see all threads (with top `H`) maxing up the CPU. So if the thread reserved for the API is not doing vertex/fragment processing, then what is it using 100% of a CPU thread for?

With a trivial application main loop and light API usage, the API thread ends up spending most of its time waiting for the other threads to finish work.

These initial observations from you and others regarding performance have been interesting.  Our performance work has focused on large workloads on high core count configurations.  There, some of our decisions, such as dedicating a core to the application/API, might cost a bit of performance, but the percentage is much smaller than on dual and quad core processors.  We’ll look into changes/tuning that will benefit both extremes, though we might end up having to concede that llvmpipe will be faster at glxgears. :-)

> Final thoughts: I understand this project has its own history, but I echo what Roland said -- it would be nice to unify with llvmpipe at one point, in some way or fashion.  Our (VMware's) focus has been desktop composition, but there's no reason why a single SW renderer can't satisfy both ends of the spectrum, especially for JIT enable renderers, since they can emit at runtime the code most suited for the workload.

We would be happy for someone to take some of the ideas from swr to speed up llvmpipe, but for now our development will continue on the swr core and driver.  We’re not planning on replacing llvmpipe - its intent of working on any architecture is admirable.  In an ideal world the solution would be something that combines the best traits of both rasterizers, but at this point the shortest path to a performant solution for our customers is with swr.

> That said, it's really nice seeing Mesa and Gallium enabling this sort of experiments with SW rendering.

Yes, we were quite happy with how fast we were able to get a new driver functioning with gallium.  The major thing slowing us down was the documentation, which is not uniform in coverage.  We did a lot of reading of other drivers’ source to figure out how things were supposed to work.

-Tim