[Mesa-dev] Introducing OpenSWR: High performance software rasterizer

Tue Oct 20 15:58:43 PDT 2015

On 20/10/15 23:16, Rowley, Timothy O wrote:
>
>> On Oct 20, 2015, at 4:23 PM, Jose Fonseca <jfonseca at vmware.com> wrote:
>>
>> I tried it on my i7-5500U, but I run into two issues:
>>
>> - OpenSWR seems to only use 2 threads (even though my system supports 4 threads)
>>
>> - and even when I compensate by limiting llvmpipe to 2 rasterizer threads, OpenSWR still only gets half the framerate of llvmpipe with the "gloss" Mesa demo (a very simple texturing demo):
>>
>> $ ./gloss
>> SWR create screen!
>> This processor supports AVX2.
>> 720 frames in 5.004 seconds = 143.885 FPS
>> 737 frames in 5.005 seconds = 147.253 FPS
>> 729 frames in 5.004 seconds = 145.683 FPS
>> 732 frames in 5.002 seconds = 146.341 FPS
>> 735 frames in 5.001 seconds = 146.971 FPS
>> [...]
>> $ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
>> 1539 frames in 5.002 seconds = 307.677 FPS
>> 1719 frames in 5 seconds = 343.8 FPS
>> 1780 frames in 5.002 seconds = 355.858 FPS
>> 1497 frames in 5.002 seconds = 299.28 FPS
>> 1548 frames in 5.001 seconds = 309.538 FPS
>> [..]
>>
>> I see a similar ratio with a more complex workload, using the trace from:
>>
>>   http://people.freedesktop.org/~jrfonseca/traces/furmark-1.8.2-svga.trace
>>
>> (you'll need to download https://github.com/apitrace/apitrace and build)
>>
>> My questions are:
>>
>> - Is this the expected performance when texturing is used? Or is there something wrong with my setup?
>>
>
> Two things are happening here to cause the behavior you’re seeing.  First, OpenSWR only creates as many threads as there are physical cores.  On our workloads, going beyond that and using hyperthreads gave a minimal gain or even a loss in performance.  Second, one thread is reserved for the API thread, which does not participate in either frontend (geometry) or backend (fragment) work.  Thus on your two-core 5500U OpenSWR only had one raster thread versus llvmpipe’s two, giving half the performance.  If you want to switch OpenSWR to using hyperthreads, set the environment variable KNOB_MAX_THREADS_PER_CORE=0.
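
The thread accounting described above can be sketched as follows. This is only an illustrative model of the explanation, not OpenSWR code: the function name and its parameters are hypothetical, and it assumes the driver takes every hardware thread it is allowed to use and then sets one aside for the API thread. Single-core machines are not covered by the explanation and are not modelled here.

```python
def swr_raster_threads(physical_cores, hw_threads_per_core, use_hyperthreads):
    """Estimate raster worker threads from the accounting described above.

    Hypothetical helper: models the explanation in this email, not the
    actual OpenSWR implementation.
    """
    # By default only one thread per physical core is created; hyperthreads
    # are used only when explicitly enabled.
    per_core = hw_threads_per_core if use_hyperthreads else 1
    total_threads = physical_cores * per_core
    # One thread is reserved for the API thread and does no raster work.
    return total_threads - 1


# i7-5500U: 2 physical cores, 2 hardware threads per core.
print(swr_raster_threads(2, 2, use_hyperthreads=False))  # 1
print(swr_raster_threads(2, 2, use_hyperthreads=True))   # 3
```

For the two-core, four-thread i7-5500U this gives 1 raster thread by default and 3 with hyperthreads enabled, which appears consistent with the "numThreads = 3" line in the run below.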

Thanks for the explanations.  It's closer now, but there is still a bit of a gap:

$ KNOB_MAX_THREADS_PER_CORE=0 ./gloss
SWR create screen!
This processor supports AVX2.
--> numThreads = 3
1102 frames in 5.002 seconds = 220.312 FPS
1133 frames in 5.001 seconds = 226.555 FPS
1130 frames in 5.002 seconds = 225.91 FPS
^C
$ GALLIUM_DRIVER=llvmpipe LP_NUM_THREADS=2 ./gloss
1456 frames in 5 seconds = 291.2 FPS
1617 frames in 5.003 seconds = 323.206 FPS
1571 frames in 5.002 seconds = 314.074 FPS
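
To put a number on the remaining gap, the FPS figures from the two runs above can be averaged and compared. This is a minimal sketch using only the values printed above; the variable names are arbitrary, and three samples per driver is a very small basis, so treat the result as a rough indication only.

```python
from statistics import mean

# FPS values copied from the two runs above.
swr_fps = [220.312, 226.555, 225.91]        # KNOB_MAX_THREADS_PER_CORE=0
llvmpipe_fps = [291.2, 323.206, 314.074]    # LP_NUM_THREADS=2

swr_avg = mean(swr_fps)
llvmpipe_avg = mean(llvmpipe_fps)
ratio = swr_avg / llvmpipe_avg

print(f"OpenSWR  average: {swr_avg:.1f} FPS")
print(f"llvmpipe average: {llvmpipe_avg:.1f} FPS")
print(f"OpenSWR / llvmpipe: {ratio:.2f}")
```

On these samples OpenSWR reaches roughly 72% of llvmpipe's framerate, up from about half in the earlier run.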

One final question: you said that one thread is reserved for the API, 
but I see all threads (with top `H`) maxing out the CPU.  So if the 
thread reserved for the API is not doing vertex/fragment processing, 
then what is it using 100% of a CPU thread for?

Final thoughts: I understand this project has its own history, but I 
echo what Roland said -- it would be nice to unify with llvmpipe at some 
point, in one way or another.  Our (VMware's) focus has been desktop 
composition, but there's no reason why a single SW renderer can't 
satisfy both ends of the spectrum, especially for JIT-enabled renderers, 
since they can emit at runtime the code best suited to the workload.

That said, it's really nice seeing Mesa and Gallium enabling this sort 
of experiment with SW rendering.

Jose