[Mesa-dev] Improving ralloc performance for the GLSL compiler

Connor Abbott cwabbott0 at gmail.com
Tue Aug 30 20:14:02 UTC 2016


On Tue, Aug 30, 2016 at 10:06 AM, Marek Olšák <maraeo at gmail.com> wrote:
> On Tue, Aug 30, 2016 at 3:21 PM, Eero Tamminen
> <eero.t.tamminen at intel.com> wrote:
>> Hi,
>>
>>
>> On 30.08.2016 12:51, Marek Olšák wrote:
>>>
>>> Recently I discovered that our GLSL compiler spends a lot of time in
>>> rzalloc_size, so I looked at possible options to optimize that. It's
>>> worth noting that too many existing allocations slow down subsequent
>>> malloc calls, which in turn slows down the GLSL compiler. When I kept
>>> 5 instances of LLVMContext alive between compilations (I wanted to
>>> reuse them), the GLSL compiler slowed down. That shows that the GLSL
>>> compiler's performance is too dependent on the size and complexity of
>>> the heap.
>>>
>>> So I decided to write my own linear allocator and then compared it
>>> with jemalloc preloaded by LD, and jemalloc linked statically and used
>>> by ralloc only.
>>>
>>> The test was shader-db using AMD's shader collection. The command line
>>> was:
>>> time GALLIUM_NOOP=1 shader-db/run shaders
>>> The noop driver ensures the compilation process ends with TGSI.
>>>
>>>
>>> Default Mesa:
>>> real    0m58.343s
>>> user    3m48.828s
>>> sys    0m0.760s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real    0m48.550s (17% less time)
>>> user    3m9.544s
>>> sys    0m1.700s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free}, with Mesa linked against
>>> my libmesa_jemalloc_pic.a:
>>> real    0m49.580s (15% less time)
>>> user    3m14.452s
>>> sys    0m0.996s
>>>
>>> Ralloc using my own linear allocator that allocates out of 32KB
>>> buffers for allocations of 512 bytes and smaller:
>>> real    0m46.521s (20% less time)
>>> user    3m1.304s
>>> sys    0m1.740s
>>>
>>>
>>> Now let's test complete compilation down to GCN bytecode:
>>>
>>> Default Mesa:
>>> real    1m57.634s
>>> user    7m41.692s
>>> sys    0m1.824s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real    1m42.604s (13% less time)
>>> user    6m39.776s
>>> sys    0m3.828s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free}, with Mesa linked against
>>> my libmesa_jemalloc_pic.a:
>>> real    1m44.413s (11% less time)
>>> user    6m48.808s
>>> sys    0m2.480s
>>>
>>> Ralloc using my own linear allocator:
>>> real    1m40.486s (14.6% less time)
>>> user    6m34.456s
>>> sys    0m2.224s
>>>
>>>
>>> The linear allocator that I wrote has very high memory usage due to
>>> the inability to free 32KB blocks while those blocks still have at
>>> least one live allocation. The workaround would be to do realloc()
>>> when changing a ralloc parent in order to "defragment" the memory,
>>> but that's more involved.
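
Just to illustrate for anyone who hasn't seen the patch, a scheme like the
one described above could look roughly like this. This is a from-scratch
sketch, not the actual code; the names (linear_alloc/linear_free), the single
global "current" block, and the per-allocation header are mine, and the real
thing would tie blocks to a ralloc context and handle alignment and threading
properly:

#include <stdlib.h>
#include <stddef.h>

#define LINEAR_BLOCK_SIZE  (32 * 1024)   /* payload capacity per block */
#define LINEAR_MAX_SIZE    512           /* larger requests bypass the blocks */

/* One block; it can only be freed once every allocation made from it
 * has been freed, which is what causes the high memory usage. */
struct linear_block {
   size_t offset;        /* bump pointer into data[] */
   size_t live_allocs;   /* allocations not yet freed */
   char data[];
};

/* Per-allocation header so linear_free() can find the owning block.
 * block == NULL means the allocation came straight from malloc. */
struct linear_header {
   struct linear_block *block;
};

/* A single global block for brevity; not thread-safe. */
static struct linear_block *cur;

void *
linear_alloc(size_t size)
{
   /* Round up so the next allocation stays 8-byte aligned. */
   size_t total = (sizeof(struct linear_header) + size + 7) & ~(size_t)7;
   struct linear_header *hdr;

   /* Large requests go straight to malloc. */
   if (size > LINEAR_MAX_SIZE) {
      hdr = malloc(total);
      if (!hdr)
         return NULL;
      hdr->block = NULL;
      return hdr + 1;
   }

   /* Start a new 32KB block if the current one is full (or missing). */
   if (!cur || cur->offset + total > LINEAR_BLOCK_SIZE) {
      cur = calloc(1, sizeof(*cur) + LINEAR_BLOCK_SIZE);
      if (!cur)
         return NULL;
   }

   hdr = (struct linear_header *)(cur->data + cur->offset);
   hdr->block = cur;
   cur->offset += total;
   cur->live_allocs++;
   return hdr + 1;
}

void
linear_free(void *ptr)
{
   struct linear_header *hdr;
   struct linear_block *block;

   if (!ptr)
      return;
   hdr = (struct linear_header *)ptr - 1;
   block = hdr->block;

   if (!block) {            /* large allocation, plain malloc */
      free(hdr);
      return;
   }

   /* Nothing inside the block is reclaimed individually; the whole
    * 32KB block is only released once its last allocation dies. */
   if (--block->live_allocs == 0) {
      if (block == cur)
         cur = NULL;
      free(block);
   }
}

The last comment is the key point: a single long-lived allocation pins the
whole 32KB block, which is exactly the memory-usage problem described above.
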
>>>
>>> I don't know much about glibc, but it's hard to believe that glibc
>>> people have been purposely ignoring jemalloc for so long. There must
>>> be some anti-performance politics going on, but enough speculation.
>>
>>
>> Different allocators have different trade-offs:
>> * single-core speed
>> * multi-core speed
>> * memory usage
>> * long time memory fragmentation
>> * alloc debugging support & robustness
>>
>> And they can behave differently with different allocation patterns and sizes.
>> Jemalloc being better than ptmalloc in one test doesn't necessarily mean
>> that it's better in another.
>>
>> Here's some discussion on the subject:
>>         https://lwn.net/Articles/273084/
>>
>> The algorithms used and some of the trade-offs are described in the
>> allocators' source code.
>>
>>
>>> If we don't care about memory usage, let's use my allocator.
>>
>>
>> Modern games are the most demanding use case for the compiler and use the
>> largest number of shaders, but almost all (>90%) Steam games are *still*
>> 32-bit.  Before the compiler memory-usage optimizations by Ian & Co., several
>> of them crashed because they ran out of 32-bit address space.
>
> Did the games crash because i965 was using GLSL IR as its main
> compiler IR? Or was the problem that GLSL IR hadn't been released at
> link time, because the driver had to keep all of it for compiling
> shader variants? The memory usage issue might have been i965-specific
> and not relevant right now.
>
> Note that Gallium releases GLSL IR in glLinkProgram and other drivers
> should do that too. If some drivers don't, they are going to have
> memory usage issues either way.

I believe that at the time, i965 had to keep GLSL IR around after
linking to handle shader variants. Nowadays, we release the GLSL IR at
link time and only hang onto the NIR for variants. NIR is inherently a
lot more compact than GLSL IR since it uses a lot fewer variables and
variable dereferences (they're mostly replaced by SSA values during
optimization). It's not as compact as TGSI, since it's designed to be
mutated/optimized, but it could be made a lot smaller with a little
tuning. Also, Ian did a lot of work to make GLSL's memory footprint
smaller, which still helps at link time.

>
> Marek
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

