[Mesa-dev] Improving ralloc performance for the GLSL compiler

Tue Aug 30 19:52:02 UTC 2016

On Tue, Aug 30, 2016 at 4:06 PM, Marek Olšák <maraeo at gmail.com> wrote:
> On Tue, Aug 30, 2016 at 3:21 PM, Eero Tamminen
> <eero.t.tamminen at intel.com> wrote:
>> Hi,
>>
>>
>> On 30.08.2016 12:51, Marek Olšák wrote:
>>>
>>> Recently I discovered that our GLSL compiler spends a lot of time in
>>> rzalloc_size, so I looked at possible options to optimize that. It's
>>> worth noting that too many existing allocations slow down subsequent
>>> malloc calls, which in turn slows down the GLSL compiler. When I kept
>>> 5 instances of LLVMContext alive between compilations (I wanted to
>>> reuse them), the GLSL compiler slowed down. That shows that the GLSL
>>> compiler performance is too dependent on the size and complexity of
>>> the heap.
>>>
>>> So I decided to write my own linear allocator and then compared it
>>> with jemalloc preloaded via LD_PRELOAD, and with jemalloc linked
>>> statically and used by ralloc only.
>>>
>>> The test was shader-db using AMD's shader collection. The command line
>>> was:
>>> time GALLIUM_NOOP=1 shader-db/run shaders
>>> The noop driver ensures the compilation process ends with TGSI.
>>>
>>>
>>> Default Mesa:
>>> real    0m58.343s
>>> user    3m48.828s
>>> sys    0m0.760s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real    0m48.550s (17% less time)
>>> user    3m9.544s
>>> sys    0m1.700s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
>>> my libmesa_jemalloc_pic.a:
>>> real    0m49.580s (15% less time)
>>> user    3m14.452s
>>> sys    0m0.996s
>>>
>>> Ralloc using my own linear allocator, which serves allocations of
>>> 512 bytes and smaller out of 32KB buffers:
>>> real    0m46.521s (20% less time)
>>> user    3m1.304s
>>> sys    0m1.740s
>>>
>>>
>>> Now let's test complete compilation down to GCN bytecode:
>>>
>>> Default Mesa:
>>> real    1m57.634s
>>> user    7m41.692s
>>> sys    0m1.824s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real    1m42.604s (13% less time)
>>> user    6m39.776s
>>> sys    0m3.828s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
>>> my libmesa_jemalloc_pic.a:
>>> real    1m44.413s (11% less time)
>>> user    6m48.808s
>>> sys    0m2.480s
>>>
>>> Ralloc using my own linear allocator:
>>> real    1m40.486s (14.6% less time)
>>> user    6m34.456s
>>> sys    0m2.224s
>>>
>>>
>>> The linear allocator that I wrote has very high memory usage because
>>> it cannot free a 32KB block while that block holds at least one live
>>> allocation. The workaround would be to do realloc() when changing a
>>> ralloc parent in order to "defragment" the memory, but that's more
>>> involved.
>>>
>>> I don't know much about glibc, but it's hard to believe that glibc
>>> people have been purposely ignoring jemalloc for so long. There must
>>> be some anti-performance politics going on, but enough
>>> speculation.
>>
>>
>> Different allocators have different trade-offs:
>> * single-core speed
>> * multi-core speed
>> * memory usage
>> * long-term memory fragmentation
>> * alloc debugging support & robustness
>>
>> And they can behave differently with different allocation patterns and sizes.
>> Jemalloc being better in one test than ptmalloc doesn't necessarily mean
>> that it's better in another.
>>
>> Here's some discussion on the subject:
>>         https://lwn.net/Articles/273084/
>>
>> The algorithms used and some of the trade-offs are described in the
>> allocators' source code.
>>
>>
>>> If we don't care about memory usage, let's use my allocator.
>>
>>
>> Modern games are the most demanding use case for the compiler and use the
>> largest number of shaders, but almost all (>90%) Steam games are *still*
>> 32-bit. Before the compiler memory usage optimizations by Ian & Co, several
>> of them crashed because they ran out of 32-bit address space.
>
> Did the games crash because i965 was using GLSL IR as its main
> compiler IR? Or was the problem that GLSL IR hadn't been released at
> link time, because the driver had to keep all of it for compiling
> shader variants? The memory usage issue might have been i965-specific
> and not relevant right now.
>
> Note that Gallium releases GLSL IR in glLinkProgram and other drivers
> should do that too. If some drivers don't, they are going to have
> memory usage issues either way.

Just to clarify, I don't care that much about memory usage as long as
the trade-off is worth it, but I understand there are people who care,
e.g. drivers that don't release GLSL IR or small devices (ARM,
embedded).

If I choose to finish up my allocator and put it below ralloc (which
would be easier than importing jemalloc), I will make it conditional
depending on the driver and CPU architecture.

Marek
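
For readers following the thread, here is a minimal sketch of the kind of
linear allocator discussed above: requests of 512 bytes or less are carved
out of 32KB buffers, larger ones fall back to calloc(), and a buffer can only
be released once every allocation inside it has been freed. It is not the
allocator from the post and not Mesa's ralloc API. All names (lin_ctx,
lin_zalloc, lin_free, lin_retire_current), the 16-byte rounding, and the
per-allocation header are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LIN_BUFFER_SIZE (32 * 1024) /* 32KB buffers, as in the post */
#define LIN_MAX_ALLOC   512         /* larger requests bypass the buffers */
#define LIN_ALIGN       16          /* assumed rounding for each allocation */

/* Hypothetical types; nothing here is taken from Mesa. */
struct lin_buffer {
   size_t offset;   /* next unused byte in data[] */
   unsigned live;   /* allocations in this buffer not yet freed */
   unsigned char data[LIN_BUFFER_SIZE];
};

/* Placed in front of every allocation so lin_free can find its buffer. */
union lin_header {
   struct lin_buffer *buffer;   /* NULL means "came from calloc" */
   unsigned char pad[LIN_ALIGN];
};

struct lin_ctx {
   struct lin_buffer *current;  /* buffer new allocations are carved from */
   unsigned num_buffers;        /* buffers currently held, for illustration */
};

static size_t align_up(size_t n)
{
   return (n + LIN_ALIGN - 1) & ~(size_t)(LIN_ALIGN - 1);
}

/* Stop allocating from the current buffer; free it only if it is empty. */
void lin_retire_current(struct lin_ctx *ctx)
{
   if (ctx->current && ctx->current->live == 0) {
      free(ctx->current);
      ctx->num_buffers--;
   }
   ctx->current = NULL;
}

/* Returns zeroed memory, mirroring what rzalloc_size promises. */
void *lin_zalloc(struct lin_ctx *ctx, size_t size)
{
   size_t total = sizeof(union lin_header) + align_up(size);
   union lin_header *hdr;

   if (size > LIN_MAX_ALLOC) {
      hdr = calloc(1, total);
      if (!hdr)
         return NULL;
      hdr->buffer = NULL;
      return hdr + 1;
   }

   struct lin_buffer *buf = ctx->current;
   if (!buf || buf->offset + total > LIN_BUFFER_SIZE) {
      lin_retire_current(ctx);
      buf = calloc(1, sizeof(*buf)); /* zeroed once, never reused */
      if (!buf)
         return NULL;
      ctx->current = buf;
      ctx->num_buffers++;
   }

   hdr = (union lin_header *)(buf->data + buf->offset);
   hdr->buffer = buf;
   buf->offset += total;
   buf->live++;
   return hdr + 1;
}

void lin_free(struct lin_ctx *ctx, void *ptr)
{
   if (!ptr)
      return;

   union lin_header *hdr = (union lin_header *)ptr - 1;
   struct lin_buffer *buf = hdr->buffer;
   if (!buf) {
      free(hdr);
      return;
   }

   assert(buf->live > 0);
   buf->live--;
   /* A retired buffer goes away only with its last live allocation. */
   if (buf->live == 0 && buf != ctx->current) {
      free(buf);
      ctx->num_buffers--;
   }
}
```

In this sketch a single surviving 100-byte allocation keeps its whole 32KB
buffer alive, which is the memory usage problem described in the quoted
message. Individual frees never return space to a buffer, so the only way to
reclaim it would be to copy survivors elsewhere, along the lines of the
realloc()-on-reparent idea mentioned above.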