[Mesa-dev] Improving ralloc performance for the GLSL compiler
Eero Tamminen
eero.t.tamminen at intel.com
Tue Aug 30 13:21:02 UTC 2016
Hi,
On 30.08.2016 12:51, Marek Olšák wrote:
> Recently I discovered that our GLSL compiler spends a lot of time in
> rzalloc_size, so I looked at possible options to optimize that. It's
> worth noting that too many existing allocations slow down subsequent
> malloc calls, which in turn slows down the GLSL compiler. When I kept
> 5 instances of LLVMContext alive between compilations (I wanted to
> reuse them), the GLSL compiler slowed down. That shows that the GLSL
> compiler performance is too dependent on the size and complexity of
> the heap.
>
> So I decided to write my own linear allocator and then compared it
> with jemalloc preloaded by LD, and jemalloc linked statically and used
> by ralloc only.
>
> The test was shader-db using AMD's shader collection. The command line was:
> time GALLIUM_NOOP=1 shader-db/run shaders
> The noop driver ensures the compilation process ends with TGSI.
>
>
> Default Mesa:
> real 0m58.343s
> user 3m48.828s
> sys 0m0.760s
>
> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
> real 0m48.550s (17% less time)
> user 3m9.544s
> sys 0m1.700s
>
> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
> my libmesa_jemalloc_pic.a:
> real 0m49.580s (15% less time)
> user 3m14.452s
> sys 0m0.996s
>
> Ralloc using my own linear allocator that allocates out of 32KB
> buffers for 512b and smaller allocations:
> real 0m46.521s (20% less time)
> user 3m1.304s
> sys 0m1.740s
>
>
> Now let's test complete compilation down to GCN bytecode:
>
> Default Mesa:
> real 1m57.634s
> user 7m41.692s
> sys 0m1.824s
>
> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
> real 1m42.604s (13% less time)
> user 6m39.776s
> sys 0m3.828s
>
> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
> my libmesa_jemalloc_pic.a:
> real 1m44.413s (11% less time)
> user 6m48.808s
> sys 0m2.480s
>
> Ralloc using my own linear allocator:
> real 1m40.486s (14.6% less time)
> user 6m34.456s
> sys 0m2.224s
>
>
> The linear allocator that I wrote has a very high memory usage due to
> the inability to free 32KB blocks if those blocks have at least one
> living allocation. The workaround would be to do realloc() when
> changing a ralloc parent in order to "defragment" the memory, but
> that's more involved.
>
> I don't know much about glibc, but it's hard to believe that glibc
> people have been purposely ignoring jemalloc for so long. There must
> be some anti-performance politics going on, but enough of
> speculations.
Different allocators have different trade-offs:
* single-core speed
* multi-core speed
* memory usage
* long-term memory fragmentation
* alloc debugging support & robustness
And they can behave differently with different allocation patterns and
sizes. Jemalloc being better than ptmalloc in one test doesn't
necessarily mean that it's better in another.
Here's some discussion on the subject:
https://lwn.net/Articles/273084/
The algorithms used and some of their trade-offs are described in the
allocators' source code.
> If we don't care about memory usage, let's use my allocator.
Modern games are the most demanding use case for the compiler and use
the largest number of shaders, but almost all (>90%) Steam games are
*still* 32-bit. Before the compiler memory usage optimizations by Ian &
Co, several of them crashed because they ran out of 32-bit address space.
(DOTA2 is thankfully 64-bit nowadays, so it no longer crashes because
of that.)
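For reference, the scheme quoted above (32KB buffers serving allocations
of 512 bytes or less) could look roughly like the sketch below. This is
not Marek's patch: the linear_* names, the 8-byte alignment and the
per-block live counter are assumptions made for illustration. It mainly
shows where the memory cost comes from, since a block can be returned to
the system only after its last allocation has been released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LINEAR_BLOCK_SIZE (32 * 1024) /* buffer size from the quoted text */
#define LINEAR_MAX_ALLOC  512         /* larger requests are not served here */
#define LINEAR_ALIGN      8           /* assumed alignment, not from the thread */

/* Header stored at the start of every 32KB buffer (hypothetical layout). */
struct linear_block {
   struct linear_block *next;
   size_t offset; /* next free byte, counted from the block start */
   size_t live;   /* allocations in this block that were not freed yet */
};

struct linear_ctx {
   struct linear_block *blocks; /* newest block first */
   size_t num_blocks;
};

static size_t align_up(size_t n)
{
   return (n + LINEAR_ALIGN - 1) & ~(size_t)(LINEAR_ALIGN - 1);
}

/* Returns NULL for sizes this allocator does not handle, or on OOM.
 * A real caller would fall back to malloc() for the large sizes. */
static void *linear_alloc(struct linear_ctx *ctx, size_t size)
{
   struct linear_block *b = ctx->blocks;

   if (size == 0 || size > LINEAR_MAX_ALLOC)
      return NULL;
   size = align_up(size);

   /* Start a new buffer when there is none or the current one is full. */
   if (!b || b->offset + size > LINEAR_BLOCK_SIZE) {
      b = malloc(LINEAR_BLOCK_SIZE);
      if (!b)
         return NULL;
      b->next = ctx->blocks;
      b->offset = align_up(sizeof(*b));
      b->live = 0;
      ctx->blocks = b;
      ctx->num_blocks++;
   }

   void *p = (unsigned char *)b + b->offset;
   b->offset += size; /* bump the pointer; freed space is never reused */
   b->live++;
   return p;
}

/* Releasing an allocation only decrements the counter of its block.
 * The 32KB buffer itself is freed when the counter drops to zero. */
static void linear_free(struct linear_ctx *ctx, void *ptr)
{
   uintptr_t addr = (uintptr_t)ptr;
   struct linear_block **link;

   for (link = &ctx->blocks; *link; link = &(*link)->next) {
      struct linear_block *b = *link;
      uintptr_t start = (uintptr_t)b;

      if (addr < start || addr >= start + LINEAR_BLOCK_SIZE)
         continue;
      assert(b->live > 0);
      if (--b->live == 0) {
         *link = b->next;
         free(b);
         ctx->num_blocks--;
      }
      return;
   }
}
```

With these numbers, 63 allocations of 512 bytes fit into one buffer, and
a single surviving one keeps the whole 32KB alive. That is the behaviour
which matters for the 32-bit address space concern.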
> If we do,
> let's import jemalloc into the Mesa tree and use it for ralloc. That
> "11% less time" spent in the shader compiler (which includes LLVM)
> would be nice to have.
I don't think the jemalloc testing above is enough; you should also:
* Test performance with 32-bit builds
* Do some memory usage comparisons
I'm not sure what the best way to track memory usage for this is, though.
From /proc you get total mapping sizes, but dirty memory usage is
typically more relevant, and you can see that in the smaps data.
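As a rough illustration of the smaps approach, the helper below sums the
"Private_Dirty:" fields of smaps-formatted text. The function name is
made up for this example, and counting only Private_Dirty (not
Shared_Dirty) is an assumption about what "dirty memory usage" should
include here.

```c
#include <assert.h>
#include <stdio.h>

/* Sum the "Private_Dirty:" fields (reported in kB) of smaps-formatted
 * text read from f. Returns -1 if f is NULL. The name and the choice of
 * field are assumptions for this sketch. */
static long sum_private_dirty_kb(FILE *f)
{
   char line[512];
   long total = 0;
   long kb;

   if (!f)
      return -1;

   while (fgets(line, sizeof(line), f)) {
      /* Lines look like "Private_Dirty:       132 kB". Any other
       * line fails to match the literal prefix and is skipped. */
      if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
         total += kb;
   }
   return total;
}
```

Passing it the result of fopen("/proc/self/smaps", "r") reports the
calling process. For a shader-db run one would instead read the smaps
file of the runner's pid while it is still alive, ideally sampling
several times since the value changes over the run.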
The easiest start could be Valgrind's Massif, as it can show heap memory
usage over time:
http://valgrind.org/docs/manual/ms-manual.html
- Eero
PS. This Valgrind tool (DHAT) can be used to optimize the efficiency of
memory allocations:
http://valgrind.org/docs/manual/dh-manual.html
It tells which parts of the allocations are hot and which are cold or
completely unused, so that things within allocations can be arranged in
the most efficient manner.