[Intel-gfx] [RFC] libdrm_intel: Rework BO allocs to avoid rounding up to bucket size

Siluvery, Arun arun.siluvery at linux.intel.com
Fri Aug 29 12:45:13 CEST 2014


On 29/08/2014 11:16, Chris Wilson wrote:
> On Fri, Aug 29, 2014 at 11:02:01AM +0100, Arun Siluvery wrote:
>> From: Garry Lancaster <garry.lancaster at intel.com>
>>
>> libdrm includes a scheme where freed buffer objects (BOs)
>> are held in a cache. This allows incoming allocation requests to be
>> serviced by re-using an old BO, instead of requiring a new
>> object to be allocated. This is a performance enhancement.
>> The cache is divided into "buckets". Each bucket holds unused
>> BOs of a pre-determined size. When a BO allocation request is seen,
>> the bucket for BOs of this size or larger is selected. Any BO
>> currently in the bucket will be re-used for the allocation. If the
>> bucket is empty, a new BO is created. However, the BO is created
>> with the size determined by the selected bucket (i.e. the size is
>> rounded up to the bucket size), rather than being created with the
>> originally requested size. This is so that when the BO is freed,
>> it can be released into the bucket and re-used by any other allocation
>> which selects the same bucket.
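
Roughly, that pre-patch behaviour boils down to the sketch below. This is
only an illustration with made-up names, not the actual libdrm code; the
real logic lives in drm_intel_gem_bo_alloc_internal() in intel_bufmgr_gem.c.

/* Illustration only: the request is rounded up to the nominal size of
 * the selected bucket, so every BO in a given bucket is the same size. */
static unsigned long
old_alloc_size(const unsigned long *bucket_sizes, int num_buckets,
               unsigned long request)
{
        int i;

        for (i = 0; i < num_buckets; i++) {
                if (bucket_sizes[i] >= request)
                        return bucket_sizes[i]; /* rounded up to bucket size */
        }
        return request; /* larger than any bucket: uncached, exact size */
}
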
>>
>> Depending upon the size of the allocation, this rounding up can
>> result in a significant wastage of memory when allocating a BO. For
>> example, a BO request just over 132K allocated during GLES context
>> creation was rounded up to the next bucket size of 160K. Such wastage
>> can be critical on devices with low memory.
>>
>> This commit reworks the BO allocation code. On a BO allocation request,
>> the BOs in the selected bucket are checked to see whether any of them
>> is large enough to fulfill the request. If none is, a new BO is
>> created, but (thanks to the new check) its size no longer needs to be
>> rounded up to the nominal size of the selected bucket.
>>
>> So, previously, all BOs in a bucket were the same size. Now the BOs in
>> a bucket can have different sizes, ranging from just above the next
>> smaller nominal bucket size up to the current nominal bucket size.
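
In other words, the lookup presumably ends up doing something like the
sketch below (again just an illustration of the idea, with made-up types,
rather than the actual patch):

/* Illustration only: a bucket can now hold BOs of mixed sizes, so each
 * cached BO is checked against the request; if none is large enough, a
 * new BO is created at the exact requested size. */
struct cached_bo {
        unsigned long size;
        struct cached_bo *next;
};

static unsigned long
new_alloc_size(struct cached_bo *bucket_head, unsigned long request,
               struct cached_bo **reuse)
{
        struct cached_bo *bo;

        for (bo = bucket_head; bo; bo = bo->next) {
                if (bo->size >= request) {
                        *reuse = bo;    /* re-use the first cached BO that fits */
                        return bo->size;
                }
        }
        *reuse = NULL;
        return request;                 /* no fit: allocate exactly what was asked */
}
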
>>
>> On a 1GB system, the following reductions in BO memory usage were seen:
>>
>> BaseMark X 1.0:                324.4MB -> 306.0MB (-18.4MB;  5.7% saving)
>> BaseMark X 1.1 Medium Quality: 206.9MB -> 201.2MB (- 5.7MB;  2.8% saving)
>> GFXBench 3.0 TRex:             216.6MB -> 200.0MB (-16.6MB;  7.7% saving)
>> GFXBench 3.0 Manhattan:        281.4MB -> 246.8MB (-34.6MB; 12.3% saving)
>>
>> No performance change was seen on BaseMark X. GFXBench 3.0 showed small
>> performance increases (~0.5fps on Manhattan, ~1-2fps on TRex), which may
>> be due to reduced activity of the OOM killer.
>
> The principle for rounding up was to increase the cache hit rate and
> thereby reduce allocations. Might be interesting to know whether the
> number of BOs allocated also changes. If not, the argument is that the
> working set is pretty stable and has a natural set of sizes which it
> reuses. A counterexample might then be uxa, glamor or compositors, which
> off the top of my head would have more variable object sizes.
>
> Reducing the impact of thrashing should itself be measurable, and a
> useful statistic to track.
>
> As a corollary to exact allocations, you can then reduce the number of
> buckets again (the number was increased to allow finer-grained
> allocations). Again, it is hard to judge whether handing back larger
> objects will lead to memory wastage. So yet another statistic to track
> is requested versus allocated memory sizes.
>
Wouldn't reducing the number of buckets lead to more memory wastage?

The current bucket sizes are:
Bucket[0]: 4K
Bucket[1]: 8K
Bucket[2]: 12K
Bucket[3]: 16K
Bucket[4]: 20K
Bucket[5]: 24K
Bucket[6]: 28K
Bucket[7]: 32K
Bucket[8]: 40K
Bucket[9]: 48K
Bucket[10]: 56K
Bucket[11]: 64K
Bucket[12]: 80K
Bucket[13]: 96K
Bucket[14]: 112K
Bucket[15]: 128K
Bucket[16]: 160K
Bucket[17]: 192K
Bucket[18]: 224K
Bucket[19]: 256K
...
...
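
If I read init_cache_buckets() correctly, these come from fixed 4K/8K/12K
buckets followed by four buckets per power of two, with the cache capped
at 64MB. A tiny standalone snippet that reproduces the progression (for
illustration only):

#include <stdio.h>

int main(void)
{
        unsigned long size, max = 64 * 1024 * 1024;     /* cache_max_size */

        printf("%luK %luK %luK\n", 4UL, 8UL, 12UL);
        for (size = 4 * 4096; size <= max; size *= 2)
                printf("%luK %luK %luK %luK\n", size / 1024,
                       (size + size / 4) / 1024,
                       (size + size / 2) / 1024,
                       (size + 3 * size / 4) / 1024);
        return 0;
}
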

If there are many objects of size 132K, we would end up allocating 160K
for each of them. We can track requested vs. allocated sizes, but that
depends on the application and its usage. What would be the best way to
measure this: over a given time period, or by some other criterion?
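
If it helps, the simplest thing I can think of (just a sketch, names made
up) is a pair of running counters in the buffer manager, so the requested
vs. allocated totals can be dumped at any point, e.g. per frame or at
context destroy:

/* Sketch only: hypothetical counters for requested vs. allocated bytes. */
struct bo_alloc_stats {
        unsigned long long bytes_requested;     /* sizes callers asked for */
        unsigned long long bytes_allocated;     /* sizes actually handed back */
        unsigned long long num_allocs;
};

static void
bo_stats_record(struct bo_alloc_stats *stats,
                unsigned long requested, unsigned long allocated)
{
        stats->bytes_requested += requested;
        stats->bytes_allocated += allocated;
        stats->num_allocs++;
}
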

> Also it is important to state what type of system you are measuring the
> impact of allocations for -- the behaviour of a cache miss is
> dramatically different between LLC and non-LLC systems.

The current data is from a non-LLC system.

regards
Arun

> -Chris
>



