Plan: BO move throttling for visible VRAM evictions
maraeo at gmail.com
Fri Mar 24 16:57:11 UTC 2017
On Fri, Mar 24, 2017 at 5:45 PM, Christian König
<deathsimple at vodafone.de> wrote:
> On 24.03.2017 at 17:33, Marek Olšák wrote:
>> I'm sharing this idea here, because it's something that has been
>> decreasing our performance a lot recently, for example:
>> I think the problem there is that Mesa git started uploading
>> descriptors and uniforms to VRAM, which helps when TC L2 has a low
>> hit/miss ratio, but the performance can randomly drop by an order of
>> magnitude. I've heard rumours that kernel 4.11 has an improved
>> allocator that should perform better, but the situation is still far
>> from ideal.
>> AMD CPUs and APUs will hopefully suffer less, because we can resize
>> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
>> will remain limited to 256 MB. The following plan describes how to do
>> throttling for visible VRAM evictions.
>> 1) Theory
>> Initially, the driver doesn't care about where buffers are in VRAM,
>> because VRAM buffers are only moved to visible VRAM on CPU page faults
>> (when the CPU touches the buffer memory but the memory is in the
>> invisible part of VRAM). When it happens,
>> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
>> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
>> also marks the buffer as contiguous, which makes memory fragmentation
>> worse.
>> I verified this with DiRT Rally, where amdgpu_bo_fault_reserve_notify
>> ranked much higher in a CPU profiler than anything else in the kernel.
> Good to know that my expectations on this are correct.
> How about fixing the need for contiguous buffers when CPU mapping them?
> That should actually be pretty easy to do.
>> 2) Monitoring via Gallium HUD
>> We need to expose 2 kernel counters via the INFO ioctl and display
>> those via Gallium HUD:
>> - The number of VRAM CPU page faults (the number of calls to
>> amdgpu_bo_fault_reserve_notify).
>> - The number of bytes moved by ttm_bo_validate inside
>> amdgpu_bo_fault_reserve_notify.
>> This will help us observe what exactly is happening and fine-tune the
>> throttling when it's done.
>> 3) Solution
>> a) When amdgpu_bo_fault_reserve_notify is called, record the fact.
>> (amdgpu_bo::had_cpu_page_fault = true)
> What is that good for?
>> b) Monitor the MB/s rate at which buffers are moved by
>> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
>> don't move the buffer to visible VRAM. Move it to GTT instead. Note
>> that moving to GTT can be cheaper, because moving to visible VRAM is
>> likely to evict a lot of buffers there and unmap them from the CPU,
>> but moving to GTT shouldn't evict or unmap anything.
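Step b) can be sketched as a simple fixed-window byte budget. This is a minimal illustration under stated assumptions, not the driver's code: the types, the window scheme, and the function name are all hypothetical, and the real threshold and time source would come from the kernel.

```c
#include <stdint.h>

/* Hypothetical placement choice made in the CPU page fault handler. */
enum fault_placement { PLACE_VISIBLE_VRAM, PLACE_GTT };

/* Hypothetical fixed-window budget for fault-triggered moves. */
struct move_throttle {
    uint64_t window_start_ns;      /* start of the current window */
    uint64_t window_len_ns;        /* window length, e.g. one second */
    uint64_t bytes_in_window;      /* bytes moved to visible VRAM so far */
    uint64_t max_bytes_per_window; /* the threshold (assumed tunable) */
};

/*
 * Decide where a faulting buffer of `bo_size` bytes should go.
 * If moving it would push the window over budget, fall back to GTT,
 * which does not evict or unmap other buffers.
 */
static enum fault_placement
throttle_pick_placement(struct move_throttle *t, uint64_t now_ns,
                        uint64_t bo_size)
{
    /* Start a fresh window once the current one has elapsed. */
    if (now_ns - t->window_start_ns >= t->window_len_ns) {
        t->window_start_ns = now_ns;
        t->bytes_in_window = 0;
    }

    if (t->bytes_in_window + bo_size > t->max_bytes_per_window)
        return PLACE_GTT;

    t->bytes_in_window += bo_size;
    return PLACE_VISIBLE_VRAM;
}
```

A fixed window is the simplest choice here; a decaying average would smooth out bursts at window boundaries, at the cost of slightly more state.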
> Yeah, I had that idea as well. I've been working on adding a context to
> TTM's BO validation call chain.
> This way we could add a byte limit on how much TTM will try to evict before
> returning -ENOMEM (or, better, -ENOSPC).
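The byte limit described above might look roughly like this. It is a sketch only: `struct evict_ctx` and `evict_until_fits` are hypothetical names, not TTM's API, and the LRU list is modeled as a plain array of buffer sizes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-validation context carrying an eviction budget. */
struct evict_ctx {
    uint64_t bytes_evicted;     /* evicted so far in this call chain */
    uint64_t max_bytes_evicted; /* give up once this would be exceeded */
};

/*
 * Evict buffers in LRU order until `needed` bytes are free.
 * Returns 0 on success, or -ENOSPC if the budget or the LRU list
 * runs out first. `*num_evicted` reports how many buffers were evicted.
 */
static int evict_until_fits(struct evict_ctx *ctx, const uint64_t *lru_sizes,
                            size_t n, uint64_t needed, size_t *num_evicted)
{
    uint64_t freed = 0;
    size_t i = 0;

    while (freed < needed) {
        /* Nothing left to evict, or the next eviction would exceed
         * the budget: stop instead of thrashing. */
        if (i == n ||
            ctx->bytes_evicted + lru_sizes[i] > ctx->max_bytes_evicted) {
            *num_evicted = i;
            return -ENOSPC;
        }
        ctx->bytes_evicted += lru_sizes[i];
        freed += lru_sizes[i];
        i++;
    }

    *num_evicted = i;
    return 0;
}
```

On -ENOSPC, the caller could then fall back to a cheaper placement such as GTT rather than continuing to evict.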
>> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault
>> set, it can be moved to VRAM if:
>> - the GTT->VRAM move rate is low enough to allow it (this is the
>> existing throttling mechanism)
>> - the visible VRAM move rate is low enough that we will be OK with
>> another CPU page fault if it happens.
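The two conditions in c) can be written as a single predicate. The sketch below is illustrative only: the structs, the MB/s fields, and the limits are hypothetical, and it assumes that a buffer which never faulted is subject only to the existing GTT->VRAM throttle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-buffer state; mirrors amdgpu_bo::had_cpu_page_fault. */
struct bo_state {
    bool had_cpu_page_fault;
};

/* Hypothetical snapshot of the measured move rates and their limits. */
struct move_rates {
    uint32_t gtt_to_vram_mbps;        /* existing throttle: current rate */
    uint32_t gtt_to_vram_limit_mbps;  /* existing throttle: limit */
    uint32_t visible_vram_mbps;       /* fault-triggered move rate */
    uint32_t visible_vram_limit_mbps; /* limit for the new throttle */
};

/* May the CS ioctl move this buffer (back) to VRAM? */
static bool cs_may_move_to_vram(const struct bo_state *bo,
                                const struct move_rates *r)
{
    /* Existing mechanism: GTT->VRAM traffic must be low enough. */
    if (r->gtt_to_vram_mbps >= r->gtt_to_vram_limit_mbps)
        return false;

    /* New condition: for a buffer the CPU has touched, we must be able
     * to afford another CPU page fault if one happens. */
    if (bo->had_cpu_page_fault &&
        r->visible_vram_mbps >= r->visible_vram_limit_mbps)
        return false;

    return true;
}
```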
> Interesting idea, need to think a bit about it.
> But I would say this has second priority, fixing the contiguous buffer
> requirement should be first. Going to work on that next.
Interesting. I didn't know the contiguous setting wasn't required.