Plan: BO move throttling for visible VRAM evictions

Christian König deathsimple at vodafone.de
Fri Mar 24 16:45:05 UTC 2017


Am 24.03.2017 um 17:33 schrieb Marek Olšák:
> Hi,
>
> I'm sharing this idea here, because it's something that has been
> decreasing our performance a lot recently, for example:
> http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa
>
> I think the problem there is that Mesa git started uploading
> descriptors and uniforms to VRAM, which helps when TC L2 has a low
> hit/miss ratio, but the performance can randomly drop by an order of
> magnitude. I've heard rumours that kernel 4.11 has an improved
> allocator that should perform better, but the situation is still far
> from ideal.
>
> AMD CPUs and APUs will hopefully suffer less, because we can resize
> the visible VRAM with the help of our CPU hw specs, but Intel CPUs
> will remain limited to 256 MB. The following plan describes how to do
> throttling for visible VRAM evictions.
>
>
> 1) Theory
>
> Initially, the driver doesn't care about where buffers are in VRAM,
> because VRAM buffers are only moved to visible VRAM on CPU page faults
> (when the CPU touches the buffer memory but the memory is in the
> invisible part of VRAM). When it happens,
> amdgpu_bo_fault_reserve_notify is called, which moves the buffer to
> visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify
> also marks the buffer as contiguous, which makes memory fragmentation
> worse.
>
> I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify
> was much higher in a CPU profiler than anything else in the kernel.

Good to know that my expectations about this were correct.

How about fixing the need for contiguous buffers when CPU mapping them?

That should actually be pretty easy to do.
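
Roughly what I have in mind, just as a sketch (the function name and
visible_vram_pages() below are placeholders, not the actual code):

/* Sketch: on a CPU page fault, clamp the VRAM placement to the
 * CPU-visible range and revalidate, but without requesting a
 * physically contiguous allocation.  The CPU mapping goes through the
 * page tables page by page anyway, so contiguity isn't needed.
 */
static int fault_move_to_visible_vram(struct amdgpu_bo *abo)
{
        /* placeholder for the visible VRAM limit in pages */
        unsigned long lpfn = visible_vram_pages(abo);
        unsigned int i;

        for (i = 0; i < abo->placement.num_placement; i++) {
                if (!abo->placements[i].lpfn ||
                    abo->placements[i].lpfn > lpfn)
                        abo->placements[i].lpfn = lpfn;
                /* note: no contiguous flag is requested here */
        }

        return ttm_bo_validate(&abo->tbo, &abo->placement, false, false);
}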

> 2) Monitoring via Gallium HUD
>
> We need to expose 2 kernel counters via the INFO ioctl and display
> those via Gallium HUD:
> - The number of VRAM CPU page faults (i.e. the number of calls to
> amdgpu_bo_fault_reserve_notify).
> - The number of bytes moved by ttm_bo_validate inside
> amdgpu_bo_fault_reserve_notify.
>
> This will help us observe what exactly is happening and fine-tune the
> throttling when it's done.
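
Exposing those should be straightforward. Something like this would
probably be enough (struct and field names are made up, only meant to
illustrate):

/* Sketch: two counters, bumped in amdgpu_bo_fault_reserve_notify and
 * read out through a new INFO ioctl query.
 */
struct amdgpu_vram_fault_stats {
        atomic64_t num_cpu_page_faults; /* calls to amdgpu_bo_fault_reserve_notify */
        atomic64_t fault_bytes_moved;   /* bytes moved by ttm_bo_validate in there */
};

/* in the fault handler, around the ttm_bo_validate() call: */
        atomic64_inc(&adev->vram_fault_stats.num_cpu_page_faults);
        atomic64_add(bytes_moved, &adev->vram_fault_stats.fault_bytes_moved);
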
>
>
> 3) Solution
>
> a) When amdgpu_bo_fault_reserve_notify is called, record the fact
> (amdgpu_bo::had_cpu_page_fault = true).

What is that good for?

> b) Monitor the MB/s rate at which buffers are moved by
> amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,
> don't move the buffer to visible VRAM. Move it to GTT instead. Note
> that moving to GTT can be cheaper, because moving to visible VRAM is
> likely to evict a lot of buffers there and unmap them from the CPU,
> but moving to GTT shouldn't evict or unmap anything.

Yeah, had that idea as well. I've been working on adding a context to
TTM's BO validation call chain.

This way we could add a byte limit on how much TTM will try to evict
before returning -ENOMEM (or better, -ENOSPC).
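
To make both ideas a bit more concrete, roughly like this (the rate
helper, the limit and the ttm_operation_ctx below are only what I have
in mind, not existing code):

/* Sketch: in amdgpu_bo_fault_reserve_notify, fall back to GTT when the
 * visible VRAM move rate is already too high.
 */
        if (fault_vram_move_rate(adev) > visible_vram_move_limit(adev)) {
                /* moving to GTT shouldn't evict or unmap anything */
                amdgpu_ttm_placement_from_domain(abo, AMDGPU_GEM_DOMAIN_GTT);
                return ttm_bo_validate(&abo->tbo, &abo->placement, false, false);
        }

/* Sketch of the context for the validation call chain, so TTM can stop
 * evicting once a byte budget is used up:
 */
struct ttm_operation_ctx {
        bool interruptible;
        bool no_wait_gpu;
        u64  bytes_moved;       /* accounted by TTM while evicting */
        u64  bytes_moved_limit; /* return -ENOSPC once this is exceeded */
};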

> c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,
> it can be moved to VRAM if:
> - the GTT->VRAM move rate is low enough to allow it (this is the
> existing throttling mechanism)
> - the visible VRAM move rate is low enough that we will be OK with
> another CPU page fault if it happens.

Interesting idea, need to think a bit about it.

But I would say this has second priority; fixing the contiguous buffer
requirement should come first. Going to work on that next.
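
For reference, the CS side check from c) could then look something like
this (the rate/limit helpers are placeholders for whatever throttling
we end up with):

/* Sketch: decide in the CS ioctl whether a BO that has seen a CPU page
 * fault may be moved back to VRAM.
 */
static bool cs_may_move_to_vram(struct amdgpu_device *adev,
                                struct amdgpu_bo *abo)
{
        if (!abo->had_cpu_page_fault)
                return true;

        /* existing GTT->VRAM throttling */
        if (gtt_to_vram_move_rate(adev) > gtt_to_vram_move_limit(adev))
                return false;

        /* only risk another CPU page fault if the visible VRAM move
         * rate leaves enough headroom */
        if (fault_vram_move_rate(adev) > visible_vram_move_limit(adev))
                return false;

        return true;
}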

> d) The solution can be fine-tuned with the help of Gallium HUD to get
> the best performance under various scenarios. The current throttling
> mechanism can serve as an inspiration.
>
>
> That's it. Feel free to comment. I think this is our biggest
> performance bottleneck at the moment.

Yeah, completely agree.

Regards,
Christian.

>
> Marek