<div dir="auto"><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Mar 28, 2017 3:07 AM, "Michel Dänzer" <<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="quoted-text">On 27/03/17 07:29 PM, Marek Olšák wrote:<br>
> On Mar 27, 2017 9:35 AM, "Michel Dänzer" <<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>> wrote:<br>
</div><div class="elided-text">
><br>
> On 25/03/17 01:33 AM, Marek Olšák wrote:<br>
> > Hi,<br>
> ><br>
> > I'm sharing this idea here, because it's something that has been<br>
> > decreasing our performance a lot recently, for example:<br>
> ><br>
> <a href="http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa" rel="noreferrer" target="_blank">http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa</a><br>
> ><br>
> > I think the problem there is that Mesa git started uploading<br>
> > descriptors and uniforms to VRAM, which helps when TC L2 has a low<br>
> > hit/miss ratio, but the performance can randomly drop by an order of<br>
> > magnitude. I've heard rumours that kernel 4.11 has an improved<br>
> > allocator that should perform better, but the situation is still far<br>
> > from ideal.<br>
> ><br>
> > AMD CPUs and APUs will hopefully suffer less, because we can resize<br>
> > the visible VRAM with the help of our CPU hw specs, but Intel CPUs<br>
> > will remain limited to 256 MB. The following plan describes how to do<br>
> > throttling for visible VRAM evictions.<br>
> ><br>
> ><br>
> > 1) Theory<br>
> ><br>
> > Initially, the driver doesn't care about where buffers are in VRAM,<br>
> > because VRAM buffers are only moved to visible VRAM on CPU page faults<br>
> > (when the CPU touches the buffer memory but the memory is in the<br>
> > invisible part of VRAM). When it happens,<br>
> > amdgpu_bo_fault_reserve_notify is called, which moves the buffer to<br>
> > visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify<br>
> > also marks the buffer as contiguous, which makes memory fragmentation<br>
> > worse.<br>
> ><br>
> > I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify<br>
> > was much higher in a CPU profiler than anything else in the kernel.<br>
> ><br>
> ><br>
> > 2) Monitoring via Gallium HUD<br>
> ><br>
> > We need to expose 2 kernel counters via the INFO ioctl and display<br>
> > those via Gallium HUD:<br>
> > - The number of VRAM CPU page faults (i.e. the number of calls to<br>
> > amdgpu_bo_fault_reserve_notify).<br>
> > - The number of bytes moved by ttm_bo_validate inside<br>
> > amdgpu_bo_fault_reserve_notify.<br>
> ><br>
> > This will help us observe what exactly is happening and fine-tune the<br>
> > throttling when it's done.<br>
> ><br>
> ><br>
> > 3) Solution<br>
> ><br>
> > a) When amdgpu_bo_fault_reserve_notify is called, record the fact.<br>
> > (amdgpu_bo::had_cpu_page_fault = true)<br>
> ><br>
> > b) Monitor the MB/s rate at which buffers are moved by<br>
> > amdgpu_bo_fault_reserve_notify. If we get above a specific threshold,<br>
> > don't move the buffer to visible VRAM. Move it to GTT instead. Note<br>
> > that moving to GTT can be cheaper, because moving to visible VRAM is<br>
> > likely to evict a lot of buffers there and unmap them from the CPU,<br>
><br>
> FWIW, this can be avoided by only setting GTT in busy_placement. Then<br>
> TTM will only move the BO to visible VRAM if that can be done without<br>
> evicting anything from there.<br>
><br>
><br>
> > but moving to GTT shouldn't evict or unmap anything.<br>
> ><br>
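The GTT-only busy_placement idea above can be illustrated with a self-contained toy model. The struct and function names below are invented for illustration and are not the real TTM/amdgpu API; the point is only that a fallback ("busy") placement lets validation succeed without evicting anything from visible VRAM:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of TTM-style placement handling; names are illustrative,
 * not the actual kernel structures. */
enum place { PLACE_VISIBLE_VRAM, PLACE_GTT };

struct toy_placement {
    enum place placement[2];      /* preferred placements, in order */
    int num_placement;
    enum place busy_placement[2]; /* fallbacks used instead of evicting */
    int num_busy_placement;
};

/* Pick a placement: try the preferred ones that have free space; if
 * none fit, fall back to busy_placement rather than evicting other
 * BOs.  With busy_placement = { GTT }, a BO only lands in visible VRAM
 * when that needs no eviction at all. */
static enum place toy_validate(const struct toy_placement *p,
                               size_t bo_size,
                               size_t free_visible_vram,
                               size_t free_gtt,
                               bool *evicted_others)
{
    *evicted_others = false;
    for (int i = 0; i < p->num_placement; i++) {
        size_t avail = p->placement[i] == PLACE_VISIBLE_VRAM ?
                       free_visible_vram : free_gtt;
        if (bo_size <= avail)
            return p->placement[i];
    }
    /* All preferred placements are full.  With a GTT-only
     * busy_placement we move to GTT; without one, we would have to
     * evict from the preferred placement. */
    if (p->num_busy_placement > 0)
        return p->busy_placement[0];
    *evicted_others = true;
    return p->placement[0];
}
```

In this model, dropping visible VRAM from busy_placement is exactly what makes the full-VRAM case cheap: the BO falls through to GTT instead of triggering evictions.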
> > c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,<br>
> > it can be moved to VRAM if:<br>
> > - the GTT->VRAM move rate is low enough to allow it (this is the<br>
> > existing throttling mechanism)<br>
> > - the visible VRAM move rate is low enough that we will be OK with<br>
> > another CPU page fault if it happens.<br>
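Steps (b) and (c) amount to a byte-rate budget. A minimal userspace sketch, assuming a hypothetical per-device counter, a made-up 64 MB/s threshold, and a one-second window (the real limit, time source, and field names in amdgpu would differ):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device accounting of bytes moved into visible VRAM
 * by CPU page faults, reset once per one-second window. */
struct fault_throttle {
    uint64_t bytes_this_sec;   /* bytes moved in the current window */
    uint64_t window_start_sec; /* timestamp of the window start */
    uint64_t limit_per_sec;    /* e.g. 64 MB/s; tunable */
};

static void throttle_tick(struct fault_throttle *t, uint64_t now_sec)
{
    if (now_sec != t->window_start_sec) {
        t->window_start_sec = now_sec;
        t->bytes_this_sec = 0;
    }
}

/* (b): on a CPU page fault, move the BO to visible VRAM only while
 * under budget; otherwise place it in GTT instead. */
static bool fault_may_use_visible_vram(struct fault_throttle *t,
                                       uint64_t bo_size, uint64_t now_sec)
{
    throttle_tick(t, now_sec);
    if (t->bytes_this_sec + bo_size > t->limit_per_sec)
        return false; /* over budget: fall back to GTT */
    t->bytes_this_sec += bo_size;
    return true;
}

/* (c): in the CS ioctl, a BO with had_cpu_page_fault set may go back
 * to VRAM only if the existing GTT->VRAM move throttling allows it
 * AND the visible VRAM fault-move rate leaves headroom for another
 * page fault. */
static bool cs_may_promote_to_vram(bool gtt_vram_rate_ok,
                                   struct fault_throttle *t,
                                   uint64_t bo_size, uint64_t now_sec)
{
    throttle_tick(t, now_sec);
    return gtt_vram_rate_ok &&
           t->bytes_this_sec + bo_size <= t->limit_per_sec;
}
```

The window-based counter is the simplest possible rate estimate; a real implementation might prefer a decaying average so a burst right before a window boundary doesn't slip through.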
><br>
> Some other ideas that might be worth trying:<br>
><br>
> In principle, evicting BOs to GTT instead of moving them to CPU<br>
> accessible VRAM, either in some cases (e.g. for all BOs except those<br>
> with AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED) or even always.<br>
><br>
><br>
> I've tried this and it made things even worse.<br>
<br>
</div>What exactly did you try?<br></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">I only set the placement to GTT, but I think I kept the contiguous flag.</div><div dir="auto"><br></div><div dir="auto">Marek</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="elided-text"><br>
<br>
--<br>
Earthling Michel Dänzer | <a href="http://www.amd.com" rel="noreferrer" target="_blank">http://www.amd.com</a><br>
Libre software enthusiast | Mesa and X developer<br>
</div></blockquote></div><br></div></div></div>