<div dir="auto"><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Mar 28, 2017 3:07 AM, "Michel Dänzer" &lt;<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>&gt; wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="quoted-text">On 27/03/17 07:29 PM, Marek Olšák wrote:<br> > On Mar 27, 2017 9:35 AM, "Michel Dänzer" &lt;<a href="mailto:michel@daenzer.net">michel@daenzer.net</a><br> </div><div class="elided-text">> &lt;mailto:<a href="mailto:michel@daenzer.net">michel@daenzer.net</a>&gt;&gt; wrote:<br> ><br> > On 25/03/17 01:33 AM, Marek Olšák wrote:<br> > > Hi,<br> > ><br> > > I'm sharing this idea here, because it's something that has been<br> > > decreasing our performance a lot recently, for example:<br> > ><br> > <a href="http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa" rel="noreferrer" target="_blank">http://openbenchmarking.org/<wbr>prospect/1703011-RI-<wbr>RADEONDIR06/<wbr>7b7668cfc109d1c3dc27e871c8aea7<wbr>1ca13f23fa</a><br> > &lt;<a href="http://openbenchmarking.org/prospect/1703011-RI-RADEONDIR06/7b7668cfc109d1c3dc27e871c8aea71ca13f23fa" rel="noreferrer" target="_blank">http://openbenchmarking.org/<wbr>prospect/1703011-RI-<wbr>RADEONDIR06/<wbr>7b7668cfc109d1c3dc27e871c8aea7<wbr>1ca13f23fa</a>&gt;<br> > ><br> > > I think the problem there is that Mesa git started uploading<br> > > descriptors and uniforms to VRAM, which helps when TC L2 has a low<br> > > hit/miss ratio, but the performance can randomly drop by an order of<br> > > magnitude. I've heard rumours that kernel 4.11 has an improved<br> > > allocator that should perform better, but the situation is still far<br> > > from ideal.<br> > ><br> > > AMD CPUs and APUs will hopefully suffer less, because we can resize<br> > > the visible VRAM with the help of our CPU hw specs, but Intel CPUs<br> > > will remain limited to 256 MB.
The following plan describes how to do<br> > > throttling for visible VRAM evictions.<br> > ><br> > ><br> > > 1) Theory<br> > ><br> > > Initially, the driver doesn't care about where buffers are in VRAM,<br> > > because VRAM buffers are only moved to visible VRAM on CPU page faults<br> > > (when the CPU touches the buffer memory but the memory is in the<br> > > invisible part of VRAM). When it happens,<br> > > amdgpu_bo_fault_reserve_notify is called, which moves the buffer to<br> > > visible VRAM, and the app continues. amdgpu_bo_fault_reserve_notify<br> > > also marks the buffer as contiguous, which makes memory fragmentation<br> > > worse.<br> > ><br> > > I verified this with DiRT Rally where amdgpu_bo_fault_reserve_notify<br> > > was much higher in a CPU profiler than anything else in the kernel.<br> > ><br> > ><br> > > 2) Monitoring via Gallium HUD<br> > ><br> > > We need to expose 2 kernel counters via the INFO ioctl and display<br> > > those via Gallium HUD:<br> > > - The number of VRAM CPU page faults. (the number of calls to<br> > > amdgpu_bo_fault_reserve_<wbr>notify).<br> > > - The number of bytes moved by ttm_bo_validate inside<br> > > amdgpu_bo_fault_reserve_<wbr>notify.<br> > ><br> > > This will help us observe what exactly is happening and fine-tune the<br> > > throttling when it's done.<br> > ><br> > ><br> > > 3) Solution<br> > ><br> > > a) When amdgpu_bo_fault_reserve_notify is called, record the fact.<br> > > (amdgpu_bo::had_cpu_page_fault = true)<br> > ><br> > > b) Monitor the MB/s rate at which buffers are moved by<br> > > amdgpu_bo_fault_reserve_<wbr>notify. If we get above a specific threshold,<br> > > don't move the buffer to visible VRAM. Move it to GTT instead. Note<br> > > that moving to GTT can be cheaper, because moving to visible VRAM is<br> > > likely to evict a lot of buffers there and unmap them from the CPU,<br> ><br> > FWIW, this can be avoided by only setting GTT in busy_placement. 
Then<br> > TTM will only move the BO to visible VRAM if that can be done without<br> > evicting anything from there.<br> ><br> ><br> > > but moving to GTT shouldn't evict or unmap anything.<br> > ><br> > > c) When we get into the CS ioctl and a buffer has had_cpu_page_fault,<br> > > it can be moved to VRAM if:<br> > > - the GTT->VRAM move rate is low enough to allow it (this is the<br> > > existing throttling mechanism)<br> > > - the visible VRAM move rate is low enough that we will be OK with<br> > > another CPU page fault if it happens.<br> ><br> > Some other ideas that might be worth trying:<br> ><br> > Evicting BOs to GTT instead of moving them to CPU accessible VRAM in<br> > principle in some cases (e.g. for all BOs except those with<br> > AMDGPU_GEM_CREATE_CPU_ACCESS_<wbr>REQUIRED) or even always.<br> ><br> ><br> > I've tried this and it made things even worse.<br> <br> </div>What exactly did you try?<br></blockquote></div></div></div><div dir="auto"><br></div><div dir="auto">I only set the placement to GTT, but I think I kept the contiguous flag.</div><div dir="auto"><br></div><div dir="auto">Marek</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div class="elided-text"><br> <br> --<br> Earthling Michel Dänzer | <a href="http://www.amd.com" rel="noreferrer" target="_blank">http://www.amd.com</a><br> Libre software enthusiast | Mesa and X developer<br> </div></blockquote></div><br></div></div></div>
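<div dir="auto"><br></div><div dir="auto">For reference, the throttling decision from step 3b of the quoted plan could look roughly like the sketch below. It is a standalone illustration, not kernel code: the <code>move_throttle</code> struct, the fixed accounting window, the threshold values and the function names are all hypothetical, and the real decision would live in or near amdgpu_bo_fault_reserve_notify.</div>

```c
#include <stdint.h>

/* Hypothetical placement choice for a BO that took a CPU page fault. */
enum placement { PLACE_VISIBLE_VRAM, PLACE_GTT };

/* Hypothetical per-device accounting of bytes moved on CPU page faults. */
struct move_throttle {
    uint64_t window_start_us;      /* start of the current accounting window */
    uint64_t window_len_us;        /* window length, e.g. one second */
    uint64_t bytes_in_window;      /* bytes moved to visible VRAM in this window */
    uint64_t max_bytes_per_window; /* assumed tunable threshold */
};

static void throttle_init(struct move_throttle *t, uint64_t window_len_us,
                          uint64_t max_bytes_per_window)
{
    t->window_start_us = 0;
    t->window_len_us = window_len_us;
    t->bytes_in_window = 0;
    t->max_bytes_per_window = max_bytes_per_window;
}

/*
 * Decide where a faulting BO of bo_size bytes should go.
 * Below the threshold, the move to visible VRAM is allowed and accounted.
 * Above it, the BO goes to GTT instead and nothing is accounted.
 */
static enum placement choose_fault_placement(struct move_throttle *t,
                                             uint64_t now_us, uint64_t bo_size)
{
    /* Start a new window once the current one has expired. */
    if (now_us - t->window_start_us >= t->window_len_us) {
        t->window_start_us = now_us;
        t->bytes_in_window = 0;
    }

    if (t->bytes_in_window + bo_size > t->max_bytes_per_window)
        return PLACE_GTT;

    t->bytes_in_window += bo_size;
    return PLACE_VISIBLE_VRAM;
}
```

<div dir="auto">A fixed window is the simplest way to approximate an MB/s rate. The two counters proposed for Gallium HUD in section 2 would be the natural way to pick the threshold.</div>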