<div dir="ltr">Hi Michel,<div><br></div><div>When it happens, the main thread of our GL-based app is stuck in an ioctl(RADEON_CS). I set RADEON_THREAD=false to ease debugging, but the same thing happens with true. The only other threads are si_shader:0,1,2,3, and they are idle, just waiting for jobs. I can also run sudo gdb -p $(pidof Xorg) to block the X11 server, to make sure there is no ping-pong between two processes. No other process loads dri/radeonsi_dri.so. Adding a few traces shows that the above ioctl call loops forever at <a href="https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/ttm/ttm_bo.c#L819" target="_blank">https://github.com/torvalds/<wbr>linux/blob/master/drivers/gpu/<wbr>drm/ttm/ttm_bo.c#L819</a> and is issued from Mesa at <a href="https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454">https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/winsys/radeon/drm/radeon_drm_cs.c#n454</a>. </div><div><br></div><div><div style="font-size:12.8px">After adding more traces I can see that the BO being evicted indefinitely has the RADEON_GEM_NO_CPU_ACCESS flag set.</div><div style="font-size:12.8px">It gets three potential placements from "radeon_evict_flags":</div><div style="font-size:12.8px"> 1: VRAM, CPU inaccessible, fpfn is 65536</div><div style="font-size:12.8px"> 2: VRAM, CPU accessible, fpfn is 0</div><div style="font-size:12.8px"> 3: GTT, fpfn is 0</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">It looks like the move to the second placement succeeds every time. 
So I might be wrong, but it looks like this is not even a ping-pong between CPU-accessible and CPU-inaccessible VRAM: the BO just keeps being blitted within the CPU-accessible part of VRAM.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Maybe radeon_evict_flags should simply not add the second placement if the BO's current placement is already in CPU-accessible VRAM.</div><div style="font-size:12.8px">Or it could be a bug in get_node, which should not succeed in that case.</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Note that this happens when VRAM is nearly full.</div><div style="font-size:12.8px"><br></div><div><span style="font-size:12.8px">FWIW, I noticed that amdgpu does something different: <a href="https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205">https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c#L205</a></span></div><div><span style="font-size:12.8px">vs</span></div><div><span style="font-size:12.8px"><a href="https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198">https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/radeon/radeon_ttm.c#L198</a> </span></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">Finally, neither the NMI watchdog nor the kernel soft-lockup and hard-lockup detectors catch this looping in ioctl(RADEON_CS), maybe because the kernel considers it to be doing real work. The same goes for radeon_lockup_timeout: it does not detect it either.</div></div><div><br></div><div>The GPU is a FirePro W600 Cape Verde 2048M.</div><div><br></div><div>Thx</div><div>Julien</div>
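<div><br></div><div>P.S. To make the first suggestion concrete, here is a toy user-space model of the placement selection described above. It is only a sketch: none of the toy_* names exist in the kernel, and the assumption that the CPU-inaccessible part of VRAM is full while the CPU-accessible part always has room is mine, chosen to reproduce what the traces show.</div>
<pre>
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, none is kernel API. */
enum toy_domain { TOY_VRAM, TOY_GTT };

struct toy_placement {
    enum toy_domain domain;
    unsigned long fpfn;            /* lowest page the BO may be placed at */
};

struct toy_bo {
    enum toy_domain domain;
    unsigned long start;           /* first page currently occupied */
};

#define TOY_VISIBLE_PAGES 65536UL  /* fpfn seen in the traces for placement 1 */

/* Build the eviction placement list in the order seen in the traces.
 * With skip_visible set, the CPU-accessible VRAM entry is dropped when the
 * BO already sits in that region (the first suggestion in this mail). */
static size_t toy_evict_placements(const struct toy_bo *bo, bool skip_visible,
                                   struct toy_placement out[3])
{
    bool in_visible = bo->domain == TOY_VRAM && bo->start < TOY_VISIBLE_PAGES;
    size_t n = 0;

    out[n++] = (struct toy_placement){ TOY_VRAM, TOY_VISIBLE_PAGES };
    if (!(skip_visible && in_visible))
        out[n++] = (struct toy_placement){ TOY_VRAM, 0 };
    out[n++] = (struct toy_placement){ TOY_GTT, 0 };
    assert(n <= 3);
    return n;
}

/* Assumed memory state: CPU-inaccessible VRAM is full, while the
 * CPU-accessible part and GTT can always satisfy the request. */
static bool toy_can_place(const struct toy_placement *p)
{
    return !(p->domain == TOY_VRAM && p->fpfn >= TOY_VISIBLE_PAGES);
}

/* Evict repeatedly; return the pass on which the BO leaves CPU-accessible
 * VRAM, or -1 if it is still there after max_passes. */
static int toy_evict(struct toy_bo *bo, bool skip_visible, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        struct toy_placement pl[3];
        size_t n = toy_evict_placements(bo, skip_visible, pl);

        for (size_t i = 0; i < n; i++) {
            if (!toy_can_place(&pl[i]))
                continue;
            bo->domain = pl[i].domain;
            bo->start = pl[i].fpfn;   /* pretend the BO lands at fpfn */
            break;
        }
        if (bo->domain != TOY_VRAM || bo->start >= TOY_VISIBLE_PAGES)
            return pass;
    }
    return -1;
}
```
</pre>
<div>Under those assumptions, toy_evict() with skip_visible=false never gets the BO out of CPU-accessible VRAM, while with skip_visible=true the BO falls through to GTT on the first pass.</div>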
<div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 23, 2017 at 8:10 AM, Michel Dänzer <span dir="ltr">&lt;<a href="mailto:michel@daenzer.net" target="_blank">michel@daenzer.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-m_2929675273229796860gmail-">On 23/03/17 03:19 AM, Zachary Michaels wrote:<br>
> We were experiencing an infinite loop due to VRAM bos getting added back<br>
> to the VRAM lru on eviction via ttm_bo_mem_force_space,<br>
<br>
</span>Can you share more details about what happened? I can imagine that<br>
moving a BO from CPU visible to CPU invisible VRAM would put it back on<br>
the LRU, but next time around it shouldn't hit this code anymore but get<br>
evicted to GTT directly.<br>
<br>
Was userspace maybe performing concurrent CPU access to the BOs in question?<br>
<span class="gmail-m_2929675273229796860gmail-"><br>
<br>
> and reverting this commit solves the problem.<br>
<br>
</span>I hope we can find a better solution.<br>
<span class="gmail-m_2929675273229796860gmail-HOEnZb"><font color="#888888"><br>
<br>
--<br>
Earthling Michel Dänzer | <a href="http://www.amd.com" rel="noreferrer" target="_blank">http://www.amd.com</a><br>
Libre software enthusiast | Mesa and X developer<br>
<br>
</font></span></blockquote></div>
</div></div>