[PATCH] Revert "drm/radeon: Try evicting from CPU accessible to inaccessible VRAM first"
jisorce at oblong.com
Thu Mar 23 09:26:24 UTC 2017
When it happens, the main thread of our GL-based app is stuck on an
ioctl(RADEON_CS). I set RADEON_THREAD=false to ease debugging, but the same
thing happens if true. Other threads are only si_shader:0,1,2,3 and are
doing nothing, just waiting for jobs. I can also do sudo gdb -p $(pidof
Xorg) to block the X11 server, to make sure there is no ping pong between 2
processes. None of the other processes load dri/radeonsi_dri.so. And
adding a few traces shows that the above ioctl call is looping forever on
drm/ttm/ttm_bo.c#L819 and comes from mesa
After adding even more traces I can see that the bo, which is being
indefinitely evicted, has the flag RADEON_GEM_NO_CPU_ACCESS.
And it gets 3 potential placements after calling "radeon_evict_flags".
1: VRAM cpu inaccessible, fpfn is 65536
2: VRAM cpu accessible, fpfn is 0
3: GTT, fpfn is 0
And it looks like it continuously succeeds in moving to the second placement.
So I might be wrong, but it looks like it is not even a ping-pong between VRAM
accessible / not accessible; it just keeps being blitted within the CPU
accessible part of the VRAM.
Maybe radeon_evict_flags should just not add the second placement if its
current placement is already VRAM cpu accessible.
Or it could be a bug in get_node, which should not succeed in that case.
Note that this happens when the VRAM is nearly full.
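
To make the suggestion above concrete, here is a minimal standalone model of the
eviction loop. It is not the radeon or TTM code: every name in it
(build_evict_placements, has_space, evict_until_gone, the enum and struct) is
hypothetical, and it simply assumes that the CPU inaccessible VRAM range is
full while the other two placements always have room, which is my reading of
the traces rather than something verified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only; these are not the radeon/TTM data structures. */
enum domain { VRAM_INVISIBLE, VRAM_VISIBLE, GTT };

struct placement {
    enum domain domain;
    unsigned fpfn; /* first page frame number of the allowed range */
};

/* Boundary between CPU accessible and inaccessible VRAM, taken from the
 * fpfn value seen in the traces (65536). */
#define VISIBLE_VRAM_PAGES 65536u

/* Build the candidate list for evicting a BO that currently lives in `cur`.
 * With skip_visible set, the CPU accessible VRAM placement is left out when
 * the BO is already there (the change suggested in the mail). */
size_t build_evict_placements(enum domain cur, bool skip_visible,
                              struct placement out[3])
{
    size_t n = 0;

    out[n++] = (struct placement){ VRAM_INVISIBLE, VISIBLE_VRAM_PAGES };
    if (!(skip_visible && cur == VRAM_VISIBLE))
        out[n++] = (struct placement){ VRAM_VISIBLE, 0 };
    out[n++] = (struct placement){ GTT, 0 };

    assert(n <= 3);
    return n;
}

/* Assumed allocator behaviour: CPU inaccessible VRAM is full, everything
 * else always has room. */
bool has_space(enum domain d)
{
    return d != VRAM_INVISIBLE;
}

/* Evict a BO that starts in CPU accessible VRAM until it leaves that range.
 * Returns the number of rounds needed, or -1 if it is still there after
 * max_rounds (the stand-in for "loops forever"). */
int evict_until_gone(bool skip_visible, int max_rounds)
{
    enum domain cur = VRAM_VISIBLE;

    for (int round = 1; round <= max_rounds; round++) {
        struct placement p[3];
        size_t n = build_evict_placements(cur, skip_visible, p);

        /* Take the first placement that has room. */
        for (size_t i = 0; i < n; i++) {
            if (has_space(p[i].domain)) {
                cur = p[i].domain;
                break;
            }
        }
        if (cur != VRAM_VISIBLE)
            return round;
    }
    return -1;
}
```

In this model evict_until_gone(false, 1000) returns -1, because the BO is
"moved" into CPU accessible VRAM on every round, while
evict_until_gone(true, 1000) returns 1, because the BO falls through to GTT
on the first round.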
FWIW I noticed that amdgpu is doing something different:
Finally, the NMI watchdog and the kernel soft lockup and hard lockup
detectors do not detect this looping in that ioctl(RADEON_CS). Maybe
because they consider it to be doing real work. Same for radeon_lockup_timeout,
it does not detect it.
The gpu is a FirePro W600 Cape Verde 2048M.
On Thu, Mar 23, 2017 at 8:10 AM, Michel Dänzer <michel at daenzer.net> wrote:
> On 23/03/17 03:19 AM, Zachary Michaels wrote:
> > We were experiencing an infinite loop due to VRAM bos getting added back
> > to the VRAM lru on eviction via ttm_bo_mem_force_space,
> Can you share more details about what happened? I can imagine that
> moving a BO from CPU visible to CPU invisible VRAM would put it back on
> the LRU, but next time around it shouldn't hit this code anymore but get
> evicted to GTT directly.
> Was userspace maybe performing concurrent CPU access to the BOs in
> > and reverting this commit solves the problem.
> I hope we can find a better solution.
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer