[PATCH 27/27] drm/amdgpu: Fix GTT size calculation

Mon Jul 15 06:44:45 UTC 2019

On 13.07.19 at 22:24, Felix Kuehling wrote:
On 2019-04-30 at 1:03 p.m., Koenig, Christian wrote:

The only real solution I can see is to be able to reliably kill shaders
in an OOM situation.

Well, we can in fact preempt our compute shaders with low latency.
Killing a KFD process will do exactly that.

I've taken a look at that thing as well and to be honest it is not even
remotely sufficient.

We need something which *immediately* stops the hardware from accessing
system memory, rather than waiting for the SQ to kill all waves, flush
caches etc...

One possibility I've been playing around with for a while is to replace
the root PD for the VMIDs in question on the fly. E.g. we just let it
point to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...
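
To make the root PD idea concrete, here is a minimal toy sketch in plain C. It is not driver code: struct toy_vm_hub, toy_redirect_vmids() and the dummy_pd field are hypothetical names invented for illustration, and the model ignores everything that makes this hard on real hardware (TLB invalidation, in-flight accesses, firmware rewriting the pointers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_VMIDS 16

/* Hypothetical stand-in for the per-VMID root page directory pointers. */
struct toy_vm_hub {
    uint64_t root_pd[TOY_NUM_VMIDS];
    uint64_t dummy_pd; /* assumed: a directory that maps everything to a scratch page */
};

/*
 * Point every VMID selected in vmid_mask at the dummy directory.
 * Returns the number of VMIDs that were redirected.
 */
static int toy_redirect_vmids(struct toy_vm_hub *hub, uint32_t vmid_mask)
{
    int redirected = 0;

    assert(hub != NULL);
    for (unsigned int i = 0; i < TOY_NUM_VMIDS; i++) {
        if (vmid_mask & (1u << i)) {
            hub->root_pd[i] = hub->dummy_pd;
            redirected++;
        }
    }
    return redirected;
}
```

The point of the sketch is only that the swap itself is a per-VMID pointer write; VMIDs outside the mask keep their original page tables.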

Warming up this thread, since I just fixed another bug that was enabled by artificial memory pressure due to the GTT limit.

I think disabling the PD for the VMIDs is a good idea. A problem is that HWS firmware updates PD pointers in the background for its VMIDs. So this would require a reliable and fast way to kill the HWS first.

Well, we don't necessarily need to completely kill the HWS. What we need is to suspend it, kill a specific process and resume it later on.

As far as I can see the concept with the HWS interaction was to use a ring buffer with async feedback when something is done.

That is really convenient for performant and reliable operation, but unfortunately not if you need to kill off some processing immediately.

So what is needed here is something like: set a bit in a register to suspend the HWS, kill the VMIDs, set a flag in the HWS runlist to stop it from scheduling that specific process again, and then resume the HWS.
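
As a rough illustration of that ordering, a toy model in plain C follows. All names (struct toy_hws, toy_kill_process(), the no_schedule flag) are hypothetical and exist only in this sketch; no such register bit or runlist flag is confirmed to exist in the current firmware interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROCS 8

/* Hypothetical runlist entry: the VMIDs a process owns plus a "skip me" flag. */
struct toy_proc {
    uint32_t vmid_mask;
    bool no_schedule;
};

/* Hypothetical scheduler state; "suspended" models the register bit. */
struct toy_hws {
    bool suspended;
    uint32_t live_vmids;
    struct toy_proc runlist[TOY_MAX_PROCS];
};

/* Suspend, kill the VMIDs, flag the runlist entry, resume. Returns 0 or -1. */
static int toy_kill_process(struct toy_hws *hws, unsigned int idx)
{
    assert(hws != NULL);
    if (idx >= TOY_MAX_PROCS)
        return -1;

    hws->suspended = true;                            /* 1. suspend the HWS */
    hws->live_vmids &= ~hws->runlist[idx].vmid_mask;  /* 2. kill its VMIDs */
    hws->runlist[idx].no_schedule = true;             /* 3. never schedule it again */
    hws->suspended = false;                           /* 4. resume the HWS */
    return 0;
}
```

The important property is that steps 2 and 3 happen while the scheduler cannot touch the VMIDs, so other processes on the runlist are left alone.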

An alternative I thought about is disabling bus access at the BIF level, if that's possible somehow. Basically we would instantaneously kill all GPU system memory access, signal all fences or just remove all fences from all BO reservations (reservation_object_add_excl_fence(resv, NULL)) to allow memory to be freed, let the OOM killer do its thing, and when the dust settles, reset the GPU.

Yeah, thought about that as well. The problem with this approach is that it is rather invasive.

E.g. stopping the BIF means stopping it for everybody, not just for the process which is currently being killed, and when we reset the GPU it is actually quite likely that we lose the content of VRAM.

Regards,
Christian.

Regards,
  Felix

Regards,
Christian.
