<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body text="#000000" bgcolor="#FFFFFF"> <div class="moz-cite-prefix">On 13.07.19 at 22:24, Felix Kuehling wrote: </div> <blockquote type="cite" cite="mid:44747586-8c98-6c85-dc5c-7464f59b205e@gmail.com"> <div class="moz-cite-prefix">On 2019-04-30 at 1:03 p.m., Koenig, Christian wrote: </div> <blockquote type="cite" cite="mid:f5c698ad-2aff-b3c5-2041-05a10983438a@amd.com"> <blockquote type="cite" style="color: #000000;"> <blockquote type="cite" style="color: #000000;"> <pre class="moz-quote-pre" wrap="">The only real solution I can see is to be able to reliably kill shaders in an OOM situation. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Well, we can in fact preempt our compute shaders with low latency. Killing a KFD process will do exactly that. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">I've taken a look at that thing as well and to be honest it is not even remotely sufficient. We need something which stops the hardware *immediately* from accessing system memory, rather than waiting for the SQ to kill all waves, flush caches etc... One possibility I've been playing around with for a while is to replace the root PD for the VMIDs in question on the fly. E.g. we just let it point to some dummy which redirects everything into nirvana. But implementing this is easier said than done...</pre> </blockquote> Warming up this thread, since I just fixed another bug that was enabled by artificial memory pressure due to the GTT limit. I think disabling the PD for the VMIDs is a good idea. A problem is that HWS firmware updates PD pointers in the background for its VMIDs. So this would require a reliable and fast way to kill the HWS first. </blockquote> Well, we don't necessarily need to completely kill the HWS. What we need is to suspend it, kill a specific process and resume it later on. 
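A rough sketch of that suspend/kill/resume ordering as a toy C model. Everything in it (toy_hws, toy_kill_process, the field names) is made up for illustration and is not real amdgpu, KFD or HWS firmware code; it only assumes that the HWS honours a suspend bit and a per-process "do not schedule again" flag.
<pre>
```c
/* Toy model only: every name here is hypothetical, none of it is real
 * amdgpu/KFD/HWS code. It illustrates the ordering, nothing more. */
#define NUM_VMIDS 8
#define NO_PROCESS (-1)

struct toy_hws {
    int suspended;              /* models a "suspend" register bit */
    int vmid_owner[NUM_VMIDS];  /* process id mapped to each VMID, or NO_PROCESS */
    int blocked_process;        /* process the scheduler must never pick again */
};

/* Set or clear the suspend bit. */
static void toy_hws_set_suspended(struct toy_hws *hws, int on)
{
    hws->suspended = on;
}

/* The scheduler may only map a process to a VMID while it is not
 * suspended, and never for the blocked process. Returns 1 on success. */
static int toy_hws_schedule(struct toy_hws *hws, int pid, int vmid)
{
    if (hws->suspended || pid == hws->blocked_process)
        return 0;
    hws->vmid_owner[vmid] = pid;
    return 1;
}

/* Tear down every VMID owned by the process; returns how many were hit.
 * Clearing the owner stands in for whatever really cuts the VMID off. */
static int toy_kill_vmids(struct toy_hws *hws, int pid)
{
    int vmid, killed = 0;

    for (vmid = 0; vmid != NUM_VMIDS; vmid++) {
        if (hws->vmid_owner[vmid] == pid) {
            hws->vmid_owner[vmid] = NO_PROCESS;
            killed++;
        }
    }
    return killed;
}

/* Full sequence: suspend, kill the VMIDs, block the process, resume. */
static int toy_kill_process(struct toy_hws *hws, int pid)
{
    int killed;

    toy_hws_set_suspended(hws, 1);
    killed = toy_kill_vmids(hws, pid);
    hws->blocked_process = pid;
    toy_hws_set_suspended(hws, 0);
    return killed;
}
```
</pre>
The point of keeping the scheduler suspended across the kill is that nothing can remap the VMIDs in the background while they are being torn down.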
As far as I can see, the concept behind the HWS interaction was to use a ring buffer with asynchronous feedback when something is done. That is really convenient for performant and reliable operation, but unfortunately not if you need to kill off some processing immediately. So what is needed here is something like: set a bit in a register to suspend the HWS, kill the VMIDs, set a flag in the HWS runlist to stop it from scheduling that specific process again, and then resume the HWS. <blockquote type="cite" cite="mid:44747586-8c98-6c85-dc5c-7464f59b205e@gmail.com"> An alternative I thought about is disabling bus access at the BIF level, if that's possible somehow. Basically we would instantaneously kill all GPU system memory access, signal all fences or just remove all fences from all BO reservations (reservation_object_add_excl_fence(resv, NULL)) to allow memory to be freed, let the OOM killer do its thing, and when the dust settles, reset the GPU. </blockquote> Yeah, I thought about that as well. The problem with this approach is that it is rather invasive. E.g. stopping the BIF means stopping it for everybody, not just for the process which is currently being killed, and when we reset the GPU it is actually quite likely that we lose the content of VRAM. Regards, Christian. <blockquote type="cite" cite="mid:44747586-8c98-6c85-dc5c-7464f59b205e@gmail.com"> Regards, Felix <blockquote type="cite" cite="mid:f5c698ad-2aff-b3c5-2041-05a10983438a@amd.com"> <pre class="moz-quote-pre" wrap="">Regards, Christian. </pre> </blockquote> </blockquote> </body> </html>