<!DOCTYPE html> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> Am 25.09.24 um 16:53 schrieb Philipp Stanner: <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <pre class="moz-quote-pre" wrap="">On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 20.09.24 um 15:26 schrieb Philipp Stanner: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Am 20.09.24 um 10:57 schrieb Philipp Stanner: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote: </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Tearing down the scheduler with jobs still on the pending list can lead to use after free issues. Add a warning if drivers try to destroy a scheduler which still has work pushed to the HW. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Did you have time yet to look into my proposed waitque- solution? </pre> </blockquote> <pre class="moz-quote-pre" wrap="">I don't remember seeing anything. What have I missed? </pre> </blockquote> <pre class="moz-quote-pre" wrap=""><a class="moz-txt-link-freetext" href="https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/">https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/</a> </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Mhm, I didn't got that in my inbox for some reason. Interesting approach, I'm just not sure if we can or should wait in drm_sched_fini(). </pre> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> We do agree that jobs still pending when drm_sched_fini() starts is always a bug, right?</pre> </blockquote> Correct, the question is how to avoid that. <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <pre class="moz-quote-pre" wrap="">If so, what are the disadvantages of waiting in drm_sched_fini()? We could block buggy drivers as I see it. Which wouldn't be good, but could then be fixed on drivers' site.</pre> </blockquote> Sima explained that pretty well: Don't block in fops->close, do that in fops->flush instead. One issue this solves is that when you send a SIGTERM the tear down handling first flushes all the FDs and then closes them. So if flushing the FDs blocks because the process initiated sending a terabyte of data over a 300bps line (for example) you can still throw a SIGKILL and abort that as well. If you would block in fops-close() that SIGKILL won't have any effect any more because by the time close() is called the process is gone and signals are already blocked. And yes when I learned about that issue I was also buffed that handling like this in the UNIX design is nearly 50 years old and still applies to today. <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Probably better to make that a separate function, something like drm_sched_flush() or similar. </pre> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> We could do that. Such a function could then be called by drivers which are not sure whether all jobs are done before they start tearing down.</pre> </blockquote> Yes exactly that's the idea. And give that flush function a return code so that it can return -EINTR. <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Yeah I don't think we should smash this into drm_sched_fini unconditionally. I think conceptually there's about three cases: - Ringbuffer schedules. Probably want everything as-is, because drm_sched_fini is called long after all the entities are gone in drm_device cleanup. - fw scheduler hardware with preemption support. There we probably want to nuke the context by setting the tdr timeout to zero (or maybe just as long as context preemption takes to be efficient), and relying on the normal gpu reset flow to handle things. drm_sched_entity_flush kinda does this, except not really and it's a lot more focused on the ringbuffer context. So maybe we want a new drm_sched_entity_kill. For this case calling drm_sched_fini() after the 1:1 entity is gone should not find any linger jobs, it would actually be a bug somewhere if there's a job lingering. Maybe a sanity check that there's not just no jobs lingering, but also no entity left would be good here? </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> The check for lingering ones is in Christian's patch here IISC. At which position would you imagine the check for the entity being performed? </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> - fw scheduler without preemption support. There we kinda need the drm_sched_flush, except blocking in fops->close is not great. So instead I think the following is better: 1. drm_sched_entity_stopped, which only stops new submissions (for paranoia) but doesn't tear down the entity </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Who would call that function? Drivers using it voluntarily could just as well stop accepting new jobs from userspace to their entities, couldn't they? </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> 2. drm_dev_get 3. launch a worker which does a) drm_sched_flush (or drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini + drm_sched_fini c) drm_dev_put Note that semantically this implements the refcount in the other path from Phillip: <a class="moz-txt-link-freetext" href="https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/">https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/</a> Except it doesn't impose refcount on everyone else who doesn't need it, and it doesn't even impose refcounting on drivers that do need it because we use drm_sched_flush and a worker to achieve the same. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> I indeed wasn't happy with the refcount approach for that reason, agreed. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Essentially helper functions for the common use-cases instead of trying to solve them all by putting drm_sched_flush as a potentially very blocking function into drm_sched_fini. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> I'm still not able to see why it blocking would be undesired – as far as I can see, it is only invoked on driver teardown, so not during active operation. Teardown doesn't happen that often, and it can (if implemented correctly) only block until the driver's code has signaled the last fences. If that doesn't happen, the block would reveal a bug. But don't get me wrong: I don't want to *push* this solution. I just want to understand when it could become a problem. Wouldn't an explicitly blocking, separate function like drm_sched_flush() or drm_sched_fini_flush() be a small, doable step towards the right direction?</pre> </blockquote> I think that this is the right thing to do, yes. <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <pre class="moz-quote-pre" wrap=""> </pre> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">When there are still entities with jobs the situation is even worse since the dma_fences for those jobs can never signal we can just choose between potentially locking up core memory management and random memory corruption. When drivers really mess it up that well let them run into a BUG_ON(). Signed-off-by: Christian König<a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com"><christian.koenig@amd.com></a> --- drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index f093616fe53c..8a46fab5cdc8 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched) </pre> </blockquote> <pre class="moz-quote-pre" wrap="">I agree with Sima that it should first be documented in the function's docstring what the user is expected to have done before calling the function. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Good point, going to update the documentation as well. </pre> </blockquote> <pre class="moz-quote-pre" wrap="">Cool thing, thx. Would be great if everything (not totally trivial) necessary to be done before _fini() is mentioned. One could also think about providing a hint at how the driver can do that. AFAICS the only way for the driver to ensure that is to maintain its own, separate list of submitted jobs. </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> Even with a duplicated pending list it's actually currently impossible to do this fully cleanly. The problem is that the dma_fence object gives no guarantee when callbacks are processed, e.g. they can be both processed from interrupt context as well as from a CPU which called dma_fence_is_signaled(). So when a driver (or drm_sched_fini) waits for the last submitted fence it actually can be that the drm_sched object still needs to do some processing. See the hack in amdgpu_vm_tlb_seq() for more background on the problem. </pre> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> Oh dear ^^' We better work towards fixing that centrally Thanks, P. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> So I thought this should be fairly easy because of the sched hw/public fence split: If we wait for both all jobs to finish and for all the sched work/tdr work to finish, and we make sure there's no entity existing that's not yet stopped we should catch them all?</pre> </blockquote> </blockquote> Unfortunately not. Even when you do a dma_fence_wait() on the last submission it can still be that another CPU is executing the callbacks to wake up the scheduler work item and cleanup the job. That's one of the reasons why I think the design of keeping the job alive is so extremely awkward. The dma_fence as representation of the hw submission has a much better defined state machine and lifetime. Regards, Christian. <blockquote type="cite" cite="mid:a2ef4cdfeb31ad95de9311274de73a51cdc54a97.camel@redhat.com"> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Or at least I think it's a bug if any other code even tries to touch the hw fence. If you have any other driver code which relies on the rcu freeing then I think that's just a separate concern and we can ignore that here since the fences themselves will till get rcu-delay freed even if drm_sched_fini has finished. -Sima </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> Regards, Christian. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> P. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">Thanks, Christian. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">P. </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap=""> drm_sched_wqueue_stop(sched); + /* + * Tearing down the scheduler wile there are still unprocessed jobs can + * lead to use after free issues in the scheduler fence. + */ + WARN_ON(!list_empty(&sched->pending_list)); + for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched- </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">num_rqs; </pre> </blockquote> <pre class="moz-quote-pre" wrap="">i++) { struct drm_sched_rq *rq = sched- </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">sched_rq[i]; </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> spin_lock(&rq->lock); - list_for_each_entry(s_entity, &rq- </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">entities, </pre> </blockquote> <pre class="moz-quote-pre" wrap="">list) + list_for_each_entry(s_entity, &rq- </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">entities, </pre> </blockquote> <pre class="moz-quote-pre" wrap="">list) { + /* + * The justification for this BUG_ON() is that tearing + * down the scheduler while jobs are pending leaves + * dma_fences unsignaled. Since we have dependencies + * from the core memory management to eventually signal + * dma_fences this can trivially lead to a system wide + * stop because of a locked up memory management. + */ + BUG_ON(spsc_queue_count(&s_entity- </pre> <blockquote type="cite"> <pre class="moz-quote-pre" wrap="">job_queue)); </pre> </blockquote> <pre class="moz-quote-pre" wrap="">+ /* * Prevents reinsertion and marks job_queue as idle, * it will removed from rq in drm_sched_entity_fini * eventually */ s_entity->stopped = true; + } spin_unlock(&rq->lock); kfree(sched->sched_rq[i]); } </pre> </blockquote> </blockquote> </blockquote> </blockquote> </blockquote> <pre class="moz-quote-pre" wrap=""> </pre> </blockquote> <pre class="moz-quote-pre" wrap=""> </pre> </blockquote> </body> </html>