<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body><span class="vcard"><a class="email" href="mailto:andrey.grodzovsky@amd.com" title="Andrey Grodzovsky <andrey.grodzovsky@amd.com>"> <span class="fn">Andrey Grodzovsky</span></a> </span> changed <a class="bz_bug_link bz_status_NEW " title="NEW - deadlock occurs during GPU reset" href="https://bugs.freedesktop.org/show_bug.cgi?id=109692">bug 109692</a> <br> <table border="1" cellspacing="0" cellpadding="8"> <tr> <th>What</th> <th>Removed</th> <th>Added</th> </tr> <tr> <td style="text-align:right;">CC</td> <td> </td> <td>christian.koenig@amd.com </td> </tr></table> <p> <div> <b><a class="bz_bug_link bz_status_NEW " title="NEW - deadlock occurs during GPU reset" href="https://bugs.freedesktop.org/show_bug.cgi?id=109692#c9">Comment # 9</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - deadlock occurs during GPU reset" href="https://bugs.freedesktop.org/show_bug.cgi?id=109692">bug 109692</a> from <span class="vcard"><a class="email" href="mailto:andrey.grodzovsky@amd.com" title="Andrey Grodzovsky <andrey.grodzovsky@amd.com>"> <span class="fn">Andrey Grodzovsky</span></a> </span></b> <pre>Adding Christian to consult on a solution to the problem.

The deadlock happens because, on one hand, we take fence->lock in dma_fence_signal and then sched->job_list_lock in drm_sched_process_job, while on the other hand drm_sched_stop takes the locks in the reverse order: first sched->job_list_lock, then fence->lock when removing the cb with dma_fence_remove_callback.

I see 2 possible solutions:

1) Change ring_mirror_list to a lock-less list (e.g. an RCU list). The problem is that I don't think there is a readily available implementation of a doubly linked lock-less list, though I didn't search very hard.
2) Go back to removing a job from ring_mirror_list in drm_sched_job_finish. As I see it, for this to work we need to add a wait_queue (signal) to drm_sched_job that is signaled AFTER the job is removed from ring_mirror_list in drm_sched_job_finish. Then, in drm_sched_stop, while iterating over ring_mirror_list, you build a new list of all jobs in progress (those for which dma_fence_remove_callback returns false) and wait on all of their wait_queues. After that you can be sure they have removed themselves from ring_mirror_list, and you can proceed.

Specific to AMDGPU: to ensure the jobs are not freed in drm_sched_job_finish while we process them in drm_sched_stop, it's not enough to call cancel_delayed_work_sync(&sched->work_tdr) for this scheduler; it must also be done for all the other schedulers in the device, and even in the hive for the XGMI use case.</pre> </div> </p> <hr> <span>You are receiving this mail because:</span> <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>