[PATCH 08/34] drm/amdkfd: fix kfd_suspend_all_processes for gfx941 debugging
Felix Kuehling
felix.kuehling at amd.com
Mon Mar 27 21:20:48 UTC 2023
On 2023-03-27 14:43, Jonathan Kim wrote:
> The debugger for GFX9.4.1 uses kfd_suspend_all_processes to pause the
> compute pipe line so it can safely toggle the SQ's implicit wait on
> barrier setting during debug attach/detach to work around the wave
> exception s_barrier race condition.
>
> For mGPU setups, repeated calls to cancel all outstanding restore work can
> result in an assymetric permanent cancelling of the restored work from the
> debug device after it has toggled the HW work around settings.
This is a bit hard to follow. Not sure what you mean by asymmetric.
I think this is a general bug in how kfd_suspend_all_processes and
kfd_resume_all_processes interact. The latter schedules restore work. If
that gets cancelled before it gets a chance to run, it will result in
the queues staying preempted forever. It just happened that the barrier
waitcount setting workaround on GFXv9.4.1 was good at triggering the bug.
I would simplify the description like this:
> Flush delayed restore work in kfd_suspend_all_queues instead of
> cancelling. Cancelling the work before it runs results in the queues
> becoming permanently disabled. Flushing the work ensures that the
> queue suspend/resume state stays balanced.
With the updated description, the patch is
Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
> Instead of cancelling the outstanding restore work, just flush it as it
> will be properly evicted anyways by the current suspend call.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 1e3795e7e18d..55a4ddd35e12 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -2008,7 +2008,7 @@ void kfd_suspend_all_processes(void)
> WARN(debug_evictions, "Evicting all processes");
> hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
> cancel_delayed_work_sync(&p->eviction_work);
> - cancel_delayed_work_sync(&p->restore_work);
> + flush_delayed_work(&p->restore_work);
>
> if (kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
> pr_err("Failed to suspend process 0x%x\n", p->pasid);
More information about the dri-devel
mailing list