[PATCH 08/34] drm/amdkfd: fix kfd_suspend_all_processes for gfx941 debugging

Felix Kuehling felix.kuehling at amd.com
Mon Mar 27 21:20:48 UTC 2023


On 2023-03-27 14:43, Jonathan Kim wrote:
> The debugger for GFX9.4.1 uses kfd_suspend_all_processes to pause the
> compute pipe line so it can safely toggle the SQ's implicit wait on
> barrier setting during debug attach/detach to work around the wave
> exception s_barrier race condition.
>
> For mGPU setups, repeated calls to cancel all outstanding restore work can
> result in an assymetric permanent cancelling of the restored work from the
> debug device after it has toggled the HW work around settings.

This is a bit hard to follow. Not sure what you mean by asymmetric.

I think this is a general bug in how kfd_suspend_all_processes and 
kfd_resume_all_processes interact. The latter schedules restore work. If 
that gets cancelled before it gets a chance to run, it will result in 
the queues staying preempted forever. It just happened that the barrier 
waitcount setting workaround on GFXv9.4.1 was good at triggering the bug.

I would simplify the description like this:

> Flush delayed restore work in kfd_suspend_all_queues instead of 
> cancelling. Cancelling the work before it runs results in the queues 
> becoming permanently disabled. Flushing the work ensures that the 
> queue suspend/resume state stays balanced.
With the updated description, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>


> Instead of cancelling the outstanding restore work, just flush it as it
> will be properly evicted anyways by the current suspend call.
>
> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> index 1e3795e7e18d..55a4ddd35e12 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
> @@ -2008,7 +2008,7 @@ void kfd_suspend_all_processes(void)
>   	WARN(debug_evictions, "Evicting all processes");
>   	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>   		cancel_delayed_work_sync(&p->eviction_work);
> -		cancel_delayed_work_sync(&p->restore_work);
> +		flush_delayed_work(&p->restore_work);
>   
>   		if (kfd_process_evict_queues(p, KFD_QUEUE_EVICTION_TRIGGER_SUSPEND))
>   			pr_err("Failed to suspend process 0x%x\n", p->pasid);


More information about the amd-gfx mailing list