[PATCH] drm/amd/amdkfd: Fix kernel panic when reset failed and been triggered again

Mon Nov 15 16:06:51 UTC 2021

On 2021-11-14 at 12:55 p.m., shaoyunl wrote:
> In an SRIOV configuration, a reset may fail to bring the ASIC back to normal after
> stop_cpsch has already been called; start_cpsch is then never called, because there is
> no resume in this case.  When a reset is triggered again, the driver should avoid
> uninitializing a second time.
>
> Signed-off-by: shaoyunl <shaoyun.liu at amd.com>

If there is a possibility that stop_cpsch is called multiple times, I
think the check for that should be at the start of the function.
Something like:

    if (!dqm->sched_running)
        return 0;

Regards,
  Felix

> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index 42b2cc999434..bcc8980d77e0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -1228,12 +1228,14 @@ static int stop_cpsch(struct device_queue_manager *dqm)
>  	if (!dqm->is_hws_hang)
>  		unmap_queues_cpsch(dqm, KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES, 0);
>  	hanging = dqm->is_hws_hang || dqm->is_resetting;
> -	dqm->sched_running = false;
>  
> -	pm_release_ib(&dqm->packet_mgr);
> +	if (dqm->sched_running) {
> +		dqm->sched_running = false;
> +		pm_release_ib(&dqm->packet_mgr);
> +		kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> +		pm_uninit(&dqm->packet_mgr, hanging);
> +	}
>  
> -	kfd_gtt_sa_free(dqm->dev, dqm->fence_mem);
> -	pm_uninit(&dqm->packet_mgr, hanging);
>  	dqm_unlock(dqm);
>  
>  	return 0;