[PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

Felix Kuehling felix.kuehling at amd.com
Fri Oct 18 18:28:23 UTC 2024


On 2024-10-17 04:34, Victor Zhao wrote:
> Make sure the KFD_FENCE_INIT write to fence_addr is ordered before
> pm_send_query_status is called, to avoid a QCM fence timeout caused by
> incorrect write ordering.
>
> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
>   2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> index b2b16a812e73..d9264a353775 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
> @@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
>   		goto out;
>   
>   	*dqm->fence_addr = KFD_FENCE_INIT;
> +	mb();
>   	pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
>   				KFD_FENCE_COMPLETED);
>   	/* should be timed out */
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> index 09ab36f8e8c6..bddb169bb301 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
> @@ -260,7 +260,7 @@ struct device_queue_manager {
>   	uint16_t		vmid_pasid[VMID_NUM];
>   	uint64_t		pipelines_addr;
>   	uint64_t		fence_gpu_addr;
> -	uint64_t		*fence_addr;
> +	volatile uint64_t	*fence_addr;

[+Christian]

Is the volatile keyword really needed here? I just saw other patches 
removing volatile in some places because it's not sufficient, and not 
needed if you use memory barriers correctly.
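
To make the question concrete, here is a minimal userspace sketch of the
barrier-only approach, with fence_addr left as a plain pointer. Everything
prefixed sketch_/SKETCH_ is hypothetical: the macros are stand-ins for the
kernel's WRITE_ONCE(), READ_ONCE() and mb(), the fence values are
placeholders, and sketch_send_query_status() only imitates the device
writing the fence. It is not the driver code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-ins (assumptions) for WRITE_ONCE(), READ_ONCE(), mb(). */
#define SKETCH_WRITE_ONCE(p, v) (*(volatile uint64_t *)(p) = (v))
#define SKETCH_READ_ONCE(p)     (*(volatile const uint64_t *)(p))
#define sketch_mb()             atomic_thread_fence(memory_order_seq_cst)

/* Placeholder fence values, chosen only for this illustration. */
#define SKETCH_FENCE_INIT       10ULL
#define SKETCH_FENCE_COMPLETED  100ULL

struct sketch_dqm {
	uint64_t *fence_addr;	/* plain pointer, no volatile qualifier */
};

/*
 * Hypothetical stand-in for pm_send_query_status(): it pretends the
 * device has processed the packet and written the fence value.
 */
static void sketch_send_query_status(uint64_t *fence, uint64_t value)
{
	SKETCH_WRITE_ONCE(fence, value);
}

/* Reset the fence, order the reset before the submission, read it back. */
static uint64_t sketch_unmap_queues(struct sketch_dqm *dqm)
{
	assert(dqm && dqm->fence_addr);

	/* Single store that the compiler may not elide or tear. */
	SKETCH_WRITE_ONCE(dqm->fence_addr, SKETCH_FENCE_INIT);

	/* Full barrier: the reset is ordered before the query is sent. */
	sketch_mb();

	sketch_send_query_status(dqm->fence_addr, SKETCH_FENCE_COMPLETED);

	/* Marked read, so the compiler reloads the value from memory. */
	return SKETCH_READ_ONCE(dqm->fence_addr);
}
```

In this sketch the ordering comes from the barrier and the per-access
markers at the points that need them, so the struct member itself does not
have to be declared volatile.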

Regards,
   Felix


>   	struct kfd_mem_obj	*fence_mem;
>   	bool			active_runlist;
>   	int			sched_policy;

