[PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

Christian König ckoenig.leichtzumerken at gmail.com
Mon Oct 21 08:12:41 UTC 2024


Am 18.10.24 um 23:59 schrieb Philip Yang:
> On 2024-10-18 14:28, Felix Kuehling wrote:
>>
>> On 2024-10-17 04:34, Victor Zhao wrote:
>>> make sure KFD_FENCE_INIT write to fence_addr before 
>>> pm_send_query_status
>>> called, to avoid qcm fence timeout caused by incorrect ordering.
>>>
>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
>>>   drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
>>>   2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> index b2b16a812e73..d9264a353775 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> @@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct 
>>> device_queue_manager *dqm,
>>>           goto out;
>>>         *dqm->fence_addr = KFD_FENCE_INIT;
>>> +    mb();
>>>       pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
>>>                   KFD_FENCE_COMPLETED);
>>>       /* should be timed out */
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> index 09ab36f8e8c6..bddb169bb301 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> @@ -260,7 +260,7 @@ struct device_queue_manager {
>>>       uint16_t        vmid_pasid[VMID_NUM];
>>>       uint64_t        pipelines_addr;
>>>       uint64_t        fence_gpu_addr;
>>> -    uint64_t        *fence_addr;
>>> +    volatile uint64_t    *fence_addr;
>>
>> [+Christian]
>>
>> Is the volatile keyword really needed here? I just saw other patches 
>> removing volatile in some places because it's not sufficient, and not 
>> needed if you use memory barriers correctly.
>
> After reading kernel memory barrier document and below link, I think 
> we need both volatile type and memory barrier, to guarantee F/W get 
> the updated fence value. This fixes an CP hang issue on SRIOV.
>
> https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads.
>

No, that isn't correct. Using volatile is considered harmful and almost 
never correct, see here 
https://docs.kernel.org/process/volatile-considered-harmful.html

Placing appropriate memory barriers must be sufficient or otherwise 
there is a rather bad platform or compiler bug lurking around.

Regards,
Christian.

> Regards,
>
> Philip
>
>>
>> Regards,
>>   Felix
>>
>>
>>>       struct kfd_mem_obj    *fence_mem;
>>>       bool            active_runlist;
>>>       int            sched_policy;



More information about the amd-gfx mailing list