[PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr
Christian König
ckoenig.leichtzumerken at gmail.com
Mon Oct 21 08:12:41 UTC 2024
Am 18.10.24 um 23:59 schrieb Philip Yang:
> On 2024-10-18 14:28, Felix Kuehling wrote:
>>
>> On 2024-10-17 04:34, Victor Zhao wrote:
>>> make sure KFD_FENCE_INIT write to fence_addr before
>>> pm_send_query_status
>>> called, to avoid qcm fence timeout caused by incorrect ordering.
>>>
>>> Signed-off-by: Victor Zhao <Victor.Zhao at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
>>> drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
>>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> index b2b16a812e73..d9264a353775 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>>> @@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct
>>> device_queue_manager *dqm,
>>> goto out;
>>> *dqm->fence_addr = KFD_FENCE_INIT;
>>> + mb();
>>> pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
>>> KFD_FENCE_COMPLETED);
>>> /* should be timed out */
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> index 09ab36f8e8c6..bddb169bb301 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
>>> @@ -260,7 +260,7 @@ struct device_queue_manager {
>>> uint16_t vmid_pasid[VMID_NUM];
>>> uint64_t pipelines_addr;
>>> uint64_t fence_gpu_addr;
>>> - uint64_t *fence_addr;
>>> + volatile uint64_t *fence_addr;
>>
>> [+Christian]
>>
>> Is the volatile keyword really needed here? I just saw other patches
>> removing volatile in some places because it's not sufficient, and not
>> needed if you use memory barriers correctly.
>
> After reading kernel memory barrier document and below link, I think
> we need both volatile type and memory barrier, to guarantee F/W get
> the updated fence value. This fixes an CP hang issue on SRIOV.
>
> https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads.
>
No, that isn't correct. Using volatile is considered harmful and almost
never correct, see here
https://docs.kernel.org/process/volatile-considered-harmful.html
Placing appropriate memory barriers must be sufficient or otherwise
there is a rather bad platform or compiler bug lurking around.
Regards,
Christian.
> Regards,
>
> Philip
>
>>
>> Regards,
>> Felix
>>
>>
>>> struct kfd_mem_obj *fence_mem;
>>> bool active_runlist;
>>> int sched_policy;
More information about the amd-gfx
mailing list