[PATCH] drm/amdkfd: Fix some kfd related recover issues
Philip Yang
yangp at amd.com
Thu Apr 17 14:14:10 UTC 2025
On 2025-03-21 19:35, Deng, Emily wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Friday, March 21, 2025 7:06 PM
>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdkfd: Fix some kfd related recover issues
>>
>>
>>
>> On 3/21/2025 4:22 PM, Emily Deng wrote:
>>> It is necessary to check whether kq has been initialized correctly in
>>> kq_acquire_packet_buffer. Otherwise memory corruption can occur during
>>> recovery, because recovery uninitializes kq first.
>>>
>>> The TLB also needs to be flushed after a successful recovery, because
>>> BOs may have been created and mapped during the recovery.
>> Is this related to any specific type of 'reset'? For mode-1/mode-2 type
>> resets, the expectation is that GC as a whole is reset, which includes
>> the GPU VM block.
>>
>> Thanks,
>> Lijo
> This is an FLR (Function Level Reset) for SR-IOV, but for both mode-1 and
> mode-2 resets the GPU VM block is reset as well. In this case, a KFD user
> mode buffer object (BO) was mapped prior to the reset, so after the reset
> the page table cache also needs to be invalidated for the corresponding
> PASID.
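(For reference, per process/device pair the proposed invalidation distills
to the following; the calls are taken from the patch below, while the
wrapper name is hypothetical:)

	/* Sketch, assuming amdkfd internals (kfd_priv.h et al.): force a
	 * TLB flush for one process/device pair so that translations
	 * cached for its PASID before the reset are dropped.
	 * TLB_FLUSH_LEGACY requests the heavyweight, non-range flush.
	 */
	static void kfd_invalidate_pasid_after_reset(struct kfd_process_device *pdd)
	{
		struct amdgpu_vm *vm = drm_priv_to_vm(pdd->drm_priv);

		if (!vm)
			return;

		atomic64_inc(&vm->tlb_seq);		/* mark cached translations stale */
		kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);	/* flush for pdd's PASID */
	}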
This is a misunderstanding. KFD processes are aborted before the reset,
and after GPU recovery any outstanding KFD BO free/clear/unmap operations
are executed (the jobs are created while the GPU scheduler is paused
during the reset). A new KFD process will then flush the TLB when it maps
a queue.
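That flush on queue map is keyed off the VM's TLB sequence number. A
minimal sketch of the check in kfd_flush_tlb() (kfd_process.c); treat
exact field names such as last_flushed_tlb_seq as approximate:

	/* Sketch: kfd_flush_tlb() skips the flush unless the VM's page
	 * tables changed since the last flush issued for this pdd. This
	 * is also why the patch below bumps vm->tlb_seq by hand -- it
	 * forces the next flush through the sequence check.
	 */
	void kfd_flush_tlb_sketch(struct kfd_process_device *pdd,
				  enum TLB_FLUSH_TYPE type)
	{
		struct amdgpu_vm *vm = drm_priv_to_vm(pdd->drm_priv);
		uint64_t tlb_seq = amdgpu_vm_tlb_seq(vm); /* reads vm->tlb_seq */

		if (pdd->last_flushed_tlb_seq == tlb_seq)
			return;		/* nothing new to invalidate */

		pdd->last_flushed_tlb_seq = tlb_seq;
		/* ... issue the per-PASID invalidation through amdgpu,
		 * using type to select the flush flavor ...
		 */
	}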
During recovery the driver should not touch the hardware. Can you provide
details of the memory corruption seen during the reset?
Regards,
Philip
>
>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 +
>>>  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c |  4 ++++
>>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  2 +-
>>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 22 +++++++++++++++++++
>>>  4 files changed, 28 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> index b9c82be6ce13..eb2df5842618 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>> @@ -1000,6 +1000,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>>  		return 0;
>>>
>>>  	for (i = 0; i < kfd->num_nodes; i++) {
>>> +		kfd_flush_all_processes(kfd->nodes[i]);
>>>  		ret = kfd_resume(kfd->nodes[i]);
>>>  		if (ret)
>>>  			return ret;
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>> index 2b0a830f5b29..5e4ae969818e 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>>> @@ -238,6 +238,10 @@ int kq_acquire_packet_buffer(struct kernel_queue *kq,
>>>  	uint64_t wptr64;
>>>  	unsigned int *queue_address;
>>>
>>> +	if (!kq) {
>>> +		pr_debug("kq has not been initialized\n");
>>> +		goto err_no_space;
>>> +	}
>>>  	/* When rptr == wptr, the buffer is empty.
>>>  	 * When rptr == wptr + 1, the buffer is full.
>>>  	 * It is always rptr that advances to the position of wptr, rather than
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> index f6aedf69c644..6c073ead2b06 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>> @@ -1059,7 +1059,7 @@ int kfd_process_evict_queues(struct kfd_process *p, uint32_t trigger);
>>>  int kfd_process_restore_queues(struct kfd_process *p);
>>>  void kfd_suspend_all_processes(void);
>>>  int kfd_resume_all_processes(void);
>>> -
>>> +void kfd_flush_all_processes(struct kfd_node *node);
>>>  struct kfd_process_device *kfd_process_device_data_by_id(struct kfd_process *process,
>>>  							uint32_t gpu_id);
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> index 7c0c24732481..4ed03359020b 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>>> @@ -2110,6 +2110,28 @@ int kfd_resume_all_processes(void)
>>>  	return ret;
>>>  }
>>>
>>> +void kfd_flush_all_processes(struct kfd_node *node)
>>> +{
>>> +	struct kfd_process *p;
>>> +	struct kfd_process_device *pdd;
>>> +	unsigned int temp;
>>> +	int idx = srcu_read_lock(&kfd_processes_srcu);
>>> +	struct amdgpu_vm *vm;
>>> +
>>> +	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>>> +		pdd = kfd_get_process_device_data(node, p);
>>> +		if (!pdd)
>>> +			continue;
>>> +		vm = drm_priv_to_vm(pdd->drm_priv);
>>> +		if (!vm)
>>> +			continue;
>>> +		atomic64_inc(&vm->tlb_seq);
>>> +		kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);
>>> +	}
>>> +	srcu_read_unlock(&kfd_processes_srcu, idx);
>>> +
>>> +}
>>> +
>>>  int kfd_reserved_mem_mmap(struct kfd_node *dev, struct kfd_process *process,
>>>  			  struct vm_area_struct *vma)
>>>  {
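A side note on the kq_acquire_packet_buffer() hunk: the new !kq guard
falls through to the existing err_no_space path, which (in the current
kfd_kernel_queue.c) stores NULL through the caller's buffer pointer and
returns -ENOMEM, so callers already handle this shape of failure. A
sketch of that caller pattern (the function name and size_dw are
hypothetical; cf. the packet managers in kfd_packet_manager.c):

	/* Sketch: consuming kq_acquire_packet_buffer(). On failure the
	 * call stores NULL through buffer_ptr and returns -ENOMEM, which
	 * with this patch also covers a kq torn down during recovery.
	 */
	static int pm_build_packet_sketch(struct kernel_queue *kq, size_t size_dw)
	{
		unsigned int *buffer;
		int retval;

		retval = kq_acquire_packet_buffer(kq, size_dw, &buffer);
		if (retval || !buffer)
			return -ENOMEM;	/* ring full, or kq not initialized */

		/* ... write size_dw dwords of packet payload into buffer,
		 * then ring the doorbell ...
		 */
		kq_submit_packet(kq);
		return 0;
	}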