[PATCH] drm/amdkfd: Fix some kfd related recover issues

Fri Mar 21 23:35:18 UTC 2025

[AMD Official Use Only - AMD Internal Distribution Only]

>-----Original Message-----
>From: Lazar, Lijo <Lijo.Lazar at amd.com>
>Sent: Friday, March 21, 2025 7:06 PM
>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>Subject: Re: [PATCH] drm/amdkfd: Fix some kfd related recover issues
>
>
>
>On 3/21/2025 4:22 PM, Emily Deng wrote:
>> kq_acquire_packet_buffer needs to check whether kq has been initialized
>> correctly. Otherwise it will hit memory corruption during recovery,
>> because recovery uninitializes kq first.
>>
>> The TLB also needs to be flushed after a successful recovery, because
>> BOs may have been created and mapped during recovery.
>
>Is this related to any specific type of 'reset'? For mode-2/mode-1 resets, the
>expectation is that GC as a whole is reset, which includes the GPU VM block.
>
>Thanks,
>Lijo

This is an FLR (Function Level Reset) for SR-IOV, but the same applies to mode-1 and mode-2 resets, where the GPU VM block is reset as well. If a KFD user mode buffer object (BO) was mapped before the reset, the page table cache for the corresponding PASID also needs to be invalidated after the reset.
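
To illustrate why the patch bumps vm->tlb_seq before calling kfd_flush_tlb(), here is a minimal userspace model. It assumes that kfd_flush_tlb() skips the hardware invalidation when the per-device cached sequence already equals the VM's sequence, which is my reading of the current code and not something stated in this thread. The model_* names are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct amdgpu_vm: only the sequence counter. */
struct model_vm {
	atomic_uint_fast64_t tlb_seq;
};

/* Hypothetical stand-in for struct kfd_process_device. */
struct model_pdd {
	struct model_vm *vm;
	atomic_uint_fast64_t tlb_seq;	/* sequence of the last flush */
	unsigned int flushes;		/* simulated invalidations issued */
};

/*
 * Assumed behaviour of the flush helper: invalidate only when the VM's
 * sequence has moved since the last flush. Returns true if a (simulated)
 * invalidation was issued.
 */
static bool model_flush_tlb(struct model_pdd *pdd)
{
	uint_fast64_t seq;

	assert(pdd && pdd->vm);
	seq = atomic_load(&pdd->vm->tlb_seq);
	if (atomic_exchange(&pdd->tlb_seq, seq) == seq)
		return false;	/* nothing changed, flush is skipped */

	pdd->flushes++;
	return true;
}

/* Mirrors the loop body in the patch: bump the sequence, then flush. */
static bool model_flush_after_reset(struct model_pdd *pdd)
{
	assert(pdd && pdd->vm);
	atomic_fetch_add(&pdd->vm->tlb_seq, 1);
	return model_flush_tlb(pdd);
}
```

In this model, calling model_flush_tlb() on a device whose cached sequence matches the VM's does nothing, while model_flush_after_reset() always issues one invalidation. Under the stated assumption, that is the reason for the atomic64_inc() in the patch.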

>>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_device.c       |  1 +
>>  drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c |  4 ++++
>>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  2 +-
>>  drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 22 +++++++++++++++++++
>>  4 files changed, 28 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> index b9c82be6ce13..eb2df5842618 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>> @@ -1000,6 +1000,7 @@ int kgd2kfd_post_reset(struct kfd_dev *kfd)
>>              return 0;
>>
>>      for (i = 0; i < kfd->num_nodes; i++) {
>> +            kfd_flush_all_processes(kfd->nodes[i]);
>>              ret = kfd_resume(kfd->nodes[i]);
>>              if (ret)
>>                      return ret;
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>> index 2b0a830f5b29..5e4ae969818e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue.c
>> @@ -238,6 +238,10 @@ int kq_acquire_packet_buffer(struct kernel_queue *kq,
>>      uint64_t wptr64;
>>      unsigned int *queue_address;
>>
>> +    if (!kq) {
>> +            pr_debug("kq has not been initialized\n");
>> +            goto err_no_space;
>> +    }
>>      /* When rptr == wptr, the buffer is empty.
>>       * When rptr == wptr + 1, the buffer is full.
>>       * It is always rptr that advances to the position of wptr, rather than
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index f6aedf69c644..6c073ead2b06 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1059,7 +1059,7 @@ int kfd_process_evict_queues(struct kfd_process *p, uint32_t trigger);
>>  int kfd_process_restore_queues(struct kfd_process *p);
>>  void kfd_suspend_all_processes(void);
>>  int kfd_resume_all_processes(void);
>> -
>> +void kfd_flush_all_processes(struct kfd_node *node);
>>  struct kfd_process_device *kfd_process_device_data_by_id(struct kfd_process *process,
>>                                                       uint32_t gpu_id);
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> index 7c0c24732481..4ed03359020b 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
>> @@ -2110,6 +2110,28 @@ int kfd_resume_all_processes(void)
>>      return ret;
>>  }
>>
>> +void kfd_flush_all_processes(struct kfd_node *node)
>> +{
>> +    struct kfd_process *p;
>> +    struct kfd_process_device *pdd;
>> +    unsigned int temp;
>> +    int idx = srcu_read_lock(&kfd_processes_srcu);
>> +    struct amdgpu_vm *vm;
>> +
>> +    hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
>> +            pdd = kfd_get_process_device_data(node, p);
>> +            if (!pdd)
>> +                    continue;
>> +            vm = drm_priv_to_vm(pdd->drm_priv);
>> +            if (!vm)
>> +                    continue;
>> +            atomic64_inc(&vm->tlb_seq);
>> +            kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);
>> +    }
>> +    srcu_read_unlock(&kfd_processes_srcu, idx);
>> +
>> +}
>> +
>>  int kfd_reserved_mem_mmap(struct kfd_node *dev, struct kfd_process *process,
>>                        struct vm_area_struct *vma)
>>  {