[PATCH 9/9] drm/amdgpu: add graceful VM fault handling v2
Yang, Philip
Philip.Yang at amd.com
Mon Sep 9 13:58:03 UTC 2019
On 2019-09-09 8:03 a.m., Christian König wrote:
> Am 04.09.19 um 22:12 schrieb Yang, Philip:
>> This series looks nice and clear for me, two questions embedded below.
>>
>> Are we going to use dedicated sdma page queue for direct VM update path
>> during a fault?
>>
>> Thanks,
>> Philip
>>
>> On 2019-09-04 11:02 a.m., Christian König wrote:
>>> Next step towards HMM support. For now just silence the retry fault and
>>> optionally redirect the request to the dummy page.
>>>
>>> v2: make sure the VM is not destroyed while we handle the fault.
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 74
>>> ++++++++++++++++++++++++++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 +
>>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 4 ++
>>> 3 files changed, 80 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 951608fc1925..410d89966a66 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -3142,3 +3142,77 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm
>>> *vm)
>>> }
>>> }
>>> }
>>> +
>>> +/**
>>> + * amdgpu_vm_handle_fault - graceful handling of VM faults.
>>> + * @adev: amdgpu device pointer
>>> + * @pasid: PASID of the VM
>>> + * @addr: Address of the fault
>>> + *
>>> + * Try to gracefully handle a VM fault. Return true if the fault was
>>> handled and
>>> + * shouldn't be reported any more.
>>> + */
>>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int
>>> pasid,
>>> + uint64_t addr)
>>> +{
>>> + struct amdgpu_ring *ring = &adev->sdma.instance[0].page;
>>> + struct amdgpu_bo *root;
>>> + uint64_t value, flags;
>>> + struct amdgpu_vm *vm;
>>> + long r;
>>> +
>>> + if (!ring->sched.ready)
>>> + return false;
>>> +
>>> + spin_lock(&adev->vm_manager.pasid_lock);
>>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
>>> + if (vm)
>>> + root = amdgpu_bo_ref(vm->root.base.bo);
>>> + else
>>> + root = NULL;
>>> + spin_unlock(&adev->vm_manager.pasid_lock);
>>> +
>>> + if (!root)
>>> + return false;
>>> +
>>> + r = amdgpu_bo_reserve(root, true);
>>> + if (r)
>>> + goto error_unref;
>>> +
>>> + spin_lock(&adev->vm_manager.pasid_lock);
>>> + vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
>>> + spin_unlock(&adev->vm_manager.pasid_lock);
>>> +
>> Here get vm from pasid second time, and check if PD bo is changed, is
>> this to handle vm fault race with vm destory?
>
> Yes, exactly.
>
>>
>>> + if (!vm || vm->root.base.bo != root)
>>> + goto error_unlock;
>>> +
>>> + addr /= AMDGPU_GPU_PAGE_SIZE;
>>> + flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED |
>>> + AMDGPU_PTE_SYSTEM;
>>> +
>>> + if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) {
>>> + /* Redirect the access to the dummy page */
>>> + value = adev->dummy_page_addr;
>>> + flags |= AMDGPU_PTE_EXECUTABLE | AMDGPU_PTE_READABLE |
>>> + AMDGPU_PTE_WRITEABLE;
>>> + } else {
>>> + value = 0;
>>> + }
>>> +
>>> + r = amdgpu_vm_bo_update_mapping(adev, vm, true, NULL, addr, addr
>>> + 1,
>>> + flags, value, NULL, NULL);
>>> + if (r)
>>> + goto error_unlock;
>>> +
>> After fault address redirect to dummy page, will the fault recover and
>> retry continue to execute?
>
> Yes, the read/write operation will just retry and use the value from the
> dummy page instead.
>
>> Is this dangerous to update PTE to use system
>> memory address 0?
>
> What are you talking about? The dummy page is a page allocate by TTM
> where we redirect faulty accesses to.
>
For amdgpu_vm_fault_stop equals to AMDGPU_VM_FAULT_STOP_FIRST/ALWAYS
case, value is 0, this will redirect to system memory 0. Maybe redirect
is only needed for AMDGPU_VM_FAULT_STOP_NEVER?
Regards,
Philip
> Regards,
> Christian.
>
>>
>>> + r = amdgpu_vm_update_pdes(adev, vm, true);
>>> +
>>> +error_unlock:
>>> + amdgpu_bo_unreserve(root);
>>> + if (r < 0)
>>> + DRM_ERROR("Can't handle page fault (%ld)\n", r);
>>> +
>>> +error_unref:
>>> + amdgpu_bo_unref(&root);
>>> +
>>> + return false;
>>> +}
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> index 0a97dc839f3b..4dbbe1b6b413 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> @@ -413,6 +413,8 @@ void amdgpu_vm_check_compute_bug(struct
>>> amdgpu_device *adev);
>>> void amdgpu_vm_get_task_info(struct amdgpu_device *adev, unsigned
>>> int pasid,
>>> struct amdgpu_task_info *task_info);
>>> +bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int
>>> pasid,
>>> + uint64_t addr);
>>> void amdgpu_vm_set_task_info(struct amdgpu_vm *vm);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> index 9d15679df6e0..15a1ce51befa 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>> @@ -353,6 +353,10 @@ static int gmc_v9_0_process_interrupt(struct
>>> amdgpu_device *adev,
>>> }
>>> /* If it's the first fault for this address, process it
>>> normally */
>>> + if (retry_fault && !in_interrupt() &&
>>> + amdgpu_vm_handle_fault(adev, entry->pasid, addr))
>>> + return 1; /* This also prevents sending it to KFD */
>>> +
>>> if (!amdgpu_sriov_vf(adev)) {
>>> /*
>>> * Issue a dummy read to wait for the status register to
>>>
>
More information about the amd-gfx
mailing list