[PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram lost)

Fri Jul 19 12:48:32 UTC 2024

On 19.07.24 at 11:36, Yin, ZhenGuo (Chris) wrote:
> Hi, Christian
>
> Why would losing VRAM result in the VM entity becoming invalid?
>
> I think the entity vm->delayed is only set to an error state when a fence error occurs (for example, a pending SDMA job timed out or was cancelled).
>
> If a GPU reset is triggered by a GFX job and there is no SDMA job in the pending list, the entity won't be set to an error state.

Good point.

We could potentially change the check in amdgpu_vm_validate() to check 
the VM generation instead of calling drm_sched_entity_error().

But what you do in this patch here absolutely doesn't make any sense at all.

Regards,
Christian.
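
For illustration only, here is one possible reading of the suggestion above as a small userspace model. It is a sketch, not amdgpu code: the `toy_dev` and `toy_vm` types and the `toy_vm_generation()` and `toy_vm_validate()` helpers are hypothetical stand-ins, and the `entity_error` flag only mimics what drm_sched_entity_error() reports. The idea is that validation compares the stored generation against a freshly computed one, so a VRAM loss triggers a reset even when no scheduler entity carries an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the device and VM; not the amdgpu structures. */
struct toy_dev {
	uint32_t vram_lost_counter;	/* bumped whenever VRAM content is lost */
};

struct toy_vm {
	uint64_t generation;		/* last generation the VM was validated at */
	bool entity_error;		/* models an error on the VM's entity */
	unsigned int resets;		/* counts state machine resets in this model */
};

/* Compute the generation the VM should currently have. */
static uint64_t toy_vm_generation(const struct toy_dev *dev,
				  const struct toy_vm *vm)
{
	uint64_t result = (uint64_t)dev->vram_lost_counter << 32;

	/* Keep the per-VM part stored in the lower 32 bits. */
	result += (uint32_t)vm->generation;
	/* Add one if the page tables need to be re-generated. */
	if (vm->entity_error)
		++result;
	return result;
}

/* Returns true when the model had to reset the VM state machine. */
static bool toy_vm_validate(const struct toy_dev *dev, struct toy_vm *vm)
{
	uint64_t expected;

	assert(dev && vm);
	expected = toy_vm_generation(dev, vm);
	if (vm->generation == expected)
		return false;

	vm->generation = expected;
	vm->entity_error = false;	/* models re-initialising the entity */
	vm->resets++;
	return true;
}
```

In this model, bumping `vram_lost_counter` alone is enough for `toy_vm_validate()` to report a reset, which covers the case of a GPU reset caused by a GFX job with no pending SDMA job.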

>
> Best,
> Zhenguo
> Cloud-GPU Core team, SRDC
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Friday, July 19, 2024 5:22 PM
> To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
>
> On 19.07.24 at 11:19, ZhenGuo Yin wrote:
>> [Why]
>> The page tables of a compute VM in VRAM are lost after a GPU reset.
>> They are not restored, since a compute VM has no shadows.
>>
>> [How]
>> Use the upper 32 bits of vm->generation to record the vram_lost_counter.
>> Reset the VM state machine when that value differs from the device's
>> current vram_lost_counter.
> Mhm, that was my original approach as well, but we came to the conclusion that this shouldn't be necessary since losing VRAM would cause the entity to become invalid as well.
>
> Why is that necessary?
>
> Regards,
> Christian.
>
>> Signed-off-by: ZhenGuo Yin <zhenguo.yin at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++--
>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 3abfa66d72a2..fd7f912816dc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>        if (!vm)
>>                return result;
>>
>> -     result += vm->generation;
>> +     result += (vm->generation & 0xFFFFFFFF);
>>        /* Add one if the page tables will be re-generated on next CS */
>>        if (drm_sched_entity_error(&vm->delayed))
>>                ++result;
>> @@ -467,6 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>        struct amdgpu_bo *shadow;
>>        struct amdgpu_bo *bo;
>>        int r;
>> +     uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>> +
>> +     if ((vm->generation >> 32) != vram_lost_counter) {
>> +             amdgpu_vm_bo_reset_state_machine(vm);
>> +             vm->generation = (u64)vram_lost_counter << 32 | (vm->generation & 0xFFFFFFFF);
>> +     }
>>
>>        if (drm_sched_entity_error(&vm->delayed)) {
>>                ++vm->generation;
>> @@ -2439,7 +2445,7 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>        vm->last_update = dma_fence_get_stub();
>>        vm->last_unlocked = dma_fence_get_stub();
>>        vm->last_tlb_flush = dma_fence_get_stub();
>> -     vm->generation = 0;
>> +     vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>>
>>        mutex_init(&vm->eviction_lock);
>>        vm->evicting = false;
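
The bit layout the quoted patch relies on can be sketched in isolation as a userspace model. This is only an illustration under stated assumptions: `vm_model`, `gen_pack()`, `gen_vram_lost()`, `gen_local()`, `vm_model_init()` and `vm_model_validate()` are hypothetical names, and the `resets` field merely stands in for the call to amdgpu_vm_bo_reset_state_machine().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a VM; not the amdgpu structure. */
struct vm_model {
	uint64_t generation;	/* upper 32 bits: VRAM-lost snapshot,
				 * lower 32 bits: per-VM generation count */
	unsigned int resets;	/* counts state machine resets in this model */
};

/* Extract the VRAM-lost snapshot from the upper 32 bits. */
static uint32_t gen_vram_lost(uint64_t generation)
{
	return (uint32_t)(generation >> 32);
}

/* Extract the per-VM count from the lower 32 bits. */
static uint32_t gen_local(uint64_t generation)
{
	return (uint32_t)(generation & 0xFFFFFFFFu);
}

/* Combine both halves into one 64-bit generation value. */
static uint64_t gen_pack(uint32_t vram_lost, uint32_t local)
{
	return ((uint64_t)vram_lost << 32) | local;
}

/* Snapshot the device counter at VM creation time. */
static void vm_model_init(struct vm_model *vm, uint32_t dev_vram_lost)
{
	assert(vm);
	vm->generation = gen_pack(dev_vram_lost, 0);
	vm->resets = 0;
}

/* Returns true when the snapshot was stale and the model reset the VM. */
static bool vm_model_validate(struct vm_model *vm, uint32_t dev_vram_lost)
{
	assert(vm);
	if (gen_vram_lost(vm->generation) == dev_vram_lost)
		return false;

	vm->resets++;
	/* Refresh the snapshot but keep the per-VM count untouched. */
	vm->generation = gen_pack(dev_vram_lost, gen_local(vm->generation));
	return true;
}
```

A caller would run `vm_model_init()` once with the device counter, then pass the current counter to `vm_model_validate()` on each validation. A reset is reported only when the upper half is out of date, and the lower half survives the update.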