[PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
Yin, ZhenGuo (Chris)
ZhenGuo.Yin at amd.com
Tue Jul 23 09:04:51 UTC 2024
[AMD Official Use Only - AMD Internal Distribution Only]
Hi Christian
I prepared this patch because we met a page fault after gpu reset in SRIOV triggered by quark.
After investigation, I found that the page table did not get restored after gpu reset.
I just tried to use vm_update_mode=0 to disable VM update by CPU, and this issue still exists.
-[Christian] When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.
I believe that you are referring this piece of code in function amdgpu_job_run():
/* Skip job if VRAM is lost and never resubmit gangs */
if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
(job->job_run_counter && job->gang_submit))
dma_fence_set_error(finished, -ECANCELED);
I agree that the submission from the old ctx will be set as ECANCELED and be skipped.
But if the application then creates a new ctx and submit a new job, both new_ctx->generation and new_job->generation will be initiated as the updated one.
ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
(*job)->generation = amdgpu_vm_generation(adev, vm);
And the job will be executed, hence there will be a page fault due to VRAM lost.
Please correct me if I have some misunderstanding.
I still can not see why any submission using the VM entity will fail due to VRAM lost.
Best,
Zhenguo
Cloud-GPU Core team, SRDC
-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com>
Sent: Tuesday, July 23, 2024 3:13 PM
To: Yin, ZhenGuo (Chris) <ZhenGuo.Yin at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Koenig, Christian <Christian.Koenig at amd.com>
Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram lost)
Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
> [Why]
> Page table of compute VM in the VRAM will lost after gpu reset.
> VRAM won't be restored since compute VM has no shadows.
>
> [How]
> Use higher 32-bit of vm->generation to record a vram_lost_counter.
> Reset the VM state machine when vm->genertaion is not equal to
> re-generation token.
>
> v2: Check vm->generation instead of calling drm_sched_entity_error in
> amdgpu_vm_validate.
I've just double checked the logic again and as far as I can see this patch here is still completely superfluous.
When VRAM is lost any submission using the VM entity will fail and so result in a new page table generation.
What isn't handled are CPU based page table updates, but for those we could potentially change the condition inside the CPU page tables code.
Regards,
Christian.
>
> Signed-off-by: ZhenGuo Yin <zhenguo.yin at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 3abfa66d72a2..9e2f84c166e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct amdgpu_vm *vm)
> if (!vm)
> return result;
>
> - result += vm->generation;
> + result += (vm->generation & 0xFFFFFFFF);
> /* Add one if the page tables will be re-generated on next CS */
> if (drm_sched_entity_error(&vm->delayed))
> ++result;
> @@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> struct amdgpu_bo *shadow;
> struct amdgpu_bo *bo;
> int r;
> + uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>
> - if (drm_sched_entity_error(&vm->delayed)) {
> - ++vm->generation;
> + if (vm->generation != amdgpu_vm_generation(adev, vm)) {
> + if (drm_sched_entity_error(&vm->delayed))
> + ++vm->generation;
> + vm->generation = (u64)vram_lost_counter << 32 | (vm->generation &
> +0xFFFFFFFF);
> amdgpu_vm_bo_reset_state_machine(vm);
> amdgpu_vm_fini_entities(vm);
> r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int
> amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
> vm->last_update = dma_fence_get_stub();
> vm->last_unlocked = dma_fence_get_stub();
> vm->last_tlb_flush = dma_fence_get_stub();
> - vm->generation = 0;
> + vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;
>
> mutex_init(&vm->eviction_lock);
> vm->evicting = false;
More information about the amd-gfx
mailing list