TDR and VRAM lost handling in KMD (v2)

Thu Oct 12 08:43:31 UTC 2017

Thanks Monk for the summary!

Hi Nicolai,
In order to block the usage of new context reference the old allocation, i think we need to do something in UMD so that KMD don’t need to monitor the resource list.
I want to make sure we are on the same page.
If you agree, then there might have two options to do that in UMD: (You can do whatever you want, i just want to elaborate the idea a little bit to facilitate the discussion).
-  If sharelist is valid, driver need to compare the current vram_lost_count and share list’s vram_lost count, The context will fail to create if share list created before the reset.
- Or, we can copy the vram_lost count from the sharelist, kernel will fail the submission if the vram_lost count is smaller than current one.
I personally want to go first for OrcaGL.

Thanks.
Best Regards,
David
-
On 12 Oct 2017, at 4:03 PM, Liu, Monk <Monk.Liu at amd.com<mailto:Monk.Liu at amd.com>> wrote:

V2 summary

Hi team

please give your comments

•  When a job timed out (set from lockup_timeout kernel parameter), What KMD should do in TDR routine :

1.        Update adev->gpu_reset_counter, and stop scheduler first
2.        Set its fence error status to “ECANCELED”,
3.        Find the context behind this job, and set this context as “guilty” (will have a new member field in context structure – bool guilty)
a)       There will be “bool * guilty” in entity structure, which points to its father context’s member – “bool guilty” when context initialized , so no matter we get context or entity, we always know if it is “guilty”
b)       For kernel entity that used for VM updates, there is no context back it, so kernel entity’s “bool *guilty” always “NULL”.
c)        The idea to skip the whole context is for consistence consideration, because we’ll fake signal the hang job in job_run(), so all jobs in its context shall be dropped otherwise either bad drawing/computing results or more GPU hang.

4.        Do GPU reset, which is can be some callbacks to let bare-metal and SR-IOV implement with their favor style
5.        After reset, KMD need to aware if the VRAM lost happens or not, bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from GIM side (for initial version we consider it’s always VRAM lost, till GIM side change aligned)
6.        If VRAM lost hit, update adev->vram_lost_counter.
7.        Do GTT recovery and shadow buffer recovery.
8.        Re-schedule all JOBs in mirror list and restart scheduler

•  For GPU scheduler function --- job_run()
1.        Before schedule a job to ring, checks if job->vram_lost_counter == adev->vram_lost_counter, and drop this job if mismatch
2.        Before schedule a job to ring, checks if job->entity->guilty is NULL or not, and drop this job if (guilty!=NULL && *guilty == TRUE)
3.        if a job is dropped:
a)       set job’s sched_fence status to “ECANCELED”
b)       fake/force signal job’s hw fence (no need to set hw fence’s status)

•  For cs_wait() IOCTL:
After it found fence signaled, it should check if there is error on this fence and return the error status of this fence

•  For cs_wait_fences() IOCTL:
Similar with above approach

•  For cs_submit() IOCTL:
1.        check if current ctx been marked “guilty” and return “ECANCELED”  if so.
2.        set job->vram_lost_counter with adev->vram_lost_counter, and return “ECANCELED” if ctx->vram_lost_counter != job->vram_lost_counter(Christian already submitted this patch)
a)       discussion: can we return “ENODEV” if vram_lost_counter mismatch ? that way UMD know this context is under “device lost”

•  Introduce a new IOCTL to let UMD query latest adev->vram_lost_counter:

•  For amdgpu_ctx_query():
•  Don’t update ctx->reset_counter when querying this function, otherwise the query result is not consistent
•  Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx is “guilty”, no need to check “ctx->reset_counter”
•  Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET”if the ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter
•  Set out->state.reset_status to “AMDGPU_CTX_NO_RESET” if ctx->reset_counter == adev->reset_counter
•  Set out->state.flags to “AMDGPU_CTX_FLAG_VRAM_LOST” if ctx->vram_lost_counter != adev->vram_lost_counter
•  discussion: can we return “ENODEV” for amdgpu_ctx_query() if ctx->vram_lost_counter != adev->vram_lost_counter ? that way UMD know this context is under “device lost”
•  UMD shall release this context if it is AMDGPU_CTX_GUILTY_RESET or its flags is “AMDGPU_CTX_FLAG_VRAM_LOST”

For UMD behavior we still have something need to consider:
If MESA creates a new context from an old context (share list?? I’m not familiar with UMD , David Mao shall have some discuss on it with Nicolai), the new created context’s vram_lost_counter
And reset_counter shall all be ported from that old context , otherwise CS_SUBMIT will not block it which isn’t correct

Need your feedback, thx

From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 2017年10月11日 13:34
To: Koenig, Christian <Christian.Koenig at amd.com<mailto:Christian.Koenig at amd.com>>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com<mailto:Nicolai.Haehnle at amd.com>>; Olsak, Marek <Marek.Olsak at amd.com<mailto:Marek.Olsak at amd.com>>; Deucher, Alexander <Alexander.Deucher at amd.com<mailto:Alexander.Deucher at amd.com>>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com<mailto:Alejandro.Ramirez at amd.com>>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario <Mario.Filipas at amd.com<mailto:Mario.Filipas at amd.com>>; Ding, Pixel <Pixel.Ding at amd.com<mailto:Pixel.Ding at amd.com>>; Li, Bingley <Bingley.Li at amd.com<mailto:Bingley.Li at amd.com>>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com<mailto:Jerry.Jiang at amd.com>>
Subject: TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to achieve some agreements on what should MESA/UMD do and what should KMD do, please give your comments with “okay” or “No” and your idea on below items,

•  When a job timed out (set from lockup_timeout kernel parameter), What KMD should do in TDR routine :

1.        Update adev->gpu_reset_counter, and stop scheduler first, (gpu_reset_counter is used to force vm flush after GPU reset, out of this thread’s scope so no more discussion on it)
2.        Set its fence error status to “ETIME”,
3.        Find the entity/ctx behind this job, and set this ctx as “guilty”
4.        Kick out this job from scheduler’s mirror list, so this job won’t get re-scheduled to ring anymore.
5.        Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all their fence status to “ECANCELED”
6.       Force signal all fences that get kicked out by above two steps, otherwise UMD will block forever if waiting on those fences
7.        Do gpu reset, which is can be some callbacks to let bare-metal and SR-IOV implement with their favor style
8.        After reset, KMD need to aware if the VRAM lost happens or not, bare-metal can implement some function to judge, while for SR-IOV I prefer to read it from GIM side (for initial version we consider it’s always VRAM lost, till GIM side change aligned)
9.        If VRAM lost not hit, continue, otherwise:
a)       Update adev->vram_lost_counter,
b)       Iterate over all living ctx, and set all ctx as “guilty” since VRAM lost actually ruins all VRAM contents
c)        Kick out all jobs in all ctx’s KFIFO queue, and set all their fence status to “ECANCELDED”
10.     Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it ???)
11.     Re-schedule all JOBs remains in mirror list to ring again and restart scheduler (for VRAM lost case, no JOB will re-scheduled)

•  For cs_wait() IOCTL:
After it found fence signaled, it should check with “dma_fence_get_status” to see if there is error there,
And return the error status of fence

•  For cs_wait_fences() IOCTL:
Similar with above approach

•  For cs_submit() IOCTL:
It need to check if current ctx been marked as “guilty” and return “ECANCELED” if so

•  Introduce a new IOCTL to let UMD query vram_lost_counter:
This way, UMD can also block app from submitting, like @Nicolai mentioned, we can cache one copy of vram_lost_counter when enumerate physical device, and deny all
gl-context from submitting if the counter queried bigger than that one cached in physical device. (looks a little overkill to me, but easy to implement )
UMD can also return error to APP when creating gl-context if found current queried vram_lost_counter bigger than that one cached in physical device.

BTW: I realized that gl-context is a little different with kernel’s context. Because for kernel. BO is not related with context but only with FD, while in UMD, BO have a backend
gl-context, so block submitting in UMD layer is also needed although KMD will do its job as bottom line

•  Basically “vram_lost_counter” is exposure by kernel to let UMD take the control of robust extension feature, it will be UMD’s call to move, KMD only deny “guilty” context from submitting

Need your feedback, thx

We’d better make TDR feature landed ASAP

BR Monk

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171012/eb326986/attachment-0001.html>