TDR and VRAM lost handling in KMD (v2)

Thu Oct 12 08:41:52 UTC 2017

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.

Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:
> lFor cs_submit() IOCTL:
> 
> 1.check if current ctx been marked “*guilty*”and return “*ECANCELED*” 
>   if so.
> 
> 2.set job->*vram_lost_counter* with adev->*vram_lost_counter*, and 
> return “*ECANCELED*” if ctx->*vram_lost_counter* != 
> job->*vram_lost_counter* (Christian already submitted this patch)
> 
> a)discussion: can we return “ENODEV” if vram_lost_counter mismatch ? 
> that way UMD know this context is under “device lost”

My plan for UMD is to always query the VRAM lost counter when any kind 
of context lost situation is detected. So cs_submit() should return an 
error in this situation, but it could just be ECANCELED. We don't need 
to distinguish between different types of errors here.

> lIntroduce a new IOCTL to let UMD query latest adev->*vram_lost_counter*:

Christian already sent a patch for this.

> lFor amdgpu_ctx_query():
> 
> n*Don’t update ctx->reset_counter when querying this function, otherwise 
> the query result is not consistent *

Hmm. I misremembered part of the spec, see below.

> nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx is 
> “*guilty*”, no need to check “ctx->reset_counter”

Agreed.

> nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if the ctx 
> isn’t “guilty” && ctx->reset_counter != adev->reset_counter *

I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works in 
Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
that the context was lost.

However, !guilty && ctx->reset_counter != adev->reset_counter does not 
imply that the context was lost.

The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.

As far as I understand it, the case of !guilty && ctx->reset_counter != 
adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter 
should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it 
didn't affect our context.

I unfortunately noticed another subtlety while re-reading the OpenGL 
spec. OpenGL says that the OpenGL context itself does *not* have to be 
re-created in order to recover from the reset. Re-creating all objects 
in the context is sufficient.

I believe this is the original motivation for why amdgpu_ctx_query() 
will reset the ctx->reset_counter.

For Mesa, it's still okay if the kernel keeps blocking submissions as we 
can just recreate the kernel context. But OrcaGL is also affected.

Does anybody know off-hand where the relevant parts of the Vulkan spec 
are? I didn't actually find anything in a quick search.

[snip]
> For UMD behavior we still have something need to consider:
> 
> If MESA creates a new context from an old context (share list?? I’m not 
> familiar with UMD , David Mao shall have some discuss on it with 
> Nicolai), the new created context’s vram_lost_counter
> 
> And reset_counter shall all be ported from that old context , otherwise 
> CS_SUBMIT will not block it which isn’t correct

The kernel doesn't have to do anything for this, it is entirely the 
UMD's responsibility. All UMD needs from KMD is the function for 
querying the vram_lost_counter.

Cheers,
Nicolai

> 
> Need your feedback, thx
> 
> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On Behalf 
> Of *Liu, Monk
> *Sent:* 2017年10月11日13:34
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai 
> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, 
> Alexander <Alexander.Deucher at amd.com>
> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; 
> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> *Subject:* TDR and VRAM lost handling in KMD:
> 
> Hi Christian & Nicolai,
> 
> We need to achieve some agreements on what should MESA/UMD do and what 
> should KMD do, *please give your comments with **“okay”or “No”and your 
> idea on below items,*
> 
> lWhen a job timed out (set from lockup_timeout kernel parameter), What 
> KMD should do in TDR routine :
> 
> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, 
> (*gpu_reset_counter* is used to force vm flush after GPU reset, out of 
> this thread’s scope so no more discussion on it)
> 
> 2.Set its fence error status to “*ETIME*”,
> 
> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
> 
> 4.Kick out this job from scheduler’s mirror list, so this job won’t get 
> re-scheduled to ring anymore.
> 
> 5.Kick out all jobs in this “guilty”ctx’s KFIFO queue, and set all their 
> fence status to “*ECANCELED*”
> 
> *6.*Force signal all fences that get kicked out by above two 
> steps,*otherwise UMD will block forever if waiting on those fences*
> 
> 7.Do gpu reset, which is can be some callbacks to let bare-metal and 
> SR-IOV implement with their favor style
> 
> 8.After reset, KMD need to aware if the VRAM lost happens or not, 
> bare-metal can implement some function to judge, while for SR-IOV I 
> prefer to read it from GIM side (for initial version we consider it’s 
> always VRAM lost, till GIM side change aligned)
> 
> 9.If VRAM lost not hit, continue, otherwise:
> 
> a)Update adev->*vram_lost_counter*,
> 
> b)Iterate over all living ctx, and set all ctx as “*guilty*”since VRAM 
> lost actually ruins all VRAM contents
> 
> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence 
> status to “*ECANCELDED*”
> 
> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, do 
> we need it ???)
> 
> 11.Re-schedule all JOBs remains in mirror list to ring again and restart 
> scheduler (for VRAM lost case, no JOB will re-scheduled)
> 
> lFor cs_wait() IOCTL:
> 
> After it found fence signaled, it should check with 
> *“dma_fence_get_status” *to see if there is error there,
> 
> And return the error status of fence
> 
> lFor cs_wait_fences() IOCTL:
> 
> Similar with above approach
> 
> lFor cs_submit() IOCTL:
> 
> It need to check if current ctx been marked as “*guilty*”and return 
> “*ECANCELED*”if so
> 
> lIntroduce a new IOCTL to let UMD query *vram_lost_counter*:
> 
> This way, UMD can also block app from submitting, like @Nicolai 
> mentioned, we can cache one copy of *vram_lost_counter* when enumerate 
> physical device, and deny all
> 
> gl-context from submitting if the counter queried bigger than that one 
> cached in physical device. (looks a little overkill to me, but easy to 
> implement )
> 
> UMD can also return error to APP when creating gl-context if found 
> current queried*vram_lost_counter *bigger than that one cached in 
> physical device.
> 
> BTW: I realized that gl-context is a little different with kernel’s 
> context. Because for kernel. BO is not related with context but only 
> with FD, while in UMD, BO have a backend
> 
> gl-context, so block submitting in UMD layer is also needed although KMD 
> will do its job as bottom line
> 
> lBasically “vram_lost_counter”is exposure by kernel to let UMD take the 
> control of robust extension feature, it will be UMD’s call to move, KMD 
> only deny “guilty”context from submitting
> 
> Need your feedback, thx
> 
> We’d better make TDR feature landed ASAP
> 
> BR Monk
>