TDR and VRAM lost handling in KMD (v2)
Christian König
christian.koenig at amd.com
Thu Oct 12 08:40:47 UTC 2017
Yeah, that sounds like a plan to me.
Going to commit the patches I've already done in a minute; they still
seem to implement quite a bit of what we want to do here.
Regards,
Christian.
On 12.10.2017 at 10:03, Liu, Monk wrote:
>
> V2 summary
>
> Hi team
>
> *please give your comments*
>
> - When a job times out (set from the lockup_timeout kernel parameter),
> what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first.
>
> 2. Set its fence error status to “*ECANCELED*”.
>
> 3. Find the *context* behind this job and mark that *context* as
> “*guilty*” (a new member field in the context structure – *bool
> guilty*).
>
> a) There will be a “bool *guilty” pointer in the entity structure,
> which points at its parent context’s “bool guilty” member when the
> context is initialized, so whether we have the context or the entity,
> we always know if it is “guilty” (see the sketch after this list).
>
> b) For the kernel entity used for VM updates there is no context
> backing it, so the kernel entity’s “bool *guilty” is always NULL.
>
> c) The idea of skipping the whole context is for consistency: since we
> will fake-signal the hung job in job_run(), all jobs in its context
> must be dropped as well, otherwise we get either bad drawing/compute
> results or another GPU hang.
>
> 4. Do the GPU reset, which can be a set of callbacks so that
> bare-metal and SR-IOV can each implement it in their preferred style.
>
> 5. After the reset, KMD needs to know whether VRAM loss happened or
> not; bare-metal can implement some function to judge this, while for
> SR-IOV I prefer to read it from the GIM side (for the initial version
> we assume VRAM is always lost, until the GIM-side change is aligned).
>
> 6. If VRAM loss is detected, update adev->*vram_lost_counter*.
>
> 7. Do GTT recovery and shadow buffer recovery.
>
> 8. Re-schedule all jobs in the mirror list and restart the scheduler.
>
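> As a minimal sketch only (the struct and function names below are
> simplified placeholders, not the actual amdgpu code), the guilty
> wiring described in step 3 could look like this:
>
>     /* The context owns the flag; entities only hold a pointer to it. */
>     struct amdgpu_ctx {
>             bool guilty;
>             /* ... */
>     };
>
>     struct amd_sched_entity {
>             bool *guilty;   /* parent ctx's flag, or NULL for kernel
>                              * entities such as the VM update entity */
>             /* ... */
>     };
>
>     /* At context init every entity inherits a pointer to ctx->guilty. */
>     static void ctx_init_entity(struct amdgpu_ctx *ctx,
>                                 struct amd_sched_entity *entity)
>     {
>             entity->guilty = &ctx->guilty;
>     }
>
>     /* In the TDR routine, mark the context behind the timed-out job. */
>     static void tdr_mark_guilty(struct amd_sched_entity *entity)
>     {
>             if (entity->guilty)
>                     *entity->guilty = true;
>     }
>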
> - For the GPU scheduler function job_run() (see the sketch after this
> list):
>
> 1. Before scheduling a job to the ring, check whether
> job->*vram_lost_counter* == adev->*vram_lost_counter*, and drop the
> job on a mismatch.
>
> 2. Before scheduling a job to the ring, check whether
> job->entity->*guilty* is NULL, and drop the job if
> (guilty != NULL && *guilty == true).
>
> 3. If a job is dropped:
>
> a) set the job’s sched_fence status to “*ECANCELED*”
>
> b) fake/force signal the job’s hw fence (no need to set the hw fence’s
> status)
>
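> A rough sketch of these checks (the amdgpu_job fields and the
> submit_to_ring() helper are simplified assumptions here;
> dma_fence_set_error() and dma_fence_signal() are the existing kernel
> helpers):
>
>     static struct dma_fence *job_run(struct amdgpu_job *job)
>     {
>             struct amdgpu_device *adev = job->adev;
>             bool vram_lost = job->vram_lost_counter !=
>                              atomic_read(&adev->vram_lost_counter);
>             bool guilty = job->entity->guilty && *job->entity->guilty;
>
>             if (vram_lost || guilty) {
>                     /* a) mark the scheduler fence as canceled */
>                     dma_fence_set_error(&job->sched_fence, -ECANCELED);
>                     /* b) fake/force signal so waiters never block */
>                     dma_fence_signal(&job->sched_fence);
>                     return NULL;
>             }
>
>             return submit_to_ring(job);  /* normal path, illustrative */
>     }
>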
> - For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check whether there is an
> error on this fence and return that fence’s error status.
>
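> For example (sketch only, using the existing dma_fence helpers), the
> wait path could propagate the error like this:
>
>     long r = dma_fence_wait_timeout(fence, true, timeout);
>
>     if (r > 0) {
>             /* signaled: report any error set on it, e.g. -ECANCELED */
>             int status = dma_fence_get_status(fence);
>
>             if (status < 0)
>                     return status;
>     }
>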
> - For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> - For the cs_submit() IOCTL (see the sketch after this list):
>
> 1. Check whether the current ctx has been marked “*guilty*” and return
> “*ECANCELED*” if so.
>
> 2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and
> return “*ECANCELED*” if ctx->*vram_lost_counter* !=
> job->*vram_lost_counter* (Christian already submitted this patch).
>
> a) Discussion: can we return “ENODEV” if vram_lost_counter mismatches?
> That way UMD knows this context is in a “device lost” state.
>
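> A minimal sketch of these two checks at the top of cs_submit() (field
> names simplified; whether the counter mismatch returns -ECANCELED or
> -ENODEV is exactly the open question in point a):
>
>     if (ctx->guilty)
>             return -ECANCELED;
>
>     if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>             return -ECANCELED;      /* or -ENODEV, see discussion */
>
>     /* remember which VRAM "epoch" this job belongs to, so that
>      * job_run() can drop it if another VRAM loss happens meanwhile */
>     job->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>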
> - Introduce a new IOCTL to let UMD query the latest
> adev->*vram_lost_counter* (a possible shape is sketched below):
>
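> One possible shape, sketched only (the function name and the exact
> ioctl plumbing are just suggestions, not an agreed interface):
>
>     /* report the current VRAM-lost epoch to user space */
>     static int amdgpu_query_vram_lost_counter(struct amdgpu_device *adev,
>                                               u32 __user *out)
>     {
>             u32 counter = atomic_read(&adev->vram_lost_counter);
>
>             return copy_to_user(out, &counter, sizeof(counter)) ?
>                     -EFAULT : 0;
>     }
>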
> - For amdgpu_ctx_query() (see the sketch after this list):
>
>   - Don’t update ctx->reset_counter when this query runs, otherwise
>     the query result is not consistent.
>
>   - Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the
>     ctx is “*guilty*”; no need to check ctx->reset_counter.
>
>   - Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” if the
>     ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter.
>
>   - Set out->state.reset_status to “AMDGPU_CTX_NO_RESET” if
>     ctx->reset_counter == adev->reset_counter.
>
>   - Set out->state.flags to “AMDGPU_CTX_FLAG_VRAM_LOST” if
>     ctx->vram_lost_counter != adev->vram_lost_counter.
>
>     - Discussion: can we return “ENODEV” from amdgpu_ctx_query() if
>       ctx->vram_lost_counter != adev->vram_lost_counter? That way UMD
>       knows this context is in a “device lost” state.
>
>   - UMD shall release this context if its status is
>     AMDGPU_CTX_GUILTY_RESET or its flags contain
>     “AMDGPU_CTX_FLAG_VRAM_LOST”.
>
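> Putting the rules above together, a sketch of the query could look
> like this (field names simplified; AMDGPU_CTX_FLAG_VRAM_LOST is the
> newly proposed flag, not an existing one):
>
>     static void amdgpu_ctx_query_state(struct amdgpu_device *adev,
>                                        struct amdgpu_ctx *ctx,
>                                        union drm_amdgpu_ctx_out *out)
>     {
>             u32 reset = atomic_read(&adev->reset_counter);
>             u32 vram_lost = atomic_read(&adev->vram_lost_counter);
>
>             /* note: ctx->reset_counter is intentionally NOT updated */
>             if (ctx->guilty)
>                     out->state.reset_status = AMDGPU_CTX_GUILTY_RESET;
>             else if (ctx->reset_counter != reset)
>                     out->state.reset_status = AMDGPU_CTX_INNOCENT_RESET;
>             else
>                     out->state.reset_status = AMDGPU_CTX_NO_RESET;
>
>             if (ctx->vram_lost_counter != vram_lost)
>                     out->state.flags |= AMDGPU_CTX_FLAG_VRAM_LOST;
>     }
>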
> For UMD behavior we still have something to consider:
>
> If MESA creates a new context from an old context (share list? I’m not
> familiar with UMD; David Mao should discuss it with Nicolai), the
> newly created context’s vram_lost_counter and reset_counter shall both
> be inherited from that old context, otherwise CS_SUBMIT will not block
> it, which isn’t correct.
>
> Need your feedback, thx
>
> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
> Behalf Of* Liu, Monk
> *Sent:* October 11, 2017 13:34
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai
> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> *Subject:* TDR and VRAM lost handling in KMD:
>
> Hi Christian & Nicolai,
>
> We need to reach agreement on what MESA/UMD should do and what KMD
> should do; *please give your comments with “okay” or “no” and your
> ideas on the items below.*
>
> - When a job times out (set from the lockup_timeout kernel parameter),
> what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
> (*gpu_reset_counter* is used to force a VM flush after GPU reset; it
> is out of this thread’s scope, so no more discussion on it).
>
> 2. Set its fence error status to “*ETIME*”.
>
> 3. Find the entity/ctx behind this job and mark that ctx as “*guilty*”.
>
> 4. Kick this job out of the scheduler’s mirror list, so it won’t get
> re-scheduled to the ring anymore.
>
> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue and set all
> their fence error statuses to “*ECANCELED*”.
>
> 6. Force-signal all fences that were kicked out by the above two
> steps, *otherwise UMD will block forever if it waits on those fences*.
>
> 7. Do the GPU reset, which can be a set of callbacks so that
> bare-metal and SR-IOV can each implement it in their preferred style.
>
> 8. After the reset, KMD needs to know whether VRAM loss happened or
> not; bare-metal can implement some function to judge this, while for
> SR-IOV I prefer to read it from the GIM side (for the initial version
> we assume VRAM is always lost, until the GIM-side change is aligned).
>
> 9. If VRAM loss did not happen, continue; otherwise:
>
> a) Update adev->*vram_lost_counter*.
>
> b) Iterate over all living ctx and mark them all as “*guilty*”, since
> VRAM loss actually ruins all VRAM contents.
>
> c) Kick out all jobs in all ctx’s KFIFO queues and set all their fence
> error statuses to “*ECANCELED*”.
>
> 10. Do GTT recovery and VRAM page table/entry recovery (optional, do
> we need it?).
>
> 11. Re-schedule all jobs remaining in the mirror list to the ring
> again and restart the scheduler (for the VRAM lost case, no job will
> be re-scheduled).
>
> - For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check with
> *dma_fence_get_status()* to see whether there is an error there, and
> return the fence’s error status.
>
> - For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> - For the cs_submit() IOCTL:
>
> It needs to check whether the current ctx has been marked as
> “*guilty*” and return “*ECANCELED*” if so.
>
> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>
> This way, UMD can also block the app from submitting. As @Nicolai
> mentioned, we can cache one copy of *vram_lost_counter* when
> enumerating the physical device, and deny all gl-contexts from
> submitting if the queried counter is bigger than the one cached in the
> physical device (looks a little overkill to me, but it is easy to
> implement; see the sketch below).
>
> UMD can also return an error to the app when creating a gl-context if
> the currently queried *vram_lost_counter* is bigger than the one
> cached in the physical device.
>
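> A UMD-side sketch of that idea (the names are made up for
> illustration; the queried counter would come from the new IOCTL):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     struct umd_physical_device {
>             uint32_t vram_lost_counter;     /* cached at enumeration */
>     };
>
>     /* deny submission / context creation once the kernel moved on */
>     static bool umd_device_lost(const struct umd_physical_device *pdev,
>                                 uint32_t queried_counter)
>     {
>             return queried_counter > pdev->vram_lost_counter;
>     }
>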
> BTW: I realized that a gl-context is a little different from the
> kernel’s context, because for the kernel a BO is not related to a
> context but only to an FD, while in UMD a BO has a backing gl-context.
> So blocking submission in the UMD layer is also needed, although KMD
> will do its job as the bottom line.
>
> - Basically, “vram_lost_counter” is exposed by the kernel to let UMD
> take control of the robustness extension feature; it will be UMD’s
> call how to act on it, while KMD only denies “guilty” contexts from
> submitting.
>
> Need your feedback, thx
>
> We’d better get the TDR feature landed ASAP.
>
> BR Monk
>