TDR and VRAM lost handling in KMD (v2)

Thu Oct 12 08:52:40 UTC 2017

Hi David,

Agreed. I'd also like to use the first variant (UMD compares the current 
vram_lost count with the share list's vram_lost count) for Mesa.

Cheers,
Nicolai

On 12.10.2017 10:43, Mao, David wrote:
> Thanks Monk for the summary!
> 
> Hi Nicolai,
> In order to block the usage of new context reference the old allocation, 
> i think we need to do something in UMD so that KMD don’t need to monitor 
> the resource list.
> I want to make sure we are on the same page.
> If you agree, then there might have two options to do that in UMD: (You 
> can do whatever you want, i just want to elaborate the idea a little bit 
> to facilitate the discussion).
> -  If sharelist is valid, driver need to compare the current 
> vram_lost_count and share list’s vram_lost count, The context will fail 
> to create if share list created before the reset.
> - Or, we can copy the vram_lost count from the sharelist, kernel will 
> fail the submission if the vram_lost count is smaller than current one.
> I personally want to go first for OrcaGL.
> 
> Thanks.
> Best Regards,
> David
> -
>> On 12 Oct 2017, at 4:03 PM, Liu, Monk <Monk.Liu at amd.com 
>> <mailto:Monk.Liu at amd.com>> wrote:
>>
>> V2 summary
>> Hi team
>> *please give your comments*
>> lWhen a job timed out (set from lockup_timeout kernel parameter), What 
>> KMD should do in TDR routine :
>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first
>> 2.Set its fence error status to“*ECANCELED*”,
>> 3.Find the*context*behind this job, and set 
>> this*context*as“*guilty*”(will have a new member field in context 
>> structure –*bool guilty*)
>> a)There will be “*bool * guilty*” in entity structure, which points to 
>> its father context’s member – “*bool guilty”*when context 
>> initialized**, so no matter we get context or entity, we always know 
>> if it is “guilty”
>> b)For kernel entity that used for VM updates, there is no context back 
>> it, so kernel entity’s “bool *guilty” always “NULL”.
>> c)The idea to skip the whole context is for consistence consideration, 
>> because we’ll fake signal the hang job in job_run(), so all jobs in 
>> its context shall be dropped otherwise either bad drawing/computing 
>> results or more GPU hang.
>> **
>> 4.Do GPU reset, which is can be some callbacks to let bare-metal and 
>> SR-IOV implement with their favor style
>> 5.After reset, KMD need to aware if the VRAM lost happens or not, 
>> bare-metal can implement some function to judge, while for SR-IOV I 
>> prefer to read it from GIM side (for initial version we consider it’s 
>> always VRAM lost, till GIM side change aligned)
>> 6.If VRAM lost hit, update adev->*vram_lost_counter*.
>> 7.Do GTT recovery and shadow buffer recovery.
>> 8.Re-schedule all JOBs in mirror list and restart scheduler
>> lFor GPU scheduler function --- job_run()
>> 1.Before schedule a job to ring, checks if job->*vram_lost_counter*== 
>> adev->*vram_lost_counter*, and drop this job if mismatch
>> 2.Before schedule a job to ring, checks if job->entity->*guilty*is 
>> NULL or not,*and drop this job if (guilty!=NULL && *guilty == TRUE)*
>> 3.if a job is dropped:
>> a)set job’s sched_fence status to “*ECANCELED*”
>> b)fake/force signal job’s hw fence (no need to set hw fence’s status)
>> lFor cs_wait() IOCTL:
>> After it found fence signaled, it should check if there is error on 
>> this fence and return the error status of this fence
>> lFor cs_wait_fences() IOCTL:
>> Similar with above approach
>> lFor cs_submit() IOCTL:
>> 1.check if current ctx been marked“*guilty*”and return“*ECANCELED*” if so.
>> 2.set job->*vram_lost_counter*with adev->*vram_lost_counter*, and 
>> return “*ECANCELED*” if ctx->*vram_lost_counter*!= 
>> job->*vram_lost_counter*(Christian already submitted this patch)
>> a)discussion: can we return “ENODEV” if vram_lost_counter mismatch ? 
>> that way UMD know this context is under “device lost”
>> lIntroduce a new IOCTL to let UMD query latest adev->*vram_lost_counter*:
>> lFor amdgpu_ctx_query():
>> n*Don’t update ctx->reset_counter when querying this function, 
>> otherwise the query result is not consistent*
>> nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx 
>> is “*guilty*”, no need to check “ctx->reset_counter”
>> nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET”*if the ctx 
>> isn’t “guilty” && ctx->reset_counter != adev->reset_counter*
>> nSet out->state.reset_status to “AMDGPU_CTX_NO_RESET” if 
>> ctx->reset_counter == adev->reset_counter
>> nSet out->state.flags to “AMDGPU_CTX_FLAG_VRAM_LOST” if 
>> ctx->vram_lost_counter != adev->vram_lost_counter
>> udiscussion: can we return “ENODEV” for amdgpu_ctx_query() if 
>> ctx->vram_lost_counter != adev->vram_lost_counter ? that way UMD know 
>> this context is under “device lost”
>> nUMD shall release this context if it is AMDGPU_CTX_GUILTY_RESET or 
>> its flags is “AMDGPU_CTX_FLAG_VRAM_LOST”
>> For UMD behavior we still have something need to consider:
>> If MESA creates a new context from an old context (share list?? I’m 
>> not familiar with UMD , David Mao shall have some discuss on it with 
>> Nicolai), the new created context’s vram_lost_counter
>> And reset_counter shall all be ported from that old context , 
>> otherwise CS_SUBMIT will not block it which isn’t correct
>> Need your feedback, thx
>> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]*On 
>> Behalf Of*Liu, Monk
>> *Sent:*2017年10月11日13:34
>> *To:*Koenig, Christian <Christian.Koenig at amd.com 
>> <mailto:Christian.Koenig at amd.com>>; Haehnle, Nicolai 
>> <Nicolai.Haehnle at amd.com <mailto:Nicolai.Haehnle at amd.com>>; Olsak, 
>> Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; Deucher, 
>> Alexander <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>
>> *Cc:*Ramirez, Alejandro <Alejandro.Ramirez at amd.com 
>> <mailto:Alejandro.Ramirez at amd.com>>;amd-gfx at lists.freedesktop.org 
>> <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario 
>> <Mario.Filipas at amd.com <mailto:Mario.Filipas at amd.com>>; Ding, Pixel 
>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Li, Bingley 
>> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Jiang, Jerry (SW) 
>> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>
>> *Subject:*TDR and VRAM lost handling in KMD:
>> Hi Christian & Nicolai,
>> We need to achieve some agreements on what should MESA/UMD do and what 
>> should KMD do,*please give your comments with**“okay”or“No”and your 
>> idea on below items,*
>> lWhen a job timed out (set from lockup_timeout kernel parameter), What 
>> KMD should do in TDR routine :
>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, 
>> (*gpu_reset_counter*is used to force vm flush after GPU reset, out of 
>> this thread’s scope so no more discussion on it)
>> 2.Set its fence error status to“*ETIME*”,
>> 3.Find the entity/ctx behind this job, and set this ctx as“*guilty*”
>> 4.Kick out this job from scheduler’s mirror list, so this job won’t 
>> get re-scheduled to ring anymore.
>> 5.Kick out all jobs in this“guilty”ctx’s KFIFO queue, and set all 
>> their fence status to“*ECANCELED*”
>> *6.*Force signal all fences that get kicked out by above two 
>> steps,*otherwise UMD will block forever if waiting on those fences*
>> 7.Do gpu reset, which is can be some callbacks to let bare-metal and 
>> SR-IOV implement with their favor style
>> 8.After reset, KMD need to aware if the VRAM lost happens or not, 
>> bare-metal can implement some function to judge, while for SR-IOV I 
>> prefer to read it from GIM side (for initial version we consider it’s 
>> always VRAM lost, till GIM side change aligned)
>> 9.If VRAM lost not hit, continue, otherwise:
>> a)Update adev->*vram_lost_counter*,
>> b)Iterate over all living ctx, and set all ctx as“*guilty*”since VRAM 
>> lost actually ruins all VRAM contents
>> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence 
>> status to“*ECANCELDED*”
>> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, do 
>> we need it ???)
>> 11.Re-schedule all JOBs remains in mirror list to ring again and 
>> restart scheduler (for VRAM lost case, no JOB will re-scheduled)
>> lFor cs_wait() IOCTL:
>> After it found fence signaled, it should check 
>> with*“dma_fence_get_status”*to see if there is error there,
>> And return the error status of fence
>> lFor cs_wait_fences() IOCTL:
>> Similar with above approach
>> lFor cs_submit() IOCTL:
>> It need to check if current ctx been marked as“*guilty*”and 
>> return“*ECANCELED*”if so
>> lIntroduce a new IOCTL to let UMD query*vram_lost_counter*:
>> This way, UMD can also block app from submitting, like @Nicolai 
>> mentioned, we can cache one copy of*vram_lost_counter*when enumerate 
>> physical device, and deny all
>> gl-context from submitting if the counter queried bigger than that one 
>> cached in physical device. (looks a little overkill to me, but easy to 
>> implement )
>> UMD can also return error to APP when creating gl-context if found 
>> current queried*vram_lost_counter*bigger than that one cached in 
>> physical device.
>> BTW: I realized that gl-context is a little different with kernel’s 
>> context. Because for kernel. BO is not related with context but only 
>> with FD, while in UMD, BO have a backend
>> gl-context, so block submitting in UMD layer is also needed although 
>> KMD will do its job as bottom line
>> lBasically“vram_lost_counter”is exposure by kernel to let UMD take the 
>> control of robust extension feature, it will be UMD’s call to move, 
>> KMD only deny“guilty”context from submitting
>> Need your feedback, thx
>> We’d better make TDR feature landed ASAP
>> BR Monk
>