TDR and VRAM lost handling in KMD (v2)

Christian König christian.koenig at amd.com
Thu Oct 12 08:49:49 UTC 2017


> However, !guilty && ctx->reset_counter != adev->reset_counter does not 
> imply that the context was lost.
>
> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
> !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>
> As far as I understand it, the case of !guilty && ctx->reset_counter 
> != adev->reset_counter && ctx->vram_lost_counter == 
> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
> GPU reset occurred, but it didn't affect our context.
I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there was a
reset, but we weren't the ones causing it.

That the OpenGL extension is specified otherwise is unfortunate, but I 
think we shouldn't use that for the kernel interface here.
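
In code terms, the semantics I mean are roughly this (just a sketch with
approximate field names, keyed off the reset counter rather than the
vram-lost counter, not the actual implementation):

    if (ctx->guilty)
        out->state.reset_status = AMDGPU_CTX_GUILTY_RESET;
    else if (ctx->reset_counter != adev->reset_counter)
        /* a reset happened and we didn't cause it */
        out->state.reset_status = AMDGPU_CTX_INNOCENT_RESET;
    else
        out->state.reset_status = AMDGPU_CTX_NO_RESET;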

Regards,
Christian.

On 12.10.2017 10:44, Nicolai Hähnle wrote:
> I think we should stick to the plan where kernel contexts stay "stuck" 
> after a GPU reset. This is the most robust behavior for the kernel.
>
> Even if the OpenGL spec says that an OpenGL context can be re-used 
> without destroying and re-creating it, the UMD can take care of 
> re-creating the kernel context.
>
> This means amdgpu_ctx_query should *not* reset ctx->reset_counter.
>
> Cheers,
> Nicolai
>
>
> On 12.10.2017 10:41, Nicolai Hähnle wrote:
>> Hi Monk,
>>
>> Thanks for the summary. Most of it looks good to me, though I can't 
>> speak to all the kernel internals.
>>
>> Just some comments:
>>
>> On 12.10.2017 10:03, Liu, Monk wrote:
>>> * For the cs_submit() IOCTL:
>>>
>>> 1. Check if the current ctx has been marked “*guilty*” and return
>>> “*ECANCELED*” if so.
>>>
>>> 2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and
>>> return “*ECANCELED*” if ctx->*vram_lost_counter* !=
>>> job->*vram_lost_counter* (Christian already submitted this patch; a
>>> rough sketch of both checks follows below).
>>>
>>> a) Discussion: can we return “ENODEV” if the vram_lost_counter
>>> mismatches? That way UMD knows this context is under “device lost”.
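>>>
>>> Roughly, the two checks could look like this (just a sketch with
>>> approximate field names, not the exact patch):
>>>
>>>     /* in the cs_submit path, after looking up the context */
>>>     if (ctx->guilty)
>>>         return -ECANCELED;
>>>
>>>     job->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>>     if (ctx->vram_lost_counter != job->vram_lost_counter)
>>>         return -ECANCELED;   /* or -ENODEV, see the question above */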
>>
>> My plan for UMD is to always query the VRAM lost counter when any 
>> kind of context lost situation is detected. So cs_submit() should 
>> return an error in this situation, but it could just be ECANCELED. We 
>> don't need to distinguish between different types of errors here.
>>
>>
>>> * Introduce a new IOCTL to let UMD query the latest
>>> adev->*vram_lost_counter*:
>>
>> Christian already sent a patch for this.
>>
>>
>>> * For amdgpu_ctx_query():
>>>
>>>   - *Don’t update ctx->reset_counter when querying this function,
>>> otherwise the query result is not consistent.*
>>
>> Hmm. I misremembered part of the spec, see below.
>>
>>
>>>   - Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the
>>> ctx is “*guilty*”; no need to check ctx->reset_counter.
>>
>> Agreed.
>>
>>
>>>   - Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if
>>> the ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter.*
>>
>> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
>> corresponding enums in user space APIs. I don't know how it works in 
>> Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
>> that the context was lost.
>>
>> However, !guilty && ctx->reset_counter != adev->reset_counter does 
>> not imply that the context was lost.
>>
>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty && ctx->reset_counter 
>> != adev->reset_counter && ctx->vram_lost_counter == 
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
>> GPU reset occurred, but it didn't affect our context.
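>>
>> In pseudo-code, the mapping I have in mind is roughly this (sketch
>> only, field names approximate):
>>
>>     if (ctx->guilty)
>>         status = AMDGPU_CTX_GUILTY_RESET;
>>     else if (ctx->vram_lost_counter != adev->vram_lost_counter)
>>         status = AMDGPU_CTX_INNOCENT_RESET; /* our context was lost */
>>     else
>>         status = AMDGPU_CTX_NO_RESET;       /* a reset happened, but
>>                                              * it didn't affect us */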
>>
>> I unfortunately noticed another subtlety while re-reading the OpenGL 
>> spec. OpenGL says that the OpenGL context itself does *not* have to 
>> be re-created in order to recover from the reset. Re-creating all 
>> objects in the context is sufficient.
>>
>> I believe this is the original motivation for why amdgpu_ctx_query() 
>> will reset the ctx->reset_counter.
>>
>> For Mesa, it's still okay if the kernel keeps blocking submissions as 
>> we can just recreate the kernel context. But OrcaGL is also affected.
>>
>> Does anybody know off-hand where the relevant parts of the Vulkan 
>> spec are? I didn't actually find anything in a quick search.
>>
>>
>> [snip]
>>> For UMD behavior we still have something to consider:
>>>
>>> If MESA creates a new context from an old context (share list? I’m
>>> not familiar with UMD; David Mao should discuss this with Nicolai),
>>> the newly created context’s vram_lost_counter and reset_counter
>>> should both be carried over from that old context, otherwise
>>> CS_SUBMIT will not block it, which isn’t correct.
>>
>> The kernel doesn't have to do anything for this, it is entirely the 
>> UMD's responsibility. All UMD needs from KMD is the function for 
>> querying the vram_lost_counter.
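>>
>> Something along these lines on the UMD side (hypothetical names, just
>> to illustrate the bookkeeping):
>>
>>     /* when a new context is created from an old one (share list),
>>      * carry the cached counters over so submissions through either
>>      * context fail the same way after a loss */
>>     new_gl_ctx->cached_vram_lost_counter =
>>         old_gl_ctx->cached_vram_lost_counter;
>>     new_gl_ctx->cached_reset_counter = old_gl_ctx->cached_reset_counter;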
>>
>> Cheers,
>> Nicolai
>>
>>
>>>
>>> Need your feedback, thx
>>>
>>> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On 
>>> Behalf Of *Liu, Monk
>>> *Sent:* October 11, 2017, 13:34
>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai 
>>> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
>>> amd-gfx at lists.freedesktop.org; Filipas, Mario 
>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, 
>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>> *Subject:* TDR and VRAM lost handling in KMD:
>>>
>>> Hi Christian & Nicolai,
>>>
>>> We need to reach agreement on what MESA/UMD should do and what KMD
>>> should do. *Please give your comments with “okay” or “no” and your
>>> ideas on the items below.*
>>>
>>> * When a job times out (threshold set by the lockup_timeout kernel
>>> parameter), what KMD should do in the TDR routine (a rough sketch of
>>> the whole flow follows after step 11):
>>>
>>> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
>>> (*gpu_reset_counter* is used to force a VM flush after GPU reset;
>>> that is out of this thread’s scope, so no more discussion on it).
>>>
>>> 2. Set its fence error status to “*ETIME*”.
>>>
>>> 3. Find the entity/ctx behind this job, and mark this ctx as
>>> “*guilty*”.
>>>
>>> 4. Kick this job out of the scheduler’s mirror list, so it won’t get
>>> re-scheduled to the ring anymore.
>>>
>>> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>> their fence status to “*ECANCELED*”.
>>>
>>> 6. Force-signal all fences kicked out by the above two steps,
>>> *otherwise UMD will block forever when waiting on those fences*.
>>>
>>> 7. Do the GPU reset, which can be a set of callbacks so that
>>> bare-metal and SR-IOV can each implement it in their preferred style.
>>>
>>> 8. After the reset, KMD needs to know whether VRAM loss happened or
>>> not. Bare-metal can implement some function to determine this, while
>>> for SR-IOV I prefer to read it from the GIM side (for the initial
>>> version we consider VRAM always lost, until the GIM-side change is
>>> aligned).
>>>
>>> 9. If VRAM loss did not happen, continue; otherwise:
>>>
>>> a) Update adev->*vram_lost_counter*.
>>>
>>> b) Iterate over all living ctxs and mark them all as “*guilty*”,
>>> since VRAM loss actually ruins all VRAM contents.
>>>
>>> c) Kick out all jobs in all ctxs’ KFIFO queues, and set all their
>>> fence status to “*ECANCELED*”.
>>>
>>> 10. Do GTT recovery and VRAM page table/entry recovery (optional, do
>>> we need it?).
>>>
>>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>>> again and restart the scheduler (in the VRAM lost case, no job will
>>> be re-scheduled).
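>>>
>>> Put together, the flow above would look roughly like this (heavily
>>> simplified sketch; the helper names are made up just for
>>> illustration):
>>>
>>>     /* steps 1-6: stop, mark guilty, cancel and force-signal */
>>>     atomic_inc(&adev->gpu_reset_counter);
>>>     stop_scheduler(adev);
>>>
>>>     dma_fence_set_error(bad_job->fence, -ETIME);
>>>     guilty_ctx = ctx_of(bad_job);
>>>     guilty_ctx->guilty = true;
>>>     remove_from_mirror_list(bad_job);
>>>     cancel_and_signal_ctx_queue(guilty_ctx, -ECANCELED);
>>>
>>>     /* step 7: reset via bare-metal or SR-IOV callback */
>>>     do_gpu_reset(adev);
>>>
>>>     /* steps 8-9: handle VRAM loss */
>>>     if (vram_lost(adev)) {
>>>         atomic_inc(&adev->vram_lost_counter);
>>>         for_each_ctx(adev, ctx) {
>>>             ctx->guilty = true;
>>>             cancel_and_signal_ctx_queue(ctx, -ECANCELED);
>>>         }
>>>     }
>>>
>>>     /* steps 10-11: recovery and restart */
>>>     recover_gtt_and_page_tables(adev);
>>>     resubmit_mirror_list(adev);
>>>     restart_scheduler(adev);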
>>>
>>> * For the cs_wait() IOCTL:
>>>
>>> After it finds the fence signaled, it should check with
>>> *“dma_fence_get_status”* to see if there is an error there,
>>>
>>> and return the error status of the fence.
>>>
>>> * For the cs_wait_fences() IOCTL:
>>>
>>> Similar to the above approach (a rough sketch for both wait paths
>>> follows below).
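>>>
>>> Just a sketch of the idea, not the exact code:
>>>
>>>     r = dma_fence_wait_timeout(fence, true, timeout);
>>>     if (r > 0) {
>>>         /* fence signaled: pick up any error set by the TDR path,
>>>          * e.g. -ETIME for the guilty job or -ECANCELED for the
>>>          * kicked-out ones */
>>>         int err = dma_fence_get_status(fence);
>>>         if (err < 0)
>>>             return err;
>>>     }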
>>>
>>> * For the cs_submit() IOCTL:
>>>
>>> It needs to check whether the current ctx has been marked “*guilty*”
>>> and return “*ECANCELED*” if so.
>>>
>>> * Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>
>>> This way, UMD can also block the app from submitting. As @Nicolai
>>> mentioned, we can cache one copy of *vram_lost_counter* when
>>> enumerating the physical device, and deny all gl-contexts from
>>> submitting if the queried counter is bigger than the one cached in
>>> the physical device (looks a little overkill to me, but easy to
>>> implement; a rough sketch follows below).
>>>
>>> UMD can also return an error to the app when creating a gl-context if
>>> the currently queried *vram_lost_counter* is bigger than the one
>>> cached in the physical device.
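>>>
>>> On the UMD side this could look roughly like this (sketch only; the
>>> query helper is hypothetical, standing in for whatever the new IOCTL
>>> ends up being called):
>>>
>>>     /* at physical device enumeration */
>>>     dev->vram_lost_counter = query_vram_lost_counter(dev->fd);
>>>
>>>     /* at gl-context creation and before each submission */
>>>     if (query_vram_lost_counter(dev->fd) > dev->vram_lost_counter)
>>>         return -ECANCELED;   /* deny the submission or fail
>>>                               * gl-context creation */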
>>>
>>> BTW: I realized that a gl-context is a little different from the
>>> kernel’s context, because for the kernel a BO is not tied to a
>>> context but only to an FD, while in UMD a BO has a backing
>>> gl-context. So blocking submission in the UMD layer is also needed,
>>> although KMD will do its job as the bottom line.
>>>
>>> * Basically, “vram_lost_counter” is exposed by the kernel to let UMD
>>> take control of the robustness extension feature. It will be UMD’s
>>> call to make; KMD only denies “guilty” contexts from submitting.
>>>
>>> Need your feedback, thx
>>>
>>> We’d better get the TDR feature landed ASAP.
>>>
>>> BR Monk
>>>
>>
>


