TDR and VRAM lost handling in KMD:

Wed Oct 11 07:20:44 UTC 2017

See inline:

Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>
> Hi Christian & Nicolai,
>
> We need to achieve some agreements on what should MESA/UMD do and what 
> should KMD do, *please give your comments with “okay” or “No” and your 
> idea on below items,*
>
> lWhen a job timed out (set from lockup_timeout kernel parameter), What 
> KMD should do in TDR routine :
>
> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, 
> (*gpu_reset_counter* is used to force vm flush after GPU reset, out of 
> this thread’s scope so no more discussion on it)
>
Okay.

> 2.Set its fence error status to “*ETIME*”,
>
No, as I already explained ETIME is for synchronous operation.

In other words when we return ETIME from the wait IOCTL it would mean 
that the waiting has somehow timed out, but not the job we waited for.

Please use ECANCELED as well or some other error code when we find that 
we need to distinct the timedout job from the canceled ones (probably a 
good idea, but I'm not sure).

> 3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>
Not sure. Do we want to set the whole context as guilty or just the entity?

Setting the whole contexts as guilty sounds racy to me.

BTW: We should use a different name than "guilty", maybe just "bool 
canceled;" ?

> 4.Kick out this job from scheduler’s mirror list, so this job won’t 
> get re-scheduled to ring anymore.
>
Okay.

> 5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
> their fence status to “*ECANCELED*”
>
Setting ECANCELED should be ok. But I think we should do this when we 
try to run the jobs and not during GPU reset.

> *6.*Force signal all fences that get kicked out by above two 
> steps,*otherwise UMD will block forever if waiting on those fences*
>
Okay.

> **
>
> 7.Do gpu reset, which is can be some callbacks to let bare-metal and 
> SR-IOV implement with their favor style
>
Okay.

> 8.After reset, KMD need to aware if the VRAM lost happens or not, 
> bare-metal can implement some function to judge, while for SR-IOV I 
> prefer to read it from GIM side (for initial version we consider it’s 
> always VRAM lost, till GIM side change aligned)
>
Okay.

> 9.If VRAM lost not hit, continue, otherwise:
>
> a)Update adev->*vram_lost_counter*,
>
Okay.

> b)Iterate over all living ctx, and set all ctx as “*guilty*” since 
> VRAM lost actually ruins all VRAM contents
>
No, that shouldn't be done by comparing the counters. Iterating over all 
contexts is way to much overhead.

> c)Kick out all jobs in all ctx’s KFIFO queue, and set all their fence 
> status to “*ECANCELDED*”
>
Yes and no, that should be done when we try to run the jobs and not 
during GPU reset.

> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, do 
> we need it ???)
>
Yes, that is still needed. As Nicolai explained we can't be sure that 
VRAM is still 100% correct even when it isn't cleared.

> 11.Re-schedule all JOBs remains in mirror list to ring again and 
> restart scheduler (for VRAM lost case, no JOB will re-scheduled)
>
Okay.

> lFor cs_wait() IOCTL:
>
> After it found fence signaled, it should check with 
> *“dma_fence_get_status” *to see if there is error there,
>
> And return the error status of fence
>
Yes and no, dma_fence_get_status() is some specific handling for 
sync_file debugging (no idea why that made it into the common fence code).

It was replaced by putting the error code directly into the fence, so 
just reading that one after waiting should be ok.

Maybe we should fix dma_fence_get_status() to do the right thing for this?

> lFor cs_wait_fences() IOCTL:
>
> Similar with above approach
>
> lFor cs_submit() IOCTL:
>
> It need to check if current ctx been marked as “*guilty*” and return 
> “*ECANCELED*” if so
>
> lIntroduce a new IOCTL to let UMD query *vram_lost_counter*:
>
> This way, UMD can also block app from submitting, like @Nicolai 
> mentioned, we can cache one copy of *vram_lost_counter* when enumerate 
> physical device, and deny all
>
> gl-context from submitting if the counter queried bigger than that one 
> cached in physical device. (looks a little overkill to me, but easy to 
> implement )
>
> UMD can also return error to APP when creating gl-context if found 
> current queried*vram_lost_counter *bigger than that one cached in 
> physical device.
>
Okay. Already have a patch for this, please review that one if you 
haven't already done so.

Regards,
Christian.

> BTW: I realized that gl-context is a little different with kernel’s 
> context. Because for kernel. BO is not related with context but only 
> with FD, while in UMD, BO have a backend
>
> gl-context, so block submitting in UMD layer is also needed although 
> KMD will do its job as bottom line
>
> lBasically “vram_lost_counter” is exposure by kernel to let UMD take 
> the control of robust extension feature, it will be UMD’s call to 
> move, KMD only deny “guilty” context from submitting
>
> Need your feedback, thx
>
> We’d better make TDR feature landed ASAP
>
> BR Monk
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171011/fb558beb/attachment-0001.html>