TDR and VRAM lost handling in KMD:

Wed Oct 11 14:04:48 UTC 2017

Yeah, I just thought of it, agree that shouldn't keep copy in entity, otherwise too complicated to handle 

BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] 
Sent: Wednesday, October 11, 2017 10:04 PM
To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

> I remember even the VM update job is with a kernel entity, (no context 
> is true), and if entity can keep a counter copy
That won't work. We want to keep the entities associated with VM updates and buffer moves alive, but their jobs canceled.

Regards,
Christian.

Am 11.10.2017 um 15:51 schrieb Liu, Monk:
>> Some jobs don't have a context (VM updates, clears, buffer moves).
> What? I remember even the VM update job is with a kernel entity, (no 
> context is true), and if entity can keep a counter copy That can solve 
> your concerns
>
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Wednesday, October 11, 2017 9:39 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Zhou, David(ChunMing) 
> <David1.Zhou at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; 
> Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; 
> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> Some jobs don't have a context (VM updates, clears, buffer moves).
>
> I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable.
>
> So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run.
>
> Regards,
> Christian.
>
> Am 11.10.2017 um 15:35 schrieb Liu, Monk:
>> I think just compare the copy from context/entity with current 
>> counter is enough, don't see how it's better to keep another copy in 
>> JOB
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Wednesday, October 11, 2017 6:40 PM
>> To: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk 
>> <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; 
>> Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander 
>> <Alexander.Deucher at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
>> amd-gfx at lists.freedesktop.org; Filipas, Mario 
>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, 
>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> I've already posted a patch for this on the mailing list.
>>
>> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled.
>>
>> Regards,
>> Christian.
>>
>> Am 11.10.2017 um 12:14 schrieb Chunming Zhou:
>>> Your summary lacks the below issue:
>>>
>>> How about the job already pushed in scheduler queue when vram is lost?
>>>
>>>
>>> Regards,
>>> David Zhou
>>> On 2017年10月11日 17:41, Liu, Monk wrote:
>>>> Okay, let me summary our whole idea together and see if it works:
>>>>
>>>> 1, For cs_submit, always check vram-lost_counter first and reject 
>>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter !=
>>>> adev->vram_lost_counter. That way the vram lost issue can be 
>>>> adev->handled
>>>>
>>>> 2, for cs_submit we still need to check if the incoming context is 
>>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found
>>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can 
>>>> ctx->reject
>>>> the submit
>>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>>>>
>>>> 3, in gpu_reset() routine, we only mark the hang job's entity as 
>>>> guilty (so we need to add new member in entity structure), and not 
>>>> kick it out in gpu_reset() stage, but we need to set the context 
>>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET"
>>>>      And if reset introduces VRAM LOST, we just update
>>>> adev->vram_lost_counter, but *don't* change all entity to guilty, 
>>>> adev->so
>>>> still only the hang job's entity is "guilty"
>>>>      After some entity marked as "guilty", we find a way to set the 
>>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K 
>>>> interface, we need let UMD can know that this context is wrong.
>>>>
>>>> 4, in gpu scheduler's run_job() routine, since it only reads 
>>>> entity, so we skip job scheduling once found the entity is "guilty"
>>>>
>>>>
>>>> Does above sounds good ?
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Haehnle, Nicolai
>>>> Sent: Wednesday, October 11, 2017 5:26 PM
>>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel 
>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, 
>>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro 
>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>
>>>> On 11.10.2017 11:18, Liu, Monk wrote:
>>>>> Let's talk it simple, When vram lost hit,  what's the action for 
>>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts 
>>>>> (not the one trigger gpu hang) after vram lost ? do you mean we 
>>>>> return -ENODEV to UMD ?
>>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>>>>
>>>>
>>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as 
>>>>> "guilty", how we block its from submitting ? can you show some 
>>>>> implement way
>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>         return -ECANCELED;
>>>>
>>>> (where ctx->vram_lost_counter is initialized at context creation 
>>>> time and never changed afterwards)
>>>>
>>>>
>>>>> BTW: the "guilty" here is a new member I want to add to context, 
>>>>> it is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, 
>>>>> Looks I need to unify them and only one place to mark guilty or 
>>>>> not
>>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
>>>> consistent with the rest.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>> BR Monk
>>>>>
>>>>> -----Original Message-----
>>>>> From: Haehnle, Nicolai
>>>>> Sent: Wednesday, October 11, 2017 5:00 PM
>>>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
>>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel 
>>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, 
>>>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro 
>>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario 
>>>>> <Mario.Filipas at amd.com>
>>>>> Subject: Re: TDR and VRAM lost handling in KMD:
>>>>>
>>>>> On 11.10.2017 10:48, Liu, Monk wrote:
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. 
>>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ 
>>>>>> make sense to mark idle contexts as "guilty" just because VRAM is 
>>>>>> lost. VRAM lost is a perfect example where the driver should 
>>>>>> report context lost to applications with the "innocent" flag for 
>>>>>> contexts that were idle at the time of reset. The only context(s) 
>>>>>> that should be reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> ML: KMD mark all contexts as guilty is because that way we can 
>>>>>> unify our IOCTL behavior: e.g. for IOCTL only block 
>>>>>> “guilty”context , no need to worry about vram-lost-counter 
>>>>>> anymore, that’s a implementation style. I don’t think it is 
>>>>>> related with UMD layer,
>>>>>>
>>>>>> For UMD the gl-context isn’t aware of by KMD, so UMD can 
>>>>>> implement it own “guilty” gl-context if you want.
>>>>> Well, to some extent this is just semantics, but it helps to keep 
>>>>> the terminology consistent.
>>>>>
>>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi 
>>>>> in
>>>>> mind: this returns one of
>>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>>>>> and it must return "innocent" for contexts that are only lost due 
>>>>> to VRAM lost without being otherwise involved in the timeout that 
>>>>> lead to the reset.
>>>>>
>>>>> The point is that in the places where you used "guilty" it would 
>>>>> be better to use "context lost", and then further differentiate 
>>>>> between guilty/innocent context lost based on the details of what happened.
>>>>>
>>>>>
>>>>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
>>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like 
>>>>>> cs_sumbit ?? let’s see which way better
>>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>>>>          return -ECANCELED;
>>>>>
>>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>>>>
>>>>> Yes, it's one additional check in cs_submit. If you're worried 
>>>>> about that (and Christian's concerns about possible issues with 
>>>>> walking over all contexts are addressed), I suppose you could just 
>>>>> store a per-context
>>>>>
>>>>>        unsigned context_reset_status;
>>>>>
>>>>> instead of a `bool guilty`. Its value would start out as 0
>>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during 
>>>>> reset.
>>>>>
>>>>> Cheers,
>>>>> Nicolai
>>>>>
>>>>>
>>>>>> *From:*Haehnle, Nicolai
>>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>>>>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
>>>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>>>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>>>>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel 
>>>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; 
>>>>>> Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro 
>>>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario 
>>>>>> <Mario.Filipas at amd.com>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>>      From a Mesa perspective, this almost all sounds reasonable to me.
>>>>>>
>>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. 
>>>>>> OpenGL), so it's reasonable to use it. However, it /does not/ 
>>>>>> make sense to mark idle contexts as "guilty" just because VRAM is 
>>>>>> lost. VRAM lost is a perfect example where the driver should 
>>>>>> report context lost to applications with the "innocent" flag for 
>>>>>> contexts that were idle at the time of reset. The only context(s) 
>>>>>> that should be reported as "guilty"
>>>>>> (or perhaps "unknown" in some cases) are the ones that were 
>>>>>> executing at the time of reset.
>>>>>>
>>>>>> On whether the whole context is marked as guilty from a user 
>>>>>> space perspective, it would simply be nice for user space to get 
>>>>>> consistent answers. It would be a bit odd if we could e.g. 
>>>>>> succeed in submitting an SDMA job after a GFX job was rejected. 
>>>>>> This would point in favor of marking the entire context as guilty 
>>>>>> (although that could happen lazily instead of at reset time). On 
>>>>>> the other hand, if that's too big a burden for the kernel 
>>>>>> implementation I'm sure we can live without it.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Nicolai
>>>>>>
>>>>>> -----------------------------------------------------------------
>>>>>> -
>>>>>> -
>>>>>> ---
>>>>>> --
>>>>>>
>>>>>> *From:*Liu, Monk
>>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>>>>> Alexander
>>>>>> *Cc:* amd-gfx at lists.freedesktop.org 
>>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> 1.Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well or some other error code when we 
>>>>>> find that we need to distinct the timedout job from the canceled 
>>>>>> ones (probably a good idea, but I'm not sure).
>>>>>>
>>>>>> [ML] I’m okay if you insist not to use ETIME
>>>>>>
>>>>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>> [ML] I think context is better than entity, because for example 
>>>>>> if you only block entity_0 of context and allow entity_N run, 
>>>>>> that means the dependency between entities are broken (e.g. page 
>>>>>> table updates in
>>>>>>
>>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make 
>>>>>> sense to me)
>>>>>>
>>>>>> We’d better either block the whole context or let not…
>>>>>>
>>>>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set 
>>>>>> all their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this 
>>>>>> when we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>> [ML] without deep thought and expritment, I’m not sure the 
>>>>>> difference between them, but kick it out in gpu_reset routine is 
>>>>>> more efficient,
>>>>>>
>>>>>> Otherwise you need to check context/entity guilty flag in run_job 
>>>>>> routine …and you need to it for every context/entity, I don’t see 
>>>>>> why
>>>>>>
>>>>>> We don’t just kickout all of them in gpu_reset stage ….
>>>>>>
>>>>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” 
>>>>>> since VRAM lost actually ruins all VRAM contents
>>>>>>
>>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>>> over all contexts is way to much overhead.
>>>>>>
>>>>>> [ML] because I want to make KMS IOCTL rules clean, like they 
>>>>>> don’t need to differentiate VRAM lost or not, they only 
>>>>>> interested in if the context is guilty or not, and block
>>>>>>
>>>>>> Submit for guilty ones.
>>>>>>
>>>>>> *Can you give more details of your idea? And better the detail 
>>>>>> implement in cs_submit, I want to see how you want to block 
>>>>>> submit without checking context guilty flag*
>>>>>>
>>>>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their 
>>>>>> fence status to “*ECANCELDED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>>> not during GPU reset.
>>>>>>
>>>>>> [ML] again, kicking out them in gpu reset routine is high 
>>>>>> efficient, otherwise you need check on every job in run_job()
>>>>>>
>>>>>> Besides, can you illustrate the detail implementation ?
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>>> sync_file debugging (no idea why that made it into the common 
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the 
>>>>>> fence, so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>>> for this?
>>>>>>
>>>>>> [ML] yeah, that’s too confusing, the name sound really the one I 
>>>>>> want to use, we should change it…
>>>>>>
>>>>>> *But look into the implement, I don**’t see why we cannot use it ?
>>>>>> it also finally return the fence->error *
>>>>>>
>>>>>> *From:*Koenig, Christian
>>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>>>>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; 
>>>>>> Haehnle, Nicolai <Nicolai.Haehnle at amd.com 
>>>>>> <mailto:Nicolai.Haehnle at amd.com>>;
>>>>>> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; 
>>>>>> Deucher, Alexander <Alexander.Deucher at amd.com 
>>>>>> <mailto:Alexander.Deucher at amd.com>>
>>>>>> *Cc:* amd-gfx at lists.freedesktop.org 
>>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel 
>>>>>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry
>>>>>> (SW) <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, 
>>>>>> Bingley <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; 
>>>>>> Ramirez, Alejandro <Alejandro.Ramirez at amd.com 
>>>>>> <mailto:Alejandro.Ramirez at amd.com>>;
>>>>>> Filipas, Mario <Mario.Filipas at amd.com 
>>>>>> <mailto:Mario.Filipas at amd.com>>
>>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>>>>
>>>>>> See inline:
>>>>>>
>>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>>>>>
>>>>>>         Hi Christian & Nicolai,
>>>>>>
>>>>>>         We need to achieve some agreements on what should 
>>>>>> MESA/UMD do and
>>>>>>         what should KMD do, *please give your comments with “okay”
>>>>>> or “No”
>>>>>>         and your idea on below items,*
>>>>>>
>>>>>>         ?When a job timed out (set from lockup_timeout kernel 
>>>>>> parameter),
>>>>>>         What KMD should do in TDR routine :
>>>>>>
>>>>>>         1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>>>>         (*gpu_reset_counter* is used to force vm flush after GPU 
>>>>>> reset, out
>>>>>>         of this thread’s scope so no more discussion on it)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         2.Set its fence error status to “*ETIME*”,
>>>>>>
>>>>>> No, as I already explained ETIME is for synchronous operation.
>>>>>>
>>>>>> In other words when we return ETIME from the wait IOCTL it would 
>>>>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>>>>
>>>>>> Please use ECANCELED as well or some other error code when we 
>>>>>> find that we need to distinct the timedout job from the canceled 
>>>>>> ones (probably a good idea, but I'm not sure).
>>>>>>
>>>>>>         3.Find the entity/ctx behind this job, and set this ctx 
>>>>>> as “*guilty*”
>>>>>>
>>>>>> Not sure. Do we want to set the whole context as guilty or just 
>>>>>> the entity?
>>>>>>
>>>>>> Setting the whole contexts as guilty sounds racy to me.
>>>>>>
>>>>>> BTW: We should use a different name than "guilty", maybe just 
>>>>>> "bool canceled;" ?
>>>>>>
>>>>>>         4.Kick out this job from scheduler’s mirror list, so this 
>>>>>> job won’t
>>>>>>         get re-scheduled to ring anymore.
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, 
>>>>>> and set all
>>>>>>         their fence status to “*ECANCELED*”
>>>>>>
>>>>>> Setting ECANCELED should be ok. But I think we should do this 
>>>>>> when we try to run the jobs and not during GPU reset.
>>>>>>
>>>>>>         6.Force signal all fences that get kicked out by above two
>>>>>>         steps,*otherwise UMD will block forever if waiting on 
>>>>>> those
>>>>>> fences*
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         7.Do gpu reset, which is can be some callbacks to let 
>>>>>> bare-metal and
>>>>>>         SR-IOV implement with their favor style
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         8.After reset, KMD need to aware if the VRAM lost happens 
>>>>>> or not,
>>>>>>         bare-metal can implement some function to judge, while 
>>>>>> for SR-IOV I
>>>>>>         prefer to read it from GIM side (for initial version we consider
>>>>>>         it’s always VRAM lost, till GIM side change aligned)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         9.If VRAM lost not hit, continue, otherwise:
>>>>>>
>>>>>>         a)Update adev->*vram_lost_counter*,
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         b)Iterate over all living ctx, and set all ctx as “*guilty*”
>>>>>> since
>>>>>>         VRAM lost actually ruins all VRAM contents
>>>>>>
>>>>>> No, that shouldn't be done by comparing the counters. Iterating 
>>>>>> over all contexts is way to much overhead.
>>>>>>
>>>>>>         c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>>>>>         fence status to “*ECANCELDED*”
>>>>>>
>>>>>> Yes and no, that should be done when we try to run the jobs and 
>>>>>> not during GPU reset.
>>>>>>
>>>>>>         10.Do GTT recovery and VRAM page tables/entries recovery 
>>>>>> (optional,
>>>>>>         do we need it ???)
>>>>>>
>>>>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>>>>> that VRAM is still 100% correct even when it isn't cleared.
>>>>>>
>>>>>>         11.Re-schedule all JOBs remains in mirror list to ring again and
>>>>>>         restart scheduler (for VRAM lost case, no JOB will
>>>>>> re-scheduled)
>>>>>>
>>>>>> Okay.
>>>>>>
>>>>>>         ?For cs_wait() IOCTL:
>>>>>>
>>>>>>         After it found fence signaled, it should check with
>>>>>>         *“dma_fence_get_status” *to see if there is error there,
>>>>>>
>>>>>>         And return the error status of fence
>>>>>>
>>>>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>>>>> sync_file debugging (no idea why that made it into the common 
>>>>>> fence code).
>>>>>>
>>>>>> It was replaced by putting the error code directly into the 
>>>>>> fence, so just reading that one after waiting should be ok.
>>>>>>
>>>>>> Maybe we should fix dma_fence_get_status() to do the right thing 
>>>>>> for this?
>>>>>>
>>>>>>         ?For cs_wait_fences() IOCTL:
>>>>>>
>>>>>>         Similar with above approach
>>>>>>
>>>>>>         ?For cs_submit() IOCTL:
>>>>>>
>>>>>>         It need to check if current ctx been marked as “*guilty*”
>>>>>> and return
>>>>>>         “*ECANCELED*” if so
>>>>>>
>>>>>>         ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>>>
>>>>>>         This way, UMD can also block app from submitting, like @Nicolai
>>>>>>         mentioned, we can cache one copy of *vram_lost_counter* when
>>>>>>         enumerate physical device, and deny all
>>>>>>
>>>>>>         gl-context from submitting if the counter queried bigger 
>>>>>> than that
>>>>>>         one cached in physical device. (looks a little overkill 
>>>>>> to me, but
>>>>>>         easy to implement )
>>>>>>
>>>>>>         UMD can also return error to APP when creating gl-context 
>>>>>> if found
>>>>>>         current queried*vram_lost_counter *bigger than that one 
>>>>>> cached in
>>>>>>         physical device.
>>>>>>
>>>>>> Okay. Already have a patch for this, please review that one if 
>>>>>> you haven't already done so.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>         BTW: I realized that gl-context is a little different 
>>>>>> with kernel’s
>>>>>>         context. Because for kernel. BO is not related with 
>>>>>> context but only
>>>>>>         with FD, while in UMD, BO have a backend
>>>>>>
>>>>>>         gl-context, so block submitting in UMD layer is also 
>>>>>> needed although
>>>>>>         KMD will do its job as bottom line
>>>>>>
>>>>>>         ?Basically “vram_lost_counter” is exposure by kernel to 
>>>>>> let UMD take
>>>>>>         the control of robust extension feature, it will be UMD’s 
>>>>>> call to
>>>>>>         move, KMD only deny “guilty” context from submitting
>>>>>>
>>>>>>         Need your feedback, thx
>>>>>>
>>>>>>         We’d better make TDR feature landed ASAP
>>>>>>
>>>>>>         BR Monk
>>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx