TDR and VRAM lost handling in KMD:

Wed Oct 11 13:27:26 UTC 2017

According to the initial summary, this situation is already considered:
When vram lost hit, all context marked as guilty, and all jobs in guilty context's KFIFO queue will be kicked out 

Now if we move the kick out from gpu_reset to run_job, then I think your question can be answered by:
in run_job(), before each job scheduling, check current vram_lost_counter, compare it with the copy cached during entity init (or context init), skip the job if not match 

BR Monk

-----Original Message-----
From: Zhou, David(ChunMing) 
Sent: Wednesday, October 11, 2017 6:15 PM
To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

Your summary lacks the below issue:

How about the job already pushed in scheduler queue when vram is lost?

Regards,
David Zhou
On 2017年10月11日 17:41, Liu, Monk wrote:
> Okay, let me summary our whole idea together and see if it works:
>
> 1, For cs_submit, always check vram-lost_counter first and reject the 
> submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != 
> adev->vram_lost_counter. That way the vram lost issue can be handled
>
> 2, for cs_submit we still need to check if the incoming context is 
> "AMDGPU_CTX_GUILTY_RESET" or not even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit If it is "AMDGPU_CTX_GUILTY_RESET", correct ?
>
> 3, in gpu_reset() routine, we only mark the hang job's entity as guilty (so we need to add new member in entity structure), and not kick it out in gpu_reset() stage, but we need to set the context behind this entity as " AMDGPU_CTX_GUILTY_RESET"
>    And if reset introduces VRAM LOST, we just update adev->vram_lost_counter, but *don't* change all entity to guilty, so still only the hang job's entity is "guilty"
>    After some entity marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K interface, we need let UMD can know that this context is wrong.
>
> 4, in gpu scheduler's run_job() routine, since it only reads entity, so we skip job scheduling once found the entity is "guilty"
>
>
> Does above sounds good ?
>
>
>
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:26 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
> Deucher, Alexander <Alexander.Deucher at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley 
> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
> Filipas, Mario <Mario.Filipas at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> On 11.10.2017 11:18, Liu, Monk wrote:
>> Let's talk it simple, When vram lost hit,  what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ?
> It should successfully return AMDGPU_CTX_INNOCENT_RESET.
>
>
>> In cs_submit, with vram lost hit, if we don't mark all contexts as 
>> "guilty", how we block its from submitting ? can you show some 
>> implement way
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>       return -ECANCELED;
>
> (where ctx->vram_lost_counter is initialized at context creation time 
> and never changed afterwards)
>
>
>> BTW: the "guilty" here is a new member I want to add to context, it 
>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I 
>> need to unify them and only one place to mark guilty or not
> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made 
> consistent with the rest.
>
> Cheers,
> Nicolai
>
>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: Wednesday, October 11, 2017 5:00 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>> Deucher, Alexander <Alexander.Deucher at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; 
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley 
>> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
>> Filipas, Mario <Mario.Filipas at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD:
>>
>> On 11.10.2017 10:48, Liu, Monk wrote:
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>> is a perfect example where the driver should report context lost to 
>>> applications with the "innocent" flag for contexts that were idle at 
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were 
>>> executing at the time of reset.
>>>
>>> ML: KMD mark all contexts as guilty is because that way we can unify 
>>> our IOCTL behavior: e.g. for IOCTL only block “guilty”context , no 
>>> need to worry about vram-lost-counter anymore, that’s a 
>>> implementation style. I don’t think it is related with UMD layer,
>>>
>>> For UMD the gl-context isn’t aware of by KMD, so UMD can implement 
>>> it own “guilty” gl-context if you want.
>> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
>>
>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
>> mind: this returns one of 
>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT,
>> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset.
>>
>> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
>>
>>
>>> If KMD doesn’t mark all ctx as guilty after VRAM lost, can you 
>>> illustrate what rule KMD should obey to check in KMS IOCTL like 
>>> cs_sumbit ?? let’s see which way better
>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>>        return -ECANCELED;
>>
>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>>
>> Yes, it's one additional check in cs_submit. If you're worried about 
>> that (and Christian's concerns about possible issues with walking 
>> over all contexts are addressed), I suppose you could just store a 
>> per-context
>>
>>      unsigned context_reset_status;
>>
>> instead of a `bool guilty`. Its value would start out as 0
>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
>>
>> Cheers,
>> Nicolai
>>
>>
>>> *From:*Haehnle, Nicolai
>>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel 
>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, 
>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>>    From a Mesa perspective, this almost all sounds reasonable to me.
>>>
>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), 
>>> so it's reasonable to use it. However, it /does not/ make sense to 
>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost 
>>> is a perfect example where the driver should report context lost to 
>>> applications with the "innocent" flag for contexts that were idle at 
>>> the time of reset. The only context(s) that should be reported as "guilty"
>>> (or perhaps "unknown" in some cases) are the ones that were 
>>> executing at the time of reset.
>>>
>>> On whether the whole context is marked as guilty from a user space 
>>> perspective, it would simply be nice for user space to get 
>>> consistent answers. It would be a bit odd if we could e.g. succeed 
>>> in submitting an SDMA job after a GFX job was rejected. This would 
>>> point in favor of marking the entire context as guilty (although 
>>> that could happen lazily instead of at reset time). On the other 
>>> hand, if that's too big a burden for the kernel implementation I'm sure we can live without it.
>>>
>>> Cheers,
>>>
>>> Nicolai
>>>
>>> --------------------------------------------------------------------
>>> --
>>> --
>>>
>>> *From:*Liu, Monk
>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
>>> Alexander
>>> *Cc:* amd-gfx at lists.freedesktop.org
>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>>
>>> 1.Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would 
>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find 
>>> that we need to distinct the timedout job from the canceled ones 
>>> (probably a good idea, but I'm not sure).
>>>
>>> [ML] I’m okay if you insist not to use ETIME
>>>
>>> 1.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole contexts as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>> canceled;" ?
>>>
>>> [ML] I think context is better than entity, because for example if 
>>> you only block entity_0 of context and allow entity_N run, that 
>>> means the dependency between entities are broken (e.g. page table 
>>> updates in
>>>
>>> Sdma entity pass but gfx submit in GFX entity blocked, not make 
>>> sense to me)
>>>
>>> We’d better either block the whole context or let not…
>>>
>>> 1.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all 
>>> their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when 
>>> we try to run the jobs and not during GPU reset.
>>>
>>> [ML] without deep thought and expritment, I’m not sure the 
>>> difference between them, but kick it out in gpu_reset routine is 
>>> more efficient,
>>>
>>> Otherwise you need to check context/entity guilty flag in run_job 
>>> routine …and you need to it for every context/entity, I don’t see 
>>> why
>>>
>>> We don’t just kickout all of them in gpu_reset stage ….
>>>
>>> a)Iterate over all living ctx, and set all ctx as “*guilty*” since 
>>> VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over 
>>> all contexts is way to much overhead.
>>>
>>> [ML] because I want to make KMS IOCTL rules clean, like they don’t 
>>> need to differentiate VRAM lost or not, they only interested in if 
>>> the context is guilty or not, and block
>>>
>>> Submit for guilty ones.
>>>
>>> *Can you give more details of your idea? And better the detail 
>>> implement in cs_submit, I want to see how you want to block submit 
>>> without checking context guilty flag*
>>>
>>> a)Kick out all jobs in all ctx’s KFIFO queue, and set all their 
>>> fence status to “*ECANCELDED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not 
>>> during GPU reset.
>>>
>>> [ML] again, kicking out them in gpu reset routine is high efficient, 
>>> otherwise you need check on every job in run_job()
>>>
>>> Besides, can you illustrate the detail implementation ?
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, 
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>> [ML] yeah, that’s too confusing, the name sound really the one I 
>>> want to use, we should change it…
>>>
>>> *But look into the implement, I don**’t see why we cannot use it ? 
>>> it also finally return the fence->error *
>>>
>>> *From:*Koenig, Christian
>>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; 
>>> Haehnle, Nicolai <Nicolai.Haehnle at amd.com 
>>> <mailto:Nicolai.Haehnle at amd.com>>;
>>> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; 
>>> Deucher, Alexander <Alexander.Deucher at amd.com 
>>> <mailto:Alexander.Deucher at amd.com>>
>>> *Cc:* amd-gfx at lists.freedesktop.org
>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel 
>>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry (SW) 
>>> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, Bingley 
>>> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, Alejandro 
>>> <Alejandro.Ramirez at amd.com <mailto:Alejandro.Ramirez at amd.com>>;
>>> Filipas, Mario <Mario.Filipas at amd.com 
>>> <mailto:Mario.Filipas at amd.com>>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>>
>>> See inline:
>>>
>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>>
>>>       Hi Christian & Nicolai,
>>>
>>>       We need to achieve some agreements on what should MESA/UMD do and
>>>       what should KMD do, *please give your comments with “okay” or “No”
>>>       and your idea on below items,*
>>>
>>>       ?When a job timed out (set from lockup_timeout kernel parameter),
>>>       What KMD should do in TDR routine :
>>>
>>>       1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>>>       (*gpu_reset_counter* is used to force vm flush after GPU reset, out
>>>       of this thread’s scope so no more discussion on it)
>>>
>>> Okay.
>>>
>>>       2.Set its fence error status to “*ETIME*”,
>>>
>>> No, as I already explained ETIME is for synchronous operation.
>>>
>>> In other words when we return ETIME from the wait IOCTL it would 
>>> mean that the waiting has somehow timed out, but not the job we waited for.
>>>
>>> Please use ECANCELED as well or some other error code when we find 
>>> that we need to distinct the timedout job from the canceled ones 
>>> (probably a good idea, but I'm not sure).
>>>
>>>       3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
>>>
>>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>>
>>> Setting the whole contexts as guilty sounds racy to me.
>>>
>>> BTW: We should use a different name than "guilty", maybe just "bool 
>>> canceled;" ?
>>>
>>>       4.Kick out this job from scheduler’s mirror list, so this job won’t
>>>       get re-scheduled to ring anymore.
>>>
>>> Okay.
>>>
>>>       5.Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
>>>       their fence status to “*ECANCELED*”
>>>
>>> Setting ECANCELED should be ok. But I think we should do this when 
>>> we try to run the jobs and not during GPU reset.
>>>
>>>       6.Force signal all fences that get kicked out by above two
>>>       steps,*otherwise UMD will block forever if waiting on those
>>> fences*
>>>
>>> Okay.
>>>
>>>       7.Do gpu reset, which is can be some callbacks to let bare-metal and
>>>       SR-IOV implement with their favor style
>>>
>>> Okay.
>>>
>>>       8.After reset, KMD need to aware if the VRAM lost happens or not,
>>>       bare-metal can implement some function to judge, while for SR-IOV I
>>>       prefer to read it from GIM side (for initial version we consider
>>>       it’s always VRAM lost, till GIM side change aligned)
>>>
>>> Okay.
>>>
>>>       9.If VRAM lost not hit, continue, otherwise:
>>>
>>>       a)Update adev->*vram_lost_counter*,
>>>
>>> Okay.
>>>
>>>       b)Iterate over all living ctx, and set all ctx as “*guilty*” since
>>>       VRAM lost actually ruins all VRAM contents
>>>
>>> No, that shouldn't be done by comparing the counters. Iterating over 
>>> all contexts is way to much overhead.
>>>
>>>       c)Kick out all jobs in all ctx’s KFIFO queue, and set all their
>>>       fence status to “*ECANCELDED*”
>>>
>>> Yes and no, that should be done when we try to run the jobs and not 
>>> during GPU reset.
>>>
>>>       10.Do GTT recovery and VRAM page tables/entries recovery (optional,
>>>       do we need it ???)
>>>
>>> Yes, that is still needed. As Nicolai explained we can't be sure 
>>> that VRAM is still 100% correct even when it isn't cleared.
>>>
>>>       11.Re-schedule all JOBs remains in mirror list to ring again and
>>>       restart scheduler (for VRAM lost case, no JOB will 
>>> re-scheduled)
>>>
>>> Okay.
>>>
>>>       ?For cs_wait() IOCTL:
>>>
>>>       After it found fence signaled, it should check with
>>>       *“dma_fence_get_status” *to see if there is error there,
>>>
>>>       And return the error status of fence
>>>
>>> Yes and no, dma_fence_get_status() is some specific handling for 
>>> sync_file debugging (no idea why that made it into the common fence code).
>>>
>>> It was replaced by putting the error code directly into the fence, 
>>> so just reading that one after waiting should be ok.
>>>
>>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>>
>>>       ?For cs_wait_fences() IOCTL:
>>>
>>>       Similar with above approach
>>>
>>>       ?For cs_submit() IOCTL:
>>>
>>>       It need to check if current ctx been marked as “*guilty*” and return
>>>       “*ECANCELED*” if so
>>>
>>>       ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>
>>>       This way, UMD can also block app from submitting, like @Nicolai
>>>       mentioned, we can cache one copy of *vram_lost_counter* when
>>>       enumerate physical device, and deny all
>>>
>>>       gl-context from submitting if the counter queried bigger than that
>>>       one cached in physical device. (looks a little overkill to me, but
>>>       easy to implement )
>>>
>>>       UMD can also return error to APP when creating gl-context if found
>>>       current queried*vram_lost_counter *bigger than that one cached in
>>>       physical device.
>>>
>>> Okay. Already have a patch for this, please review that one if you 
>>> haven't already done so.
>>>
>>> Regards,
>>> Christian.
>>>
>>>       BTW: I realized that gl-context is a little different with kernel’s
>>>       context. Because for kernel. BO is not related with context but only
>>>       with FD, while in UMD, BO have a backend
>>>
>>>       gl-context, so block submitting in UMD layer is also needed although
>>>       KMD will do its job as bottom line
>>>
>>>       ?Basically “vram_lost_counter” is exposure by kernel to let UMD take
>>>       the control of robust extension feature, it will be UMD’s call to
>>>       move, KMD only deny “guilty” context from submitting
>>>
>>>       Need your feedback, thx
>>>
>>>       We’d better make TDR feature landed ASAP
>>>
>>>       BR Monk
>>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx