TDR and VRAM lost handling in KMD (v2)

Thu Oct 12 12:00:56 UTC 2017

On 12.10.2017 13:49, Christian König wrote:
> Am 12.10.2017 um 13:37 schrieb Liu, Monk:
>> Hi team
>> Very good, many policy and implement are agreed, looks we only have 
>> some arguments in amdgpu_ctx_query(), well I also confused with the 
>> current implement of it, ☹
>> First, I want to know if you guys agree that we*don't update 
>> ctx->reset_counter in amdgpu_ctx_query() *? because I want to make the 
>> query result always consistent upon a given context,
> 
> That sounds like a good idea to me, but I'm not sure if it won't break 
> userspace (I don't think so). Nicolai or Marek need to comment.

That should be fine for Mesa.

>> Second, I want to say that for KERNEL, it shouldn't use the term from 
>> MESA or OGL or VULKAN, e.g. kernel shouldn't use AMDGPU_INNOCENT_RESET 
>> to map to GL_INNOCENT_RESET_ARB, etc...
>> Because that way kernel will be very limited to certain UMD, so I 
>> suggest we totally re-name the context status, and each UMD has its 
>> own way to map the kernel context's result to gl-context/vk-context/etc…
> 
> Yes, completely agree.
> 
>> Kernel should only provide below **FLAG bits** on a given context:
>>
>>   * Define AMDGPU_CTX_STATE_GUILTY 0x1             //as long as TDR
>>     detects a job hang, KMD set the context behind this context as
>>     "guilty"
>>   * Define AMDGPU_CTX_STATE_VRAMLOST         0x2      //as long as
>>     there is a VRAM lost event hit after this context created, we mark
>>     this context "VRAM_LOST", so UMD can say that all BO under this
>>     context may lost their content, since kernel have no relationship
>>     between context and BO so this is UMD's call to judge if a BO
>>     considered "VRAM lost" or not.
>>   * Define AMDGPU_CTX_STATE_RESET   0x3     //as long as there is a
>>     gpu reset occurred after context creation, this flag shall be set
>>
> 
> That sounds sane, but unfortunately might not be possible with the 
> existing IOCTL. Keep in mind that we need to keep backward compatibility 
> here.

Agreed. FWIW, when Mesa sees an unknown (new) value being returned from 
amdgpu_ctx_query, it treats it the same as AMDGPU_CTX_NO_RESET.

Cheers,
Nicolai

>> Sample code:
>> Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, …..) {
>>         if (ctx- >vram_lost_counter != adev->vram_lost_counter)
>>                 out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;
>> if (ctx- >reset_counter != adev→reset_counter){
>> out- >status |= AMDGPU_CTX_STATE_RESET;
>> if (ctx- >guilty == TRUE)
>>                         out- >status |= AMDGPU_CTX_STATE_GUILTY;
>> }
>> return 0;
>> }
>> For UMD if it found "out.status == 0" means there is no gpu reset even 
>> occurred, the context is totally regular,
>>
>>   * *A****new IOCTL added for context:*
>>
>> Void amdgpu_ctx_reinit(){
>>         Ctx→vram_lost_counter = adev→vram_lost_counter;
>>         Ctx→reset_counter = adev→reset_counter;
>> }
> 
> Mhm, is there any advantage to just creating a new context?
> 
> Regards,
> Christian.
> 
>> *if**UMD decide *not* to release the "guilty" context but continue 
>> using it **after**UMD acknowledged GPU hang **on certain job/context, 
>> I suggest **UMD call "amdgpu_ctx_reinit()":*
>> That way after you re-init() this context, you can get updated result 
>> from "amdgpu_ctx_query", which will probably give you "out.status == 
>> 0" as long as no gpu reset/vram lost hit after re-init().
>> BR Monk
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2017年10月12日 18:13
>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer 
>> <michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>; Olsak, Marek 
>> <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; 
>> Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David 
>> <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
>> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; 
>> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; 
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>> Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
>> > On 12.10.2017 11:35, Michel Dänzer wrote:
>> >> On 12/10/17 11:23 AM, Christian König wrote:
>> >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>> >>>> On 12.10.2017 10:49, Christian König wrote:
>> >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
>> >>>>>> does not imply that the context was lost.
>> >>>>>>
>> >>>>>> The way I understand it, we should return
>> >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>> >>>>>>
>> >>>>>> As far as I understand it, the case of !guilty &&
>> >>>>>> ctx->reset_counter != adev->reset_counter &&
>> >>>>>> ctx->vram_lost_counter ==
>> >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>> >>>>>> adev->because a
>> >>>>>> GPU reset occurred, but it didn't affect our context.
>> >>>>> I disagree on that.
>> >>>>>
>> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>> >>>>> was a reset but we haven't been causing it.
>> >>>>>
>> >>>>> That the OpenGL extension is specified otherwise is unfortunate,
>> >>>>> but I think we shouldn't use that for the kernel interface here.
>> >>>> Two counterpoints:
>> >>>>
>> >>>> 1. Why should any application care that there was a reset while it
>> >>>> was idle? The OpenGL behavior is what makes sense.
>> >>>
>> >>> The application is certainly not interest if a reset happened or
>> >>> not, but I though that the driver stack might be.
>> >>>
>> >>>>
>> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>> >>>> because we never return it :)
>> >>>>
>> >>>
>> >>> Good point.
>> >>>
>> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
>> >>>> is in line with the OpenGL spec: we're conservatively returning
>> >>>> that a reset happened because we don't know whether the context was
>> >>>> affected, and we return UNKNOWN because we also don't know whether
>> >>>> the context was guilty or not.
>> >>>>
>> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>> >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>> >>>> refinement and improvement of the current, overly conservative
>> >>>> behavior.
>> >>>
>> >>> Ok let's reenumerate what I think the different return values should
>> >>> mean:
>> >>>
>> >>> * AMDGPU_CTX_GUILTY_RESET
>> >>>
>> >>> guilty is set to true for this context.
>> >>>
>> >>> * AMDGPU_CTX_INNOCENT_RESET
>> >>>
>> >>> guilty is not set and vram_lost_counter has changed.
>> >>>
>> >>> * AMDGPU_CTX_UNKNOWN_RESET
>> >>>
>> >>> guilty is not set and vram_lost_counter has not changed, but
>> >>> gpu_reset_counter has changed.
>> >>
>> >> I don't understand the distinction you're proposing between
>> >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
>> >> cases you're describing should return either 
>> >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
>> >> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>> >
>> > I think it'd make more sense if it was called 
>> > "AMDGPU_CTX_UNAFFECTED_RESET".
>> >
>> > So:
>> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
>> > we know that it's the context's fault
>> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
>> > and we know that it *wasn't* the context's fault (no context job
>> > active)
>> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
>> > and we don't know who's responsible (this could be returned in the
>> > unlikely case where context A's gfx job has not yet finished, but
>> > context B's gfx job has already started; it could be the fault of A,
>> > it could be the fault of B -- which somehow manages to hang a part of
>> > the hardware that then prevents A's job from finishing -- or it could
>> > be both; but it's a bit academic)
>> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
>> > wasn't affected
>> >
>> > This last value would currently just be discarded by Mesa (because we
>> > should only disturb applications when we have to), but perhaps
>> > somebody else could find it useful?
>> Yes, that's what I had in mind as well.
>> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there 
>> actually was a reset and that certainly doesn't sound correct to me.
>> Regards,
>> Christian.
>> >
>> > Cheers,
>> > Nicolai
> 
>