TDR and VRAM lost handling in KMD (v2)
Nicolai Hähnle
nicolai.haehnle at amd.com
Fri Oct 13 21:44:34 UTC 2017
On 13.10.2017 14:04, Liu, Monk wrote:
> That's what I suggested, look to know it's agreed
Yeah, that's fine with me as well.
Cheers,
Nicolai
>
> BR Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月13日 20:01
> To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>
> I think the best approach is to keep it as it is right now and don't change a thing.
>
> And we add a new IOCTL with a bit more sane return values. E.g. guilty status and VRAM lost status as flags.
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 13:51 schrieb Liu, Monk:
>> Ping Christian & Nicolai
>>
>> This ctx_query() is a little annoying to me
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>> Of Liu, Monk
>> Sent: 2017年10月13日 17:19
>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak,
>> Marek <Marek.Olsak at amd.com>; Deucher, Alexander
>> <Alexander.Deucher at amd.com>; Zhou, David(ChunMing)
>> <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
>> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: RE: TDR and VRAM lost handling in KMD (v2)
>>
>> Alright, if MESA can handle clone context's VRAM_LOST_COUNTER mismatch
>> issue, no need to introduce one more U/K in kmd,
>>
>> So we have only one issue unresolved and need determined ASAP:
>>
>> How to modify amdgpu_ctx_query() ??
>>
>> Current design won't work later with our discussion on the TDR (v2) right ? unless MESA stop calling this ctx_query() and totally depend on new introduced interface that get latest vram-lost-counter to judge Context healthy.
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: 2017年10月13日 15:43
>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak,
>> Marek <Marek.Olsak at amd.com>; Deucher, Alexander
>> <Alexander.Deucher at amd.com>; Zhou, David(ChunMing)
>> <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
>> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>
>> On 13.10.2017 05:57, Liu, Monk wrote:
>>>> That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
>>> unfortunately the current scheme on amdgpu_ctx_query() won’t work
>>> with TDR feature, which is aim to support vulkan/mesa/close-ogl/radv
>>> …
>>>
>>> It’s enumeration is too limited to MESA implement …
>>>
>>> *Do you have good idea ? both keep the compatibility here and give
>>> flexible ?*
>>>
>>> *looks like we need to add a new amdgpu_ctx_query_2() INTERFACE ….*
>>>
>>> ·*A new IOCTL added for context:*
>>>
>>> Void amdgpu_ctx_reinit(){
>>>
>>> Ctx→vram_lost_counter = adev→vram_lost_counter;
>>>
>>> Ctx→reset_counter = adev→reset_counter;
>>>
>>> }
>>>
>>>
>>> Mhm, is there any advantage to just creating a new context?
>>>
>>> [ML] sorry, this function is wrong, here is my original idea:
>>>
>>> MESA can create a new ctx based on an old one,like:
>>>
>>> Create gl-ctx1,
>>>
>>> Create BO-A under gl-ctx1 …
>>>
>>> VRAM LOST
>>>
>>> Create gl-ctx2 from gl-ctx1 (share list, I’m not familiar with UMD,
>>> David Mao an Nicolai can input)
>>>
>>> Create BO-b UNDER gl-ctx2 …
>>>
>>> Submit job upon gl-ctx2, but it can refer to BO-A,
>>>
>>> With our design, kernel won’t block submit from context2 (from
>>> gl-ctx2) since its vram_lost_counter equals to latest adev copy
>>>
>>> But gl-ctx2 is a clone from gl-ctx1, the only difference is the
>>> vram_lost/gpu_reset counter is updated to latest one
>>>
>>> So logically we should also block submit from gl-ctx2 (mapping to
>>> kernel context2), and we failed do so …
>>>
>>> That’s why I want to add a new “amdgpu_ctx_clone”, which should work like:
>>>
>>> Int amdgpu_ctx_clone(struct context *original, struct context *new) {
>>>
>>> New->vram_lost_counter = original->vram_lost_counter;
>>>
>>> New->gpu_reset_counter = original->gpu_reset_counter;
>>>
>>> }
>> From the Mesa perspective, I don't think we would need to use that as long as we can query the device VRAM lost counter.
>>
>> Personally, I think it's better for the UMD to build its policy on lower-level primitives such as the VRAM lost counter query.
>>
>> Cheers,
>> Nicolai
>>
>>> BR Monk
>>>
>>> *From:*Koenig, Christian
>>> *Sent:* 2017年10月12日19:50
>>> *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai
>>> <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak,
>>> Marek <Marek.Olsak at amd.com>; Deucher, Alexander
>>> <Alexander.Deucher at amd.com>; Zhou, David(ChunMing)
>>> <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>>> amd-gfx at lists.freedesktop.org; Filipas, Mario
>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li,
>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>> *Subject:* Re: TDR and VRAM lost handling in KMD (v2)
>>>
>>> Am 12.10.2017 um 13:37 schrieb Liu, Monk:
>>>
>>> Hi team
>>>
>>> Very good, many policy and implement are agreed, looks we only have
>>> some arguments in amdgpu_ctx_query(), well I also confused with the
>>> current implement of it, ☹
>>>
>>> First, I want to know if you guys agree that we*don't update
>>> ctx->reset_counter in amdgpu_ctx_query() *? because I want to make
>>> the query result always consistent upon a given context,
>>>
>>>
>>> That sounds like a good idea to me, but I'm not sure if it won't
>>> break userspace (I don't think so). Nicolai or Marek need to comment.
>>>
>>>
>>> Second, I want to say that for KERNEL, it shouldn't use the term
>>> from MESA or OGL or VULKAN, e.g. kernel shouldn't use
>>> AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
>>>
>>> Because that way kernel will be very limited to certain UMD, so I
>>> suggest we totally re-name the context status, and each UMD has its
>>> own way to map the kernel context's result to
>>> gl-context/vk-context/etc…
>>>
>>>
>>> Yes, completely agree.
>>>
>>>
>>> Kernel should only provide below **FLAG bits** on a given context:
>>>
>>> ·Define AMDGPU_CTX_STATE_GUILTY 0x1 //as long as TDR
>>> detects a job hang, KMD set the context behind this context as "guilty"
>>>
>>> ·Define AMDGPU_CTX_STATE_VRAMLOST 0x2 //as long as
>>> there is a VRAM lost event hit after this context created, we mark
>>> this context "VRAM_LOST", so UMD can say that all BO under this
>>> context may lost their content, since kernel have no relationship
>>> between context and BO so this is UMD's call to judge if a BO
>>> considered "VRAM lost" or not.
>>>
>>> ·Define AMDGPU_CTX_STATE_RESET 0x3 //as long as there is a gpu
>>> reset occurred after context creation, this flag shall be set
>>>
>>>
>>> That sounds sane, but unfortunately might not be possible with the
>>> existing IOCTL. Keep in mind that we need to keep backward
>>> compatibility here.
>>>
>>>
>>> Sample code:
>>>
>>> Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, …..) {
>>>
>>> if (ctx- >vram_lost_counter != adev->vram_lost_counter)
>>>
>>> out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;
>>>
>>> if (ctx- >reset_counter != adev→reset_counter) {
>>>
>>> out- >status |= AMDGPU_CTX_STATE_RESET;
>>>
>>> if (ctx- >guilty == TRUE)
>>>
>>> out- >status |=
>>> AMDGPU_CTX_STATE_GUILTY;
>>>
>>> }
>>>
>>> return 0;
>>>
>>> }
>>>
>>> For UMD if it found "out.status == 0" means there is no gpu reset
>>> even occurred, the context is totally regular,
>>>
>>> ·*A new IOCTL added for context:*
>>>
>>> Void amdgpu_ctx_reinit(){
>>>
>>> Ctx→vram_lost_counter = adev→vram_lost_counter;
>>>
>>> Ctx→reset_counter = adev→reset_counter;
>>>
>>> }
>>>
>>>
>>> Mhm, is there any advantage to just creating a new context?
>>>
>>> Regards,
>>> Christian.
>>>
>>>
>>> *if UMD decide *not* to release the "guilty" context but continue
>>> using it after UMD acknowledged GPU hang on certain job/context, I
>>> suggest UMD call "amdgpu_ctx_reinit()":*
>>>
>>> That way after you re-init() this context, you can get updated
>>> result from "amdgpu_ctx_query", which will probably give you
>>> "out.status == 0" as long as no gpu reset/vram lost hit after re-init().
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2017年10月12日18:13
>>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>> <mailto:Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>
>>> <mailto:michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>
>>> <mailto:Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>
>>> <mailto:Marek.Olsak at amd.com>; Deucher, Alexander
>>> <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>;
>>> Zhou, David(ChunMing) <David1.Zhou at amd.com>
>>> <mailto:David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>>> <mailto:David.Mao at amd.com>
>>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>
>>> <mailto:Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org
>>> <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario
>>> <Mario.Filipas at amd.com> <mailto:Mario.Filipas at amd.com>; Ding, Pixel
>>> <Pixel.Ding at amd.com> <mailto:Pixel.Ding at amd.com>; Li, Bingley
>>> <Bingley.Li at amd.com> <mailto:Bingley.Li at amd.com>; Jiang, Jerry (SW)
>>> <Jerry.Jiang at amd.com> <mailto:Jerry.Jiang at amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>>
>>> Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
>>>
>>> > On 12.10.2017 11:35, Michel Dänzer wrote:
>>>
>>> >> On 12/10/17 11:23 AM, Christian König wrote:
>>>
>>> >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>>>
>>> >>>> On 12.10.2017 10:49, Christian König wrote:
>>>
>>> >>>>>> However, !guilty && ctx->reset_counter !=
>>> adev->reset_counter
>>>
>>> >>>>>> does not imply that the context was lost.
>>>
>>> >>>>>>
>>>
>>> >>>>>> The way I understand it, we should return
>>>
>>> >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>
>>> >>>>>>
>>>
>>> >>>>>> As far as I understand it, the case of !guilty &&
>>>
>>> >>>>>> ctx->reset_counter != adev->reset_counter &&
>>>
>>> >>>>>> ctx->vram_lost_counter ==
>>>
>>> >>>>>> adev->vram_lost_counter should return
>>> AMDGPU_CTX_NO_RESET,
>>>
>>> >>>>>> adev->because a
>>>
>>> >>>>>> GPU reset occurred, but it didn't affect our context.
>>>
>>> >>>>> I disagree on that.
>>>
>>> >>>>>
>>>
>>> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does
>>> currently, there
>>>
>>> >>>>> was a reset but we haven't been causing it.
>>>
>>> >>>>>
>>>
>>> >>>>> That the OpenGL extension is specified otherwise is
>>> unfortunate,
>>>
>>> >>>>> but I think we shouldn't use that for the kernel interface here.
>>>
>>> >>>> Two counterpoints:
>>>
>>> >>>>
>>>
>>> >>>> 1. Why should any application care that there was a reset
>>> while it
>>>
>>> >>>> was idle? The OpenGL behavior is what makes sense.
>>>
>>> >>>
>>>
>>> >>> The application is certainly not interest if a reset
>>> happened or
>>>
>>> >>> not, but I though that the driver stack might be.
>>>
>>> >>>
>>>
>>> >>>>
>>>
>>> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything
>>> today
>>>
>>> >>>> because we never return it :)
>>>
>>> >>>>
>>>
>>> >>>
>>>
>>> >>> Good point.
>>>
>>> >>>
>>>
>>> >>>> amdgpu_ctx_query only ever returns
>>> AMDGPU_CTX_UNKNOWN_RESET, which
>>>
>>> >>>> is in line with the OpenGL spec: we're conservatively
>>> returning
>>>
>>> >>>> that a reset happened because we don't know whether the
>>> context was
>>>
>>> >>>> affected, and we return UNKNOWN because we also don't know
>>> whether
>>>
>>> >>>> the context was guilty or not.
>>>
>>> >>>>
>>>
>>> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>>
>>> >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply
>>> a
>>>
>>> >>>> refinement and improvement of the current, overly
>>> conservative
>>>
>>> >>>> behavior.
>>>
>>> >>>
>>>
>>> >>> Ok let's reenumerate what I think the different return
>>> values should
>>>
>>> >>> mean:
>>>
>>> >>>
>>>
>>> >>> * AMDGPU_CTX_GUILTY_RESET
>>>
>>> >>>
>>>
>>> >>> guilty is set to true for this context.
>>>
>>> >>>
>>>
>>> >>> * AMDGPU_CTX_INNOCENT_RESET
>>>
>>> >>>
>>>
>>> >>> guilty is not set and vram_lost_counter has changed.
>>>
>>> >>>
>>>
>>> >>> * AMDGPU_CTX_UNKNOWN_RESET
>>>
>>> >>>
>>>
>>> >>> guilty is not set and vram_lost_counter has not changed, but
>>>
>>> >>> gpu_reset_counter has changed.
>>>
>>> >>
>>>
>>> >> I don't understand the distinction you're proposing between
>>>
>>> >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I
>>> think both
>>>
>>> >> cases you're describing should return either
>>>
>>> >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is
>>> reliable, or
>>>
>>> >> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>>>
>>> >
>>>
>>> > I think it'd make more sense if it was called
>>>
>>> > "AMDGPU_CTX_UNAFFECTED_RESET".
>>>
>>> >
>>>
>>> > So:
>>>
>>> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a
>>> reset, and
>>>
>>> > we know that it's the context's fault
>>>
>>> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a
>>> reset,
>>>
>>> > and we know that it *wasn't* the context's fault (no context
>>> job
>>>
>>> > active)
>>>
>>> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a
>>> reset,
>>>
>>> > and we don't know who's responsible (this could be returned in
>>> the
>>>
>>> > unlikely case where context A's gfx job has not yet finished,
>>> but
>>>
>>> > context B's gfx job has already started; it could be the fault
>>> of A,
>>>
>>> > it could be the fault of B -- which somehow manages to hang a
>>> part of
>>>
>>> > the hardware that then prevents A's job from finishing -- or
>>> it could
>>>
>>> > be both; but it's a bit academic)
>>>
>>> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this
>>> context
>>>
>>> > wasn't affected
>>>
>>> >
>>>
>>> > This last value would currently just be discarded by Mesa
>>> (because we
>>>
>>> > should only disturb applications when we have to), but perhaps
>>>
>>> > somebody else could find it useful?
>>>
>>> Yes, that's what I had in mind as well.
>>>
>>> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there
>>> actually was a reset and that certainly doesn't sound correct to me.
>>>
>>> Regards,
>>>
>>> Christian.
>>>
>>> >
>>>
>>> > Cheers,
>>>
>>> > Nicolai
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
More information about the amd-gfx
mailing list