TDR and VRAM lost handling in KMD (v2)
Nicolai Hähnle
nicolai.haehnle at amd.com
Fri Oct 13 07:43:04 UTC 2017
On 13.10.2017 05:57, Liu, Monk wrote:
>>That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
>
> unfortunately the current scheme on amdgpu_ctx_query() won’t work with
> TDR feature, which is aim to support vulkan/mesa/close-ogl/radv …
>
> It’s enumeration is too limited to MESA implement …
>
> *Do you have good idea ? both keep the compatibility here and give
> flexible ?*
>
> *looks like we need to add a new amdgpu_ctx_query_2() INTERFACE ….*
>
> ·*A new IOCTL added for context:*
>
> Void amdgpu_ctx_reinit(){
>
> Ctx→vram_lost_counter = adev→vram_lost_counter;
>
> Ctx→reset_counter = adev→reset_counter;
>
> }
>
>
> Mhm, is there any advantage to just creating a new context?
>
> [ML] sorry, this function is wrong, here is my original idea:
>
> MESA can create a new ctx based on an old one,like:
>
> Create gl-ctx1,
>
> Create BO-A under gl-ctx1 …
>
> VRAM LOST
>
> Create gl-ctx2 from gl-ctx1 (share list, I’m not familiar with UMD,
> David Mao an Nicolai can input)
>
> Create BO-b UNDER gl-ctx2 …
>
> Submit job upon gl-ctx2, but it can refer to BO-A,
>
> With our design, kernel won’t block submit from context2 (from gl-ctx2)
> since its vram_lost_counter equals to latest adev copy
>
> But gl-ctx2 is a clone from gl-ctx1, the only difference is the
> vram_lost/gpu_reset counter is updated to latest one
>
> So logically we should also block submit from gl-ctx2 (mapping to kernel
> context2), and we failed do so …
>
> That’s why I want to add a new “amdgpu_ctx_clone”, which should work like:
>
> Int amdgpu_ctx_clone(struct context *original, struct context *new) {
>
> New->vram_lost_counter = original->vram_lost_counter;
>
> New->gpu_reset_counter = original->gpu_reset_counter;
>
> }
From the Mesa perspective, I don't think we would need to use that as
long as we can query the device VRAM lost counter.
Personally, I think it's better for the UMD to build its policy on
lower-level primitives such as the VRAM lost counter query.
Cheers,
Nicolai
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月12日19:50
> *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai
> <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak,
> Marek <Marek.Olsak at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Zhou, David(ChunMing)
> <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD (v2)
>
> Am 12.10.2017 um 13:37 schrieb Liu, Monk:
>
> Hi team
>
> Very good, many policy and implement are agreed, looks we only have
> some arguments in amdgpu_ctx_query(), well I also confused with the
> current implement of it, ☹
>
> First, I want to know if you guys agree that we*don't update
> ctx->reset_counter in amdgpu_ctx_query() *? because I want to make
> the query result always consistent upon a given context,
>
>
> That sounds like a good idea to me, but I'm not sure if it won't break
> userspace (I don't think so). Nicolai or Marek need to comment.
>
>
> Second, I want to say that for KERNEL, it shouldn't use the term
> from MESA or OGL or VULKAN, e.g. kernel shouldn't use
> AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
>
> Because that way kernel will be very limited to certain UMD, so I
> suggest we totally re-name the context status, and each UMD has its
> own way to map the kernel context's result to gl-context/vk-context/etc…
>
>
> Yes, completely agree.
>
>
> Kernel should only provide below **FLAG bits** on a given context:
>
> ·Define AMDGPU_CTX_STATE_GUILTY 0x1 //as long as TDR
> detects a job hang, KMD set the context behind this context as "guilty"
>
> ·Define AMDGPU_CTX_STATE_VRAMLOST 0x2 //as long as
> there is a VRAM lost event hit after this context created, we mark
> this context "VRAM_LOST", so UMD can say that all BO under this
> context may lost their content, since kernel have no relationship
> between context and BO so this is UMD's call to judge if a BO
> considered "VRAM lost" or not.
>
> ·Define AMDGPU_CTX_STATE_RESET 0x3 //as long as there is a gpu
> reset occurred after context creation, this flag shall be set
>
>
> That sounds sane, but unfortunately might not be possible with the
> existing IOCTL. Keep in mind that we need to keep backward compatibility
> here.
>
>
> Sample code:
>
> Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, …..) {
>
> if (ctx- >vram_lost_counter != adev->vram_lost_counter)
>
> out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;
>
> if (ctx- >reset_counter != adev→reset_counter) {
>
> out- >status |= AMDGPU_CTX_STATE_RESET;
>
> if (ctx- >guilty == TRUE)
>
> out- >status |= AMDGPU_CTX_STATE_GUILTY;
>
> }
>
> return 0;
>
> }
>
> For UMD if it found "out.status == 0" means there is no gpu reset
> even occurred, the context is totally regular,
>
> ·*A new IOCTL added for context:*
>
> Void amdgpu_ctx_reinit(){
>
> Ctx→vram_lost_counter = adev→vram_lost_counter;
>
> Ctx→reset_counter = adev→reset_counter;
>
> }
>
>
> Mhm, is there any advantage to just creating a new context?
>
> Regards,
> Christian.
>
>
> *if UMD decide *not* to release the "guilty" context but continue
> using it after UMD acknowledged GPU hang on certain job/context, I
> suggest UMD call "amdgpu_ctx_reinit()":*
>
> That way after you re-init() this context, you can get updated
> result from "amdgpu_ctx_query", which will probably give you
> "out.status == 0" as long as no gpu reset/vram lost hit after re-init().
>
> BR Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月12日18:13
> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
> <mailto:Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>
> <mailto:michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>
> <mailto:Marek.Olsak at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>;
> Zhou, David(ChunMing) <David1.Zhou at amd.com>
> <mailto:David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> <mailto:David.Mao at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>
> <mailto:Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario
> <Mario.Filipas at amd.com> <mailto:Mario.Filipas at amd.com>; Ding, Pixel
> <Pixel.Ding at amd.com> <mailto:Pixel.Ding at amd.com>; Li, Bingley
> <Bingley.Li at amd.com> <mailto:Bingley.Li at amd.com>; Jiang, Jerry (SW)
> <Jerry.Jiang at amd.com> <mailto:Jerry.Jiang at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>
> Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
>
> > On 12.10.2017 11:35, Michel Dänzer wrote:
>
> >> On 12/10/17 11:23 AM, Christian König wrote:
>
> >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>
> >>>> On 12.10.2017 10:49, Christian König wrote:
>
> >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
>
> >>>>>> does not imply that the context was lost.
>
> >>>>>>
>
> >>>>>> The way I understand it, we should return
>
> >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>
> >>>>>>
>
> >>>>>> As far as I understand it, the case of !guilty &&
>
> >>>>>> ctx->reset_counter != adev->reset_counter &&
>
> >>>>>> ctx->vram_lost_counter ==
>
> >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>
> >>>>>> adev->because a
>
> >>>>>> GPU reset occurred, but it didn't affect our context.
>
> >>>>> I disagree on that.
>
> >>>>>
>
> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>
> >>>>> was a reset but we haven't been causing it.
>
> >>>>>
>
> >>>>> That the OpenGL extension is specified otherwise is unfortunate,
>
> >>>>> but I think we shouldn't use that for the kernel interface here.
>
> >>>> Two counterpoints:
>
> >>>>
>
> >>>> 1. Why should any application care that there was a reset while it
>
> >>>> was idle? The OpenGL behavior is what makes sense.
>
> >>>
>
> >>> The application is certainly not interest if a reset happened or
>
> >>> not, but I though that the driver stack might be.
>
> >>>
>
> >>>>
>
> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>
> >>>> because we never return it :)
>
> >>>>
>
> >>>
>
> >>> Good point.
>
> >>>
>
> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
>
> >>>> is in line with the OpenGL spec: we're conservatively returning
>
> >>>> that a reset happened because we don't know whether the context was
>
> >>>> affected, and we return UNKNOWN because we also don't know whether
>
> >>>> the context was guilty or not.
>
> >>>>
>
> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>
> >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>
> >>>> refinement and improvement of the current, overly conservative
>
> >>>> behavior.
>
> >>>
>
> >>> Ok let's reenumerate what I think the different return values should
>
> >>> mean:
>
> >>>
>
> >>> * AMDGPU_CTX_GUILTY_RESET
>
> >>>
>
> >>> guilty is set to true for this context.
>
> >>>
>
> >>> * AMDGPU_CTX_INNOCENT_RESET
>
> >>>
>
> >>> guilty is not set and vram_lost_counter has changed.
>
> >>>
>
> >>> * AMDGPU_CTX_UNKNOWN_RESET
>
> >>>
>
> >>> guilty is not set and vram_lost_counter has not changed, but
>
> >>> gpu_reset_counter has changed.
>
> >>
>
> >> I don't understand the distinction you're proposing between
>
> >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
>
> >> cases you're describing should return either
>
> >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
>
> >> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>
> >
>
> > I think it'd make more sense if it was called
>
> > "AMDGPU_CTX_UNAFFECTED_RESET".
>
> >
>
> > So:
>
> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
>
> > we know that it's the context's fault
>
> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
>
> > and we know that it *wasn't* the context's fault (no context job
>
> > active)
>
> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
>
> > and we don't know who's responsible (this could be returned in the
>
> > unlikely case where context A's gfx job has not yet finished, but
>
> > context B's gfx job has already started; it could be the fault of A,
>
> > it could be the fault of B -- which somehow manages to hang a part of
>
> > the hardware that then prevents A's job from finishing -- or it could
>
> > be both; but it's a bit academic)
>
> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
>
> > wasn't affected
>
> >
>
> > This last value would currently just be discarded by Mesa (because we
>
> > should only disturb applications when we have to), but perhaps
>
> > somebody else could find it useful?
>
> Yes, that's what I had in mind as well.
>
> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there
> actually was a reset and that certainly doesn't sound correct to me.
>
> Regards,
>
> Christian.
>
> >
>
> > Cheers,
>
> > Nicolai
>
More information about the amd-gfx
mailing list