TDR and VRAM lost handling in KMD:
Christian König
christian.koenig at amd.com
Wed Oct 11 09:02:47 UTC 2017
> [ML] I think context is better than entity, because for example if
> you only block entity_0 of a context and allow entity_N to run, that
> means the dependencies between entities are broken (e.g. page table
> updates in the SDMA entity pass but a GFX submit in the GFX entity is
> blocked, which makes no sense to me).
>
> We’d better either block the whole context or not…
>
Page table updates are not part of any context.
So I think the only thing we can do is to mark the entity as not
scheduled any more.
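
A minimal sketch of what I mean, assuming we add a flag to the
scheduler entity; the "canceled" field and the helper are hypothetical,
not existing code:

    /* TDR: stop only the entity behind the timed out job from being
     * scheduled again. Page table updates run in the VM's own SDMA
     * entity, so they keep working. */
    static void amd_sched_entity_cancel(struct amd_sched_entity *entity)
    {
            entity->canceled = true;        /* hypothetical flag */
    }
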
> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
> their fence status to “*ECANCELED*”.
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I’m not sure about the
> difference between them, but kicking them out in the gpu_reset
> routine is more efficient.
>
I really don't think so. Kicking them out during gpu_reset sounds racy
to me once more.
And marking them canceled when we try to run them has the clear
advantage that all dependencies are met first.
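
A rough sketch of that ordering, reusing the hypothetical "canceled"
marker from above (job_entity_canceled() and the submit call at the end
are illustrative stand-ins; dma_fence_set_error() is the real dma-fence
helper):

    static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
    {
            struct amdgpu_job *job = to_amdgpu_job(sched_job);

            /* run_job is only called once all dependencies signaled, so
             * page table updates the job relies on have already run. */
            if (job_entity_canceled(sched_job)) {
                    dma_fence_set_error(&sched_job->s_fence->finished,
                                        -ECANCELED);
                    return NULL;    /* job never reaches the ring */
            }

            return amdgpu_submit_to_ring(job);
    }
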
> ML: KMD marks all contexts as guilty because that way we can unify
> our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context,
> with no need to worry about vram_lost_counter anymore; that’s an
> implementation style. I don’t think it is related to the UMD layer.
>
I don't think that this is a good idea. Instead, when you want to unify
the behavior, we should use the vram_lost_counter as the marker for the
guilty context.
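
A minimal sketch of that check in cs_submit, assuming the context
snapshots the counter at creation time (both counter fields come from
the vram_lost_counter patch under discussion; the helper name is made
up):

    static int amdgpu_ctx_blocked(struct amdgpu_device *adev,
                                  struct amdgpu_ctx *ctx)
    {
            /* Created before the last VRAM loss: everything the context
             * references in VRAM is gone, so treat it as guilty without
             * iterating over all contexts at reset time. */
            if (ctx->vram_lost_counter !=
                atomic_read(&adev->vram_lost_counter))
                    return -ECANCELED;

            return 0;
    }
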
Regards,
Christian.
On 11.10.2017 at 10:48, Liu, Monk wrote:
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
> it's reasonable to use it. However, it /does not/ make sense to mark
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
> perfect example where the driver should report context lost to
> applications with the "innocent" flag for contexts that were idle at
> the time of reset. The only context(s) that should be reported as
> "guilty" (or perhaps "unknown" in some cases) are the ones that were
> executing at the time of reset.
>
> ML: KMD marks all contexts as guilty because that way we can unify
> our IOCTL behavior: e.g. the IOCTLs only block the “guilty” context,
> with no need to worry about vram_lost_counter anymore; that’s an
> implementation style. I don’t think it is related to the UMD layer.
>
> For UMD, the gl-context isn’t known by KMD, so UMD can implement its
> own “guilty” gl-context handling if you want.
>
> If KMD doesn’t mark all contexts as guilty after VRAM is lost, can
> you illustrate what rule KMD should obey for the check in a KMS IOCTL
> like cs_submit? Let’s see which way is better.
>
> *From:*Haehnle, Nicolai
> *Sent:* Wednesday, October 11, 2017 4:41 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
> Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>;
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley
> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
> Filipas, Mario <Mario.Filipas at amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> From a Mesa perspective, this almost all sounds reasonable to me.
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
> it's reasonable to use it. However, it /does not/ make sense to mark
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
> perfect example where the driver should report context lost to
> applications with the "innocent" flag for contexts that were idle at
> the time of reset. The only context(s) that should be reported as
> "guilty" (or perhaps "unknown" in some cases) are the ones that were
> executing at the time of reset.
>
> On whether the whole context is marked as guilty from a user space
> perspective, it would simply be nice for user space to get consistent
> answers. It would be a bit odd if we could e.g. succeed in submitting
> an SDMA job after a GFX job was rejected. This would point in favor of
> marking the entire context as guilty (although that could happen
> lazily instead of at reset time). On the other hand, if that's too big
> a burden for the kernel implementation I'm sure we can live without it.
>
> Cheers,
>
> Nicolai
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk
> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher,
> Alexander
> *Cc:* amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry
> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
> *Subject:* RE: TDR and VRAM lost handling in KMD:
>
> 2. Set its fence error status to “*ETIME*”.
>
> No, as I already explained ETIME is for synchronous operation.
>
> In other words when we return ETIME from the wait IOCTL it would mean
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well, or some other error code when we find
> that we need to distinguish the timed out jobs from the canceled ones
> (probably a good idea, but I'm not sure).
>
> [ML] I’m okay if you insist on not using ETIME.
>
> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”.
>
> Not sure. Do we want to set the whole context as guilty or just the
> entity?
>
> Setting the whole context as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool
> canceled;" ?
>
> [ML] I think context is better than entity, because for example if
> you only block entity_0 of a context and allow entity_N to run, that
> means the dependencies between entities are broken (e.g. page table
> updates in the SDMA entity pass but a GFX submit in the GFX entity is
> blocked, which makes no sense to me).
>
> We’d better either block the whole context or not…
>
> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set all
> their fence status to “*ECANCELED*”.
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> [ML] Without deep thought and experiment, I’m not sure about the
> difference between them, but kicking them out in the gpu_reset
> routine is more efficient.
>
> Otherwise you need to check the context/entity guilty flag in the
> run_job routine… and you need to do it for every context/entity. I
> don’t see why we don’t just kick all of them out in the gpu_reset
> stage…
>
> b) Iterate over all living contexts, and set all of them as
> “*guilty*”, since VRAM lost actually ruins all VRAM contents.
>
> No, that should rather be done by comparing the counters; iterating
> over all contexts is way too much overhead.
>
> [ML] Because I want to keep the KMS IOCTL rules clean: they don’t
> need to differentiate whether VRAM was lost or not, they are only
> interested in whether the context is guilty or not, and they block
> submission for the guilty ones.
>
> *Can you give more details of your idea? Even better, the detailed
> implementation in cs_submit; I want to see how you want to block
> submission without checking a context guilty flag.*
>
> c) Kick out all jobs in all ctxs’ KFIFO queues, and set all their
> fence status to “*ECANCELED*”.
>
> Yes and no, that should be done when we try to run the jobs and not
> during GPU reset.
>
> [ML] Again, kicking them out in the gpu reset routine is highly
> efficient; otherwise you need a check on every job in run_job().
>
> Besides, can you illustrate the detailed implementation?
>
> Yes and no, dma_fence_get_status() is some specific handling for
> sync_file debugging (no idea why that made it into the common fence code).
>
> It was replaced by putting the error code directly into the fence, so
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>
> [ML] Yeah, that’s too confusing; the name really sounds like the one
> I want to use, we should change it…
>
> *But looking into the implementation, I don’t see why we cannot use
> it? It also finally returns the fence->error.*
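>
> A minimal sketch of the wait-side check we are talking about;
> dma_fence_wait_timeout() and dma_fence_get_status() are the existing
> dma-fence interfaces, the surrounding IOCTL plumbing is omitted:
>
>     long r = dma_fence_wait_timeout(fence, true, timeout);
>     if (r > 0) {
>             /* Fence signaled: dma_fence_get_status() returns the
>              * negative fence->error (e.g. -ECANCELED) for a signaled
>              * fence, or 1 on plain success. */
>             int status = dma_fence_get_status(fence);
>             return status < 0 ? status : 0;
>     }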
>
> *From:*Koenig, Christian
> *Sent:* Wednesday, October 11, 2017 3:21 PM
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Haehnle,
> Nicolai <Nicolai.Haehnle at amd.com <mailto:Nicolai.Haehnle at amd.com>>;
> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>;
> Deucher, Alexander <Alexander.Deucher at amd.com
> <mailto:Alexander.Deucher at amd.com>>
> *Cc:* amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel
> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry (SW)
> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, Bingley
> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, Alejandro
> <Alejandro.Ramirez at amd.com <mailto:Alejandro.Ramirez at amd.com>>;
> Filipas, Mario <Mario.Filipas at amd.com <mailto:Mario.Filipas at amd.com>>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> See inline:
>
> On 11.10.2017 at 07:33, Liu, Monk wrote:
>
> Hi Christian & Nicolai,
>
> We need to reach some agreement on what MESA/UMD should do and what
> KMD should do. *Please give your comments with “okay” or “no” and
> your ideas on the items below:*
>
> - When a job times out (set from the lockup_timeout kernel
> parameter), what should KMD do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
> (*gpu_reset_counter* is used to force a VM flush after GPU reset;
> it is out of this thread’s scope, so no more discussion on it).
>
> Okay.
>
> 2. Set its fence error status to “*ETIME*”.
>
> No, as I already explained ETIME is for synchronous operation.
>
> In other words when we return ETIME from the wait IOCTL it would mean
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well, or some other error code when we find
> that we need to distinguish the timed out jobs from the canceled ones
> (probably a good idea, but I'm not sure).
>
> 3. Find the entity/ctx behind this job, and set this ctx as “*guilty*”.
>
> Not sure. Do we want to set the whole context as guilty or just the
> entity?
>
> Setting the whole context as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool
> canceled;" ?
>
> 4. Kick out this job from the scheduler’s mirror list, so this job
> won’t get re-scheduled to the ring anymore.
>
> Okay.
>
> 5. Kick out all jobs in this “guilty” ctx’s KFIFO queue, and set
> all their fence status to “*ECANCELED*”.
>
> Setting ECANCELED should be ok. But I think we should do this when we
> try to run the jobs and not during GPU reset.
>
> 6. Force-signal all fences that got kicked out by the above two
> steps, *otherwise UMD will block forever if waiting on those fences*.
>
> Okay.
>
> 7. Do the GPU reset, which can be some callbacks to let bare-metal
> and SR-IOV implement it in their favored style.
>
> Okay.
>
> 8. After reset, KMD needs to be aware of whether VRAM lost happened
> or not. Bare-metal can implement some function to judge that, while
> for SR-IOV I prefer to read it from the GIM side (for the initial
> version we consider it always VRAM lost, till the GIM side change
> is aligned).
>
> Okay.
>
> 9. If VRAM lost is not hit, continue; otherwise:
>
> a) Update adev->*vram_lost_counter*.
>
> Okay.
>
> b) Iterate over all living contexts, and set all of them as
> “*guilty*”, since VRAM lost actually ruins all VRAM contents.
>
> No, that should rather be done by comparing the counters; iterating
> over all contexts is way too much overhead.
>
> c) Kick out all jobs in all ctxs’ KFIFO queues, and set all their
> fence status to “*ECANCELED*”.
>
> Yes and no, that should be done when we try to run the jobs and not
> during GPU reset.
>
> 10. Do GTT recovery and VRAM page tables/entries recovery (optional,
> do we need it?).
>
> Yes, that is still needed. As Nicolai explained we can't be sure that
> VRAM is still 100% correct even when it isn't cleared.
>
> 11. Re-schedule all jobs remaining in the mirror list to the ring
> again and restart the scheduler (for the VRAM lost case, no job will
> be re-scheduled).
>
> Okay.
>
> - For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check with
> *dma_fence_get_status()* to see if there is an error there,
> and return the error status of the fence.
>
> Yes and no, dma_fence_get_status() is some specific handling for
> sync_file debugging (no idea why that made it into the common fence code).
>
> It was replaced by putting the error code directly into the fence, so
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>
> - For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> - For the cs_submit() IOCTL:
>
> It needs to check if the current ctx has been marked as “*guilty*”
> and return “*ECANCELED*” if so.
>
> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>
> This way, UMD can also block the app from submitting. Like @Nicolai
> mentioned, we can cache one copy of *vram_lost_counter* when
> enumerating the physical device, and deny all gl-contexts from
> submitting if the queried counter is bigger than the one cached in
> the physical device (looks a little overkill to me, but easy to
> implement).
>
> UMD can also return an error to the app when creating a gl-context
> if the currently queried *vram_lost_counter* is bigger than the one
> cached in the physical device.
>
> Okay. Already have a patch for this, please review that one if you
> haven't already done so.
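>
> A rough sketch of the UMD side of that query; the query id comes
> from the patch under review and may still change, drmCommandWrite()
> is the existing libdrm entry point, and the cached counter field is
> illustrative:
>
>     uint32_t vram_lost = 0;
>     struct drm_amdgpu_info request = {
>             .return_pointer = (uintptr_t)&vram_lost,
>             .return_size = sizeof(vram_lost),
>             .query = AMDGPU_INFO_VRAM_LOST_COUNTER, /* from the patch */
>     };
>
>     if (drmCommandWrite(fd, DRM_AMDGPU_INFO,
>                         &request, sizeof(request)) == 0 &&
>         vram_lost != dev->cached_vram_lost_counter)
>             device_lost = true;     /* deny further submissions */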
>
> Regards,
> Christian.
>
> BTW: I realized that a gl-context is a little different from the
> kernel’s context, because for the kernel a BO is not related to a
> context but only to an FD, while in UMD a BO has a backing
> gl-context. So blocking submission in the UMD layer is also needed,
> although KMD will do its job as the bottom line.
>
> - Basically, “vram_lost_counter” is exposed by the kernel to let UMD
> take control of the robustness extension feature. It will be UMD’s
> call to make; KMD only denies “guilty” contexts from submitting.
>
> Need your feedback, thx
>
> We’d better make the TDR feature land ASAP.
>
> BR Monk
>