Re: Reply: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Sun Feb 28 00:55:18 UTC 2021
On 2021-02-26 10:56 p.m., Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> Hi Andrey,
>
> The scenario I hit here is not the one you mentioned; let me explain
> it in more detail with another, much easier to understand example:
>
> Consider this: you have job1 on the KCQ, but the timeout of the KCQ
> is 60 seconds (just as an example).
>
> You also have job2 on the GFX ring, and the timeout of the GFX ring
> is 2 seconds.
>
> We submit job1 first, and assume job1 has a bug that will cause a
> shader hang very soon after it starts.
>
> After 10 seconds we submit job2. Since the KCQ has 60 seconds before
> it reports a TDR, software knows nothing about the engine already
> being hung.
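>
> Just to illustrate how such a mismatch arises: each ring's scheduler
> is created with its own timeout, roughly like the sketch below (this
> is not the exact amdgpu code; kcq_ring, gfx_ring, ops, hw_submission
> and hang_limit are placeholders, and the values are just the ones
> from this example):
>
>     int r;
>
>     /* compute (KCQ) ring: 60 second TDR timeout */
>     r = drm_sched_init(&kcq_ring->sched, &ops, hw_submission,
>                        hang_limit, msecs_to_jiffies(60 * 1000),
>                        "kcq");
>     if (r)
>         return r;
>
>     /* gfx ring: 2 second TDR timeout */
>     r = drm_sched_init(&gfx_ring->sched, &ops, hw_submission,
>                        hang_limit, msecs_to_jiffies(2 * 1000),
>                        "gfx");
>     if (r)
>         return r;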
>
So the gfx job hangs because it has a dependency on the buggy compute
job which is already hanging?
> After 2 seconds we get a TDR report from job2 on the GFX ring;
> drm_sched_job_timedout() thinks the leading job of the GFX ring is
> the black sheep, so it is deleted from the mirror list.
>
> But in fact this GFX job (job2) is innocent and we should insert it
> back after recovery; because it was already deleted, this innocent
> job's context/process is really harmed.
>
I am still missing something - we never delete bad jobs, or any jobs,
until they are signaled; we reinsert the bad job back into the mirror
list in drm_sched_stop
(here -
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385)
after the scheduler thread is stopped, and then continue with the
reset procedure.
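
Roughly, the relevant part of drm_sched_stop() in v5.11 looks like the
trimmed excerpt below (paraphrased from the link above, shortened here
for illustration):

    kthread_park(sched->thread);

    /*
     * Reinsert the bad job here - it is safe now because
     * drm_sched_get_cleanup_job cannot race against us and release
     * the bad job at this point: we parked (waited for) any
     * in-progress cleanups, and drm_sched_get_cleanup_job will not
     * be called again until the scheduler thread is unparked.
     */
    if (bad && bad->sched == sched)
        /*
         * Add at the head of the queue to reflect it was the
         * earliest job extracted.
         */
        list_add(&bad->node, &sched->ring_mirror_list);
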
Andrey
> Hope the above example helps
>
> Thanks
>
> *From:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* February 27, 2021 0:50
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* Zhang, Andy <Andy.Zhang at amd.com>; Chen, Horace
> <Horace.Chen at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>
> *Subject:* Re: [RFC] a new approach to detect which ring is the real
> black sheep upon TDR reported
>
> On 2021-02-26 6:54 a.m., Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> See inline
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Friday, February 26, 2021 3:58 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* Zhang, Andy <Andy.Zhang at amd.com>; Chen, Horace
> <Horace.Chen at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>
> *Subject:* Re: [RFC] a new approach to detect which ring is the
> real black sheep upon TDR reported
>
> Hi Monk,
>
> in general an interesting idea, but I see two major problems with
> that:
>
> 1. It would make the reset take much longer.
>
> 2. Things often get stuck because of timing issues, so a guilty
> job might pass perfectly when run a second time.
>
> [ML] But the innocent ring already reported a TDR, and the drm
> sched logic already deleted this "sched_job" from its mirror list,
> so you don't have a chance to re-submit it again after reset;
> that's the major problem here.
>
> Just to confirm I understand correctly: Monk reports a scenario where
> the second TDR, reported for the innocent job, bails out BEFORE
> having a chance to run drm_sched_stop for that scheduler, which would
> have reinserted the job back into the mirror list (because the first
> TDR run is still in progress and hence amdgpu_device_lock_adev fails
> for the second TDR), and so the innocent job that was extracted from
> the mirror list in drm_sched_job_timedout is now lost.
> If so, and as a possible quick fix until we overhaul the entire
> design as suggested in this thread - maybe we can modify the
> drm_sched_backend_ops.timedout_job callback to report back premature
> termination BEFORE drm_sched_stop had a chance to run, and then
> reinsert the job back into the mirror list from within
> drm_sched_job_timedout. There is no problem of racing against a
> concurrent drm_sched_get_cleanup_job once we reinsert there, since we
> don't reference the job pointer anymore after that point, so if it's
> already signaled and freed right away - that's ok.
>
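> Something along these lines inside drm_sched_job_timedout (a rough
> sketch only; today the callback returns void, so the return value
> and the BAILED_EARLY name are made up to illustrate the idea):
>
>     list_del_init(&job->node);
>     spin_unlock(&sched->job_list_lock);
>
>     if (job->sched->ops->timedout_job(job) == BAILED_EARLY) {
>         /*
>          * The driver bailed out before drm_sched_stop ran, so
>          * nobody reinserted the job - put it back ourselves so it
>          * is not lost. Racing with drm_sched_get_cleanup_job is
>          * fine since we don't touch the job pointer after this
>          * point.
>          */
>         spin_lock(&sched->job_list_lock);
>         list_add(&job->node, &sched->ring_mirror_list);
>         spin_unlock(&sched->job_list_lock);
>     }
>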
> Andrey
>
>
> Apart from that, the whole ring mirror list turned out to be a
> really bad idea. E.g. we still struggle with object lifetime
> because the concept doesn't fit into the object model of the GPU
> scheduler under Linux.
>
> We should probably work on this separately and straighten up the
> job destruction once more and keep the recovery information in the
> fence instead.
>
> [ML] We claim to our customers that no innocent process will be
> dropped or cancelled, and our current logic works most of the
> time; but when different processes are running on the gfx/compute
> rings we run into the tricky situation I described here, and the
> proposal is the only way I can figure out so far. Do you have a
> better solution or idea? If so, we can review it as another
> candidate RFC. Note that we raised this proposal because we do hit
> this trouble and we do need to resolve it …. So even an imperfect
> solution is still better than just cancelling the innocent job
> (and its context/process).
>
> Thanks!
>
>
> Regards,
> Christian.
>
> On 26.02.21 at 06:58, Liu, Monk wrote:
>
> [AMD Public Use]
>
> Hi all
>
> The NAVI2X project has hit a really hard-to-solve issue, and it
> turns out to be a general headache of our TDR mechanism. Check
> the scenario below:
>
> 1. There is a job1 running on the compute1 ring.
> 2. There is a job2 running on the gfx ring.
> 3. Job1 is the guilty one, and job1/job2 were scheduled to
> their rings at almost the same timestamp.
> 4. After 2 seconds we receive two TDR reports, from both the GFX
> ring and the compute ring.
> 5. *The current scheme is that in the drm scheduler the head jobs
> of those two rings are both considered "bad jobs" and taken away
> from the mirror list.*
> 6. The result is that both the really guilty job (job1) and the
> innocent job (job2) are deleted from the mirror list, and
> their corresponding contexts are also treated as
> guilty *(so keeping the innocent process running is not
> guaranteed)*.
>
> But ideally the TDR mechanism would detect which ring is the
> guilty one, and the innocent ring would resubmit all of its
> pending jobs:
>
> 1. Job1 is deleted from the compute1 ring's mirror list.
> 2. Job2 is kept and resubmitted later, and its owning
> process/context is not even aware of this TDR at all.
>
> Here I have a proposal intended to achieve the above goal; its
> rough procedure is:
>
> 1. Once any ring reports a TDR, the head job is **not**
> treated as a "bad job", and it is **not** deleted from the
> mirror list in the drm sched functions.
> 2. In the vendor's function (our amdgpu driver here):
>
> * reset the GPU
> * repeat the actions below on each ring, *one by one*:
>
> 1. take the head job and submit it on this ring
>
> 2. see if it completes; if not, then this job is the real "bad job"
>
> 3. take it away from the mirror list if this head job is a "bad job"
>
> * after the above iteration over all rings, we have already
> cleared all the bad job(s)
>
> 3. Resubmit all jobs from each mirror list to their
> corresponding rings (this is the existing logic).
>
> The idea is to re-run and re-check the head job of each ring in a
> serial way, in order to pick out the real black sheep and its
> guilty context.
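>
> In pseudo-C the vendor-side part would look roughly like the sketch
> below (everything here - for_each_ring, reset_gpu, peek_head_job,
> submit_job, wait_job_completed, remove_from_mirror_list,
> mark_context_guilty, resubmit_all_pending_jobs, adev, ring, timeout -
> is an illustrative placeholder name, not real amdgpu API):
>
>     reset_gpu(adev);
>
>     /* serially re-run and re-check the head job of every ring */
>     for_each_ring(adev, ring) {
>         struct drm_sched_job *head = peek_head_job(&ring->sched);
>
>         if (!head)
>             continue;
>
>         submit_job(ring, head);
>         if (!wait_job_completed(ring, head, timeout)) {
>             /* it did not complete: this is the real bad job */
>             remove_from_mirror_list(&ring->sched, head);
>             mark_context_guilty(head);
>         }
>     }
>
>     /* existing logic: resubmit everything still on the mirror lists */
>     resubmit_all_pending_jobs(adev);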
>
> P.S.: we only need this approach when a GFX/KCQ ring reports a
> TDR, since those rings can affect each other. An SDMA ring timeout
> definitely proves that the head job on the SDMA ring is really
> guilty.
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>
>
> _______________________________________________
>
> amd-gfx mailing list
>
> amd-gfx at lists.freedesktop.org
>
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>