[RFC] a new approach to detect which ring is the real black sheep when a TDR is reported

Andrey Grodzovsky andrey.grodzovsky at amd.com
Fri Feb 26 16:49:33 UTC 2021


On 2021-02-26 6:54 a.m., Liu, Monk wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> See in line
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com>
> *Sent:* Friday, February 26, 2021 3:58 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Cc:* Zhang, Andy <Andy.Zhang at amd.com>; Chen, Horace 
> <Horace.Chen at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>
> *Subject:* Re: [RFC] a new approach to detect which ring is the real 
> black sheep when a TDR is reported
>
> Hi Monk,
>
> in general an interesting idea, but I see two major problems with that:
>
> 1. It would make the reset take much longer.
>
> 2. Things often get stuck because of timing issues, so a guilty job 
> might pass perfectly when run a second time.
>
> [ML] But the innocent ring has already reported a TDR, and the drm sched 
> logic has already deleted that “sched_job” from its mirror list, so you 
> don’t get a chance to re-submit it again after the reset; that’s the 
> major problem here.
>

Just to confirm I understand correctly: Monk is describing a scenario where 
the second TDR, the one reported for the innocent job, bails out BEFORE it 
gets a chance to run drm_sched_stop for its scheduler, because the first 
TDR is still in progress and amdgpu_device_lock_adev therefore fails for 
the second one. drm_sched_stop is what would have reinserted the job into 
the mirror list, so the innocent job that drm_sched_job_timedout extracted 
from the mirror list is now lost.

If so, then as a possible quick fix until we overhaul the entire design as 
suggested in this thread: maybe we can modify the 
drm_sched_backend_ops.timedout_job callback to report back that it 
terminated prematurely, BEFORE drm_sched_stop had a chance to run, and in 
that case reinsert the job into the mirror list from within 
drm_sched_job_timedout. There is no race against a concurrent 
drm_sched_get_cleanup_job once we reinsert there, because we no longer 
reference the job pointer after that point, so if the job has already 
signaled and is freed right away, that's ok.
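
Something like the sketch below is what I have in mind (purely illustrative 
and untested, written against my rough recollection of 
drm_sched_job_timedout in sched_main.c, so field and list names may differ 
slightly; the status enum and the changed return type of timedout_job are 
made up here just to show the idea):

enum drm_sched_tdr_status {
	/* normal path, drm_sched_stop() ran for this scheduler */
	DRM_SCHED_TDR_HANDLED,
	/* driver bailed out before drm_sched_stop(), e.g. because
	 * amdgpu_device_lock_adev() failed since another reset is
	 * already in progress */
	DRM_SCHED_TDR_BAILED_EARLY,
};

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched;
	struct drm_sched_job *job;
	enum drm_sched_tdr_status status;

	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);

	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job) {
		/* Remove the job so a concurrent drm_sched_get_cleanup_job()
		 * cannot free it while the driver handles the timeout. */
		list_del_init(&job->node);
		spin_unlock(&sched->job_list_lock);

		/* Assumes timedout_job now returns the status above instead
		 * of void. */
		status = job->sched->ops->timedout_job(job);

		if (status == DRM_SCHED_TDR_BAILED_EARLY) {
			/* drm_sched_stop() never ran, so nothing will put
			 * this job back; reinsert it here. We never touch
			 * the job pointer again after this point, so if it
			 * is already signaled and gets freed right away,
			 * that is fine. */
			spin_lock(&sched->job_list_lock);
			list_add(&job->node, &sched->ring_mirror_list);
			spin_unlock(&sched->job_list_lock);
		}
	} else {
		spin_unlock(&sched->job_list_lock);
	}

	spin_lock(&sched->job_list_lock);
	drm_sched_start_timeout(sched);
	spin_unlock(&sched->job_list_lock);
}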

Andrey


>
> Apart from that, the whole ring mirror list turned out to be a really 
> bad idea. E.g. we still struggle with object lifetimes because the 
> concept doesn't fit into the object model of the GPU scheduler under 
> Linux.
>
> We should probably work on this separately and straighten up the job 
> destruction once more and keep the recovery information in the fence 
> instead.
>
> [ML] We promise our customers that no innocent process will be dropped 
> or cancelled. Our current logic works most of the time, but when 
> different processes are running on the gfx/compute rings we run into the 
> tricky situation I described here, and this proposal is the only way I 
> have figured out so far. Do you have a better solution or idea that we 
> could review as another candidate RFC? Please note that we raised this 
> proposal because we did hit this problem and we do need to resolve it 
> …. so even an imperfect solution is still better than just cancelling 
> the innocent job (and its context/process).
>
> Thanks!
>
>
> Regards,
> Christian.
>
> On 26.02.21 at 06:58, Liu, Monk wrote:
>
>     [AMD Public Use]
>
>     Hi all
>
>     The NAVI2X project has hit a really hard-to-solve issue, which
>     turned out to be a general headache for our TDR mechanism. Consider
>     the scenario below:
>
>      1. There is a job1 running on the compute1 ring
>      2. There is a job2 running on the gfx ring
>      3. Job1 is the guilty one, and job1/job2 were scheduled to their
>         rings at almost the same timestamp
>      4. After 2 seconds we receive two TDR reports, one from the GFX
>         ring and one from the compute ring
>      5. *The current scheme is that in the drm scheduler the head jobs
>         of both rings are considered “bad jobs” and taken away from the
>         mirror list*
>      6. The result is that both the real guilty job (job1) and the
>         innocent job (job2) were deleted from their mirror lists, and
>         their corresponding contexts were also treated as guilty *(so
>         keeping the innocent process running is not guaranteed)*
>
>
>     But ideally the TDR mechanism would detect which ring is the
>     guilty one, and the innocent ring would resubmit all of its pending
>     jobs:
>
>      1. Job1 is deleted from the compute1 ring’s mirror list
>      2. Job2 is kept and resubmitted later, and the process/context it
>         belongs to is not even aware of this TDR at all
>
>     Here is a proposal intended to achieve the above goal; the rough
>     procedure is:
>
>      1. Once any ring reports a TDR, the head job is **not** treated
>         as a “bad job”, and it is **not** deleted from the mirror list
>         in the drm sched functions
>      2. In the vendor’s function (our amdgpu driver here):
>
>           * reset the GPU
>           * repeat the following actions on each ring, *one by one*:
>
>             1. take the head job and submit it on this ring
>             2. see whether it completes; if not, this job is the real
>                “bad job”
>             3. take it away from the mirror list if this head job is a
>                “bad job”
>
>           * after the above iteration over all rings, all the bad
>             job(s) have been cleared
>
>      3. Resubmit all jobs from each mirror list to their corresponding
>         rings (this is the existing logic)
>
>     The idea is to re-run and re-check the head job of each ring in a
>     “serial” way, in order to pick out the real black sheep and its
>     guilty context.
>
>     P.S.: we can use this approach only when a GFX/KCQ ring reports a
>     TDR, since those rings affect each other. An SDMA ring timeout
>     definitely proves that the head job on the SDMA ring is really
>     guilty.
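
For clarity, here is a rough sketch of how steps 1-3 of the per-ring 
re-check above could look in the driver. This is a hypothetical helper, 
not actual amdgpu code, and it assumes all schedulers are already stopped 
so their mirror lists can be walked without racing against job cleanup:

/* After the ASIC reset, re-run only the head job of each affected ring
 * and wait for it; the job that still hangs is the real "bad job" and
 * only that one is dropped; everything else stays in its mirror list. */
static void amdgpu_tdr_find_guilty_sketch(struct drm_gpu_scheduler **scheds,
					  int num_scheds, long timeout)
{
	int i;

	for (i = 0; i < num_scheds; i++) {
		struct drm_gpu_scheduler *sched = scheds[i];
		struct drm_sched_job *head;
		struct dma_fence *fence;

		head = list_first_entry_or_null(&sched->ring_mirror_list,
						struct drm_sched_job, node);
		if (!head)
			continue;

		/* 1. take the head job and submit it on this ring */
		fence = sched->ops->run_job(head);

		/* 2. see whether it completes within the timeout */
		if (!fence ||
		    dma_fence_wait_timeout(fence, false, timeout) <= 0) {
			/* 3. this is the real "bad job": drop it (and only
			 * it) from the mirror list; the real code would
			 * also mark its context guilty and reset the ring
			 * once more */
			list_del_init(&head->node);
			sched->ops->free_job(head);
		}

		dma_fence_put(fence);
	}
}
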
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx