[RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Christian König christian.koenig at amd.com
Fri Feb 26 07:57:52 UTC 2021


Hi Monk,

in general an interesting idea, but I see two major problems with that:

1. It would make the reset take much longer.

2. Things often get stuck because of timing issues, so a guilty job 
might pass perfectly fine when run a second time.

Apart from that, the whole ring mirror list turned out to be a really bad 
idea. E.g. we still struggle with object lifetime because the concept 
doesn't fit into the object model of the GPU scheduler under Linux.

We should probably work on this separately, straighten up the job 
destruction once more, and keep the recovery information in the fence 
instead.
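
Just to make that direction concrete, a very rough sketch of what keeping 
the recovery information in the fence could look like. All names here 
(recovery_fence, payload, resubmit_count, guilty) are made up for 
illustration and are not existing scheduler or amdgpu structures:

#include <linux/dma-fence.h>
#include <linux/kernel.h>

/*
 * Illustration only: everything needed for recovery lives in a driver
 * private fence wrapper, so the lifetime is tied to the fence and not
 * to a job sitting on the ring mirror list.
 */
struct recovery_fence {
	struct dma_fence base;		/* normal dma_fence lifetime rules     */
	void *payload;			/* whatever is needed to resubmit      */
	unsigned int resubmit_count;	/* how often this work was retried     */
	bool guilty;			/* set once the work is proven to hang */
};

static inline struct recovery_fence *to_recovery_fence(struct dma_fence *f)
{
	return container_of(f, struct recovery_fence, base);
}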

Regards,
Christian.

On 26.02.21 at 06:58, Liu, Monk wrote:
>
> [AMD Public Use]
>
>
> Hi all
>
> The NAVI2X project has hit a really hard-to-solve issue, and it turned 
> out to be a general headache for our TDR mechanism; consider the 
> scenario below:
>
>  1. There is a job1 running on the compute1 ring at some timestamp
>  2. There is a job2 running on the gfx ring at some timestamp
>  3. Job1 is the guilty one, and job1/job2 were scheduled to their
>     rings at almost the same timestamp
>  4. After 2 seconds we receive two TDR reports, from both the GFX ring
>     and the compute ring
>  5. The current scheme is that in the drm scheduler the head jobs of
>     both rings are considered the “bad job” and taken away from the
>     mirror list
>  6. The result is that both the real guilty job (job1) and the innocent
>     job (job2) are deleted from their mirror lists, and their
>     corresponding contexts are also treated as guilty, so keeping the
>     innocent process running is not guaranteed (see the sketch right
>     after this list)
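>
> To make it concrete, here is a rough sketch of how that per-ring timeout
> path behaves today; the types and names below (ring_sched, sched_job, …)
> are simplified stand-ins, not the actual drm scheduler code:
>
> #include <linux/kernel.h>
> #include <linux/list.h>
> #include <linux/spinlock.h>
> #include <linux/types.h>
>
> struct sched_ctx {
>         bool guilty;                    /* context flagged after a hang  */
> };
>
> struct sched_job {
>         struct list_head node;          /* entry on the ring mirror list */
>         struct sched_ctx *ctx;          /* the context that submitted it */
> };
>
> struct ring_sched {
>         spinlock_t job_list_lock;
>         struct list_head mirror_list;
> };
>
> /*
>  * Each ring that fires a TDR independently blames its own head job,
>  * so in the scenario above job2 on the gfx ring is punished as well.
>  */
> static void ring_timedout(struct ring_sched *sched)
> {
>         struct sched_job *head;
>
>         spin_lock(&sched->job_list_lock);
>         head = list_first_entry_or_null(&sched->mirror_list,
>                                         struct sched_job, node);
>         if (head) {
>                 list_del_init(&head->node);  /* taken away from the mirror list  */
>                 head->ctx->guilty = true;    /* its context is treated as guilty */
>         }
>         spin_unlock(&sched->job_list_lock);
>
>         /* ...followed by GPU reset and resubmission of what is left... */
> }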
>
> But ideally the TDR mechanism would detect which ring is the guilty one, 
> so that the innocent ring can resubmit all its pending jobs:
>
>  1. Job1 is deleted from the compute1 ring’s mirror list
>  2. Job2 is kept and resubmitted later, and its owning
>     process/context is not even aware of this TDR at all
>
> Here is a proposal that tries to achieve the above goal; its rough 
> procedure is:
>
>  1. Once any ring reports a TDR, the head job is *not* treated as the
>     “bad job”, and it is *not* deleted from the mirror list in the drm
>     sched functions
>  2. In the vendor’s function (our amdgpu driver here):
>       * reset the GPU
>       * repeat the actions below on each ring, one by one:
>
>         1. take the head job and submit it on this ring
>
>         2. see if it completes; if not, this job is the real “bad job”
>
>         3. take it away from the mirror list if this head job is the “bad job”
>
>       * after the above iteration over all rings, all the bad job(s)
>         have been cleared
>  3. Resubmit all jobs from each mirror list to their corresponding
>     rings (this is the existing logic)
>
> The idea is to re-run and re-check the head job of each ring in a 
> “serial” way, in order to pick out the real black sheep and its guilty 
> context.
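>
> In rough C-like terms, reusing the made-up ring_sched/sched_job types from
> the sketch above and a few invented helpers (reset_gpu(), peek_head_job(),
> job_completed(), …) purely for illustration, the flow would be:
>
> /* Invented helper prototypes, only so the sketch is self-contained. */
> void reset_gpu(void);
> struct sched_job *peek_head_job(struct ring_sched *sched);
> void resubmit_job(struct ring_sched *sched, struct sched_job *job);
> bool job_completed(struct sched_job *job, unsigned int timeout_ms);
> void remove_from_mirror_list(struct ring_sched *sched, struct sched_job *job);
> void resubmit_all_pending(struct ring_sched *sched);
>
> #define TDR_TIMEOUT_MS  2000                    /* the 2 seconds from the example */
>
> static void serial_recheck_rings(struct ring_sched **rings, int count)
> {
>         int i;
>
>         reset_gpu();                            /* step 2: reset the GPU first    */
>
>         for (i = 0; i < count; i++) {           /* each ring, one by one          */
>                 struct sched_job *head = peek_head_job(rings[i]);
>
>                 if (!head)
>                         continue;
>
>                 resubmit_job(rings[i], head);   /* re-run only the head job       */
>
>                 if (!job_completed(head, TDR_TIMEOUT_MS)) {
>                         /* it hangs again: this is the real "bad job" */
>                         remove_from_mirror_list(rings[i], head);
>                         head->ctx->guilty = true;
>                         reset_gpu();            /* clean up before the next ring  */
>                 }
>         }
>
>         for (i = 0; i < count; i++)             /* step 3: existing resubmit logic */
>                 resubmit_all_pending(rings[i]);
> }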
>
> P.S.: we can use this approach only when a GFX/KCQ ring reports a TDR, 
> since those rings affect each other. For an SDMA ring timeout it is 
> certain that the head job on the SDMA ring is really the guilty one.
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
