[RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Fri Feb 26 12:05:08 UTC 2021

Yeah that is exactly the stuff which doesn't works at all. We got 
feedback for multiple people that this whole approach of tying the job 
to the tdr was not a good idea at all.

What we should do instead is to have a pointer in the scheduler fence to 
which job it belongs. Freeing up the job when the scheduler fence is 
signaled is then job of the driver and not the scheduler any more.

The scheduler then gives the scheduler fence to the driver when a 
timeout is detected and the driver can do the rest of the handling all 
by itself.

But this problem is orthogonal to the suggested solution here.

> do you have a better solution or idea we review it as another 
> candidate RFC ?

I don't see much other option either. We could do something like only 
allowing one application at a time to use the gfx/compute block, but 
that would be even worse.

If we do this we should probably make it configurable as a module parameter.

Regards,
Christian.

Am 26.02.21 um 12:57 schrieb Liu, Monk:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> static void drm_sched_job_timedout(struct work_struct *work)
>
> 279 {
>
> 280     struct drm_gpu_scheduler *sched;
>
> 281     struct drm_sched_job *job;
>
> 282
>
> 283     sched = container_of(work, struct drm_gpu_scheduler, 
> work_tdr.work);
>
> 284
>
> 285     /* Protects against concurrent deletion in 
> drm_sched_get_cleanup_job */
>
> *286 spin_lock(&sched->job_list_lock);*
>
> *287     job = list_first_entry_or_null(&sched->ring_mirror_list,*
>
> *288                        struct drm_sched_job, node);*
>
> *289*
>
> *290     if (job) {*
>
> *291         /**
>
> *292          * Remove the bad job so it cannot be freed by concurrent*
>
> *293          * drm_sched_cleanup_jobs. It will be reinserted back 
> after sched->thread*
>
> *294          * is parked at which point it's safe.*
>
> *295          */*
>
> *296 list_del_init(&job->node);*
>
> *297 spin_unlock(&sched->job_list_lock);*
>
> *298*
>
> *299 job->sched->ops->timedout_job(job);*
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
> *From:* Liu, Monk
> *Sent:* Friday, February 26, 2021 7:54 PM
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; 
> amd-gfx at lists.freedesktop.org
> *Cc:* Zhang, Andy <Andy.Zhang at amd.com>; Chen, Horace 
> <Horace.Chen at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>
> *Subject:* RE: [RFC] a new approach to detect which ring is the real 
> black sheep upon TDR reported
>
> [AMD Official Use Only - Internal Distribution Only]
>
> See in line
>
> Thanks
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
> *From:* Koenig, Christian <Christian.Koenig at amd.com 
> <mailto:Christian.Koenig at amd.com>>
> *Sent:* Friday, February 26, 2021 3:58 PM
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; 
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Cc:* Zhang, Andy <Andy.Zhang at amd.com <mailto:Andy.Zhang at amd.com>>; 
> Chen, Horace <Horace.Chen at amd.com <mailto:Horace.Chen at amd.com>>; 
> Zhang, Jack (Jian) <Jack.Zhang1 at amd.com <mailto:Jack.Zhang1 at amd.com>>
> *Subject:* Re: [RFC] a new approach to detect which ring is the real 
> black sheep upon TDR reported
>
> Hi Monk,
>
> in general an interesting idea, but I see two major problems with that:
>
> 1. It would make the reset take much longer.
>
> 2. Things get often stuck because of timing issues, so a guilty job 
> might pass perfectly when run a second time.
>
> [ML] but the innocent ring already reported a TDR, and the drm sched 
> logic already deleted this “sched_job” in its mirror list, thus you 
> don’t have chance to re-submit it again after reset, that’s the major 
> problem here.
>
> Apart from that the whole ring mirror list turned out to be a really 
> bad idea. E.g. we still struggle with object life time because the 
> concept doesn't fit into the object model of the GPU scheduler under 
> Linux.
>
> We should probably work on this separately and straighten up the job 
> destruction once more and keep the recovery information in the fence 
> instead.
>
> [ML] we claim to our customer that no innocent process will be dropped 
> or cancelled, and our current logic works for the most time, but only 
> when there are different process running on gfx/computes rings then we 
> would run into the tricky situation I stated here, and the proposal is 
> the only way I can figure out so far, do you have a better solution or 
> idea we review it as another candidate RFC ? Be note that we raised 
> this proposal is because we do hit our trouble and we do need to 
> resolve it …. So even a not perfect solution is still better than just 
> cancel the innocent job (and their context/process)
>
> Thanks !
>
>
> Regards,
> Christian.
>
> Am 26.02.21 um 06:58 schrieb Liu, Monk:
>
>     [AMD Public Use]
>
>     Hi all
>
>     NAVI2X  project hit a really hard to solve issue now, and it is
>     turned out to be a general headache of our TDR mechanism , check
>     below scenario:
>
>      1. There is a job1 running on compute1 ring at timestamp
>      2. There is a job2 running on gfx ring at timestamp
>      3. Job1 is the guilty one, and job1/job2 were scheduled to their
>         rings at almost the same timestamp
>      4. After 2 seconds we receive two TDR reporting from both GFX
>         ring and compute ring
>      5. *Current scheme is that in drm scheduler all the head jobs of
>         those two rings are considered “bad job” and taken away from
>         the mirror list *
>      6. The result is both the real guilty job (job1) and the innocent
>         job (job2) were all deleted from mirror list, and their
>         corresponding contexts were also treated as guilty*(so the
>         innocent process remains running is not secured)*
>
>     **
>
>     But by our wish the ideal case is TDR mechanism can detect which
>     ring is the guilty ring and the innocent ring can resubmits all
>     its pending jobs:
>
>      1. Job1 to be deleted from compute1 ring’s mirror list
>      2. Job2 is kept and resubmitted later and its belonging
>         process/context are even not aware of this TDR at all
>
>     Here I have a proposal tend to achieve above goal and it rough
>     procedure is :
>
>      1. Once any ring reports a TDR, the head job is **not** treated
>         as “bad job”, and it is **not** deleted from the mirror list
>         in drm sched functions
>      2. In vendor’s function (our amdgpu driver here):
>
>           * reset GPU
>           * repeat below actions on each RINGS * one by one *:
>
>     1.take the head job and submit it on this ring
>
>     2.see if it completes, if not then this job is the real “bad job”
>
>     3. take it away from mirror list if this head job is “bad job”
>
>           * After above iteration on all RINGS, we already clears all
>             the bad job(s)
>
>      3. Resubmit all jobs from each mirror list to their corresponding
>         rings (this is the existed logic)
>
>     The idea of this is to use “serial” way to re-run and re-check
>     each head job of each RING, in order to take out the real black
>     sheep and its guilty context.
>
>     P.S.: we can use this approaches only on GFX/KCQ ring reports TDR
>     , since those rings are intermutually affected to each other. For
>     SDMA ring timeout it definitely proves the head job on SDMA ring
>     is really guilty.
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20210226/675c99d2/attachment-0001.htm>