Re: 回复: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Sat Feb 27 16:07:43 UTC 2021

Am 27.02.21 um 04:50 schrieb Liu, Monk:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> the code I pasted is to illustrate why the innocent job is already 
> taken out in the mirror list thus my suggested proposal won’t work 
> unless we don’t delete the job in sched_job_timeout() routine, and the 
> problem you stated is with my understanding also kind of related with 
> my suggested solution – the job removing from list should be handled 
> by driver instead of scheduler .
>

Yes, exactly that's my thinking as well.

> let make scheduler’s duty clear and simple : the sched_job_timeout() 
> only get notification when a sched_job timedout but it doesn’t judge 
>  if the leading job in mirror list should be blamed , all those 
> checking should be left to driver to take action.
>

Need to get a detailed look, but it sounds correct as well.

> >>If we do this we should probably make it configurable as a module 
> parameter.
>
> That’s ok,  maybe we can reuse the existed parm “gpu_recovery”, extend 
> it with:
>
> l0 – no recovery initiated after job timeout
>
> l1 – legacy TDR behave
>
> l2 – enhanced TDR behave (the one suggested here)
>

Yes, something like that should work. Key point is we had a couple of 
people who already suggested to optimize the reset routine so that it 
doesn't take so long.

So far I pushed back on this because the reset routine isn't something I 
would optimize for speed. But when it starts to take something like 10 
seconds instead of halve a second because you had an extra long running 
compute job we will certainly see complains.

Regards,
Christian.

> *发件人:*Koenig, Christian <Christian.Koenig at amd.com>
> *发送时间:*2021年2月26日20:05
> *收件人:*Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *抄送:*Zhang, Andy <Andy.Zhang at amd.com>; Chen, Horace 
> <Horace.Chen at amd.com>; Zhang, Jack (Jian) <Jack.Zhang1 at amd.com>
> *主题:*Re: [RFC] a new approach to detect which ring is the real black 
> sheep upon TDR reported
>
> Yeah that is exactly the stuff which doesn't works at all. We got 
> feedback for multiple people that this whole approach of tying the job 
> to the tdr was not a good idea at all.
>
> What we should do instead is to have a pointer in the scheduler fence 
> to which job it belongs. Freeing up the job when the scheduler fence 
> is signaled is then job of the driver and not the scheduler any more.
>
> The scheduler then gives the scheduler fence to the driver when a 
> timeout is detected and the driver can do the rest of the handling all 
> by itself.
>
> But this problem is orthogonal to the suggested solution here.
>
>
>     do you have a better solution or idea we review it as another
>     candidate RFC ?
>
>
> I don't see much other option either. We could do something like only 
> allowing one application at a time to use the gfx/compute block, but 
> that would be even worse.
>
> If we do this we should probably make it configurable as a module 
> parameter.
>
> Regards,
> Christian.
>
> Am 26.02.21 um 12:57 schrieb Liu, Monk:
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     static void drm_sched_job_timedout(struct work_struct *work)
>
>     279 {
>
>     280     struct drm_gpu_scheduler *sched;
>
>     281     struct drm_sched_job *job;
>
>     282
>
>     283     sched = container_of(work, struct drm_gpu_scheduler,
>     work_tdr.work);
>
>     284
>
>     285     /* Protects against concurrent deletion in
>     drm_sched_get_cleanup_job */
>
>     *286 spin_lock(&sched->job_list_lock);*
>
>     *287     job = list_first_entry_or_null(&sched->ring_mirror_list,*
>
>     *288 struct drm_sched_job, node);*
>
>     *289*
>
>     *290     if (job) {*
>
>     *291         /**
>
>     *292          * Remove the bad job so it cannot be freed by
>     concurrent*
>
>     *293          * drm_sched_cleanup_jobs. It will be reinserted back
>     after sched->thread*
>
>     *294          * is parked at which point it's safe.*
>
>     *295          */*
>
>     *296 list_del_init(&job->node);*
>
>     *297 spin_unlock(&sched->job_list_lock);*
>
>     *298*
>
>     *299 job->sched->ops->timedout_job(job);*
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>     *From:*Liu, Monk
>     *Sent:* Friday, February 26, 2021 7:54 PM
>     *To:* Koenig, Christian <Christian.Koenig at amd.com>
>     <mailto:Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>
>     *Cc:* Zhang, Andy <Andy.Zhang at amd.com>
>     <mailto:Andy.Zhang at amd.com>; Chen, Horace <Horace.Chen at amd.com>
>     <mailto:Horace.Chen at amd.com>; Zhang, Jack (Jian)
>     <Jack.Zhang1 at amd.com> <mailto:Jack.Zhang1 at amd.com>
>     *Subject:* RE: [RFC] a new approach to detect which ring is the
>     real black sheep upon TDR reported
>
>     [AMD Official Use Only - Internal Distribution Only]
>
>     See in line
>
>     Thanks
>
>     ------------------------------------------
>
>     Monk Liu | Cloud-GPU Core team
>
>     ------------------------------------------
>
>     *From:*Koenig, Christian <Christian.Koenig at amd.com
>     <mailto:Christian.Koenig at amd.com>>
>     *Sent:* Friday, February 26, 2021 3:58 PM
>     *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>;
>     amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>     *Cc:* Zhang, Andy <Andy.Zhang at amd.com
>     <mailto:Andy.Zhang at amd.com>>; Chen, Horace <Horace.Chen at amd.com
>     <mailto:Horace.Chen at amd.com>>; Zhang, Jack (Jian)
>     <Jack.Zhang1 at amd.com <mailto:Jack.Zhang1 at amd.com>>
>     *Subject:* Re: [RFC] a new approach to detect which ring is the
>     real black sheep upon TDR reported
>
>     Hi Monk,
>
>     in general an interesting idea, but I see two major problems with
>     that:
>
>     1. It would make the reset take much longer.
>
>     2. Things get often stuck because of timing issues, so a guilty
>     job might pass perfectly when run a second time.
>
>     [ML] but the innocent ring already reported a TDR, and the drm
>     sched logic already deleted this “sched_job” in its mirror list,
>     thus you don’t have chance to re-submit it again after reset,
>     that’s the major problem here.
>
>     Apart from that the whole ring mirror list turned out to be a
>     really bad idea. E.g. we still struggle with object life time
>     because the concept doesn't fit into the object model of the GPU
>     scheduler under Linux.
>
>     We should probably work on this separately and straighten up the
>     job destruction once more and keep the recovery information in the
>     fence instead.
>
>     [ML] we claim to our customer that no innocent process will be
>     dropped or cancelled, and our current logic works for the most
>     time, but only when there are different process running on
>     gfx/computes rings then we would run into the tricky situation I
>     stated here, and the proposal is the only way I can figure out so
>     far, do you have a better solution or idea we review it as another
>     candidate RFC ? Be note that we raised this proposal is because we
>     do hit our trouble and we do need to resolve it …. So even a not
>     perfect solution is still better than just cancel the innocent job
>     (and their context/process)
>
>     Thanks !
>
>
>     Regards,
>     Christian.
>
>     Am 26.02.21 um 06:58 schrieb Liu, Monk:
>
>         [AMD Public Use]
>
>         Hi all
>
>         NAVI2X  project hit a really hard to solve issue now, and it
>         is turned out to be a general headache of our TDR mechanism ,
>         check below scenario:
>
>          1. There is a job1 running on compute1 ring at timestamp
>          2. There is a job2 running on gfx ring at timestamp
>          3. Job1 is the guilty one, and job1/job2 were scheduled to
>             their rings at almost the same timestamp
>          4. After 2 seconds we receive two TDR reporting from both GFX
>             ring and compute ring
>          5. *Current scheme is that in drm scheduler all the head jobs
>             of those two rings are considered “bad job” and taken away
>             from the mirror list *
>          6. The result is both the real guilty job (job1) and the
>             innocent job (job2) were all deleted from mirror list, and
>             their corresponding contexts were also treated as
>             guilty*(so the innocent process remains running is not
>             secured)*
>
>         **
>
>         But by our wish the ideal case is TDR mechanism can detect
>         which ring is the guilty ring and the innocent ring can
>         resubmits all its pending jobs:
>
>          1. Job1 to be deleted from compute1 ring’s mirror list
>          2. Job2 is kept and resubmitted later and its belonging
>             process/context are even not aware of this TDR at all
>
>         Here I have a proposal tend to achieve above goal and it rough
>         procedure is :
>
>          1. Once any ring reports a TDR, the head job is **not**
>             treated as “bad job”, and it is **not** deleted from the
>             mirror list in drm sched functions
>          2. In vendor’s function (our amdgpu driver here):
>
>               * reset GPU
>               * repeat below actions on each RINGS * one by one *:
>
>         1.take the head job and submit it on this ring
>
>         2.see if it completes, if not then this job is the real “bad job”
>
>         3. take it away from mirror list if this head job is “bad job”
>
>               * After above iteration on all RINGS, we already clears
>                 all the bad job(s)
>
>          3. Resubmit all jobs from each mirror list to their
>             corresponding rings (this is the existed logic)
>
>         The idea of this is to use “serial” way to re-run and re-check
>         each head job of each RING, in order to take out the real
>         black sheep and its guilty context.
>
>         P.S.: we can use this approaches only on GFX/KCQ ring reports
>         TDR , since those rings are intermutually affected to each
>         other. For SDMA ring timeout it definitely proves the head job
>         on SDMA ring is really guilty.
>
>         Thanks
>
>         ------------------------------------------
>
>         Monk Liu | Cloud-GPU Core team
>
>         ------------------------------------------
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20210227/6f5217c2/attachment-0001.htm>