<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div dir="auto"> <div><br> <div class="gmail_extra"><br> <div class="gmail_quote">Am 04.10.2019 18:02 schrieb Steven Price <steven.price@arm.com>:<br type="attribution"> <blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div><font size="2"><span style="font-size:11pt"> <div>On 04/10/2019 16:34, Koenig, Christian wrote:<br> > Am 04.10.19 um 17:27 schrieb Steven Price:<br> >> On 04/10/2019 16:03, Neil Armstrong wrote:<br> >>> On 04/10/2019 16:53, Grodzovsky, Andrey wrote:<br> >>>> On 10/3/19 4:34 AM, Neil Armstrong wrote:<br> >>>>> Hi Andrey,<br> >>>>><br> >>>>> Le 02/10/2019 à 16:40, Grodzovsky, Andrey a écrit :<br> >>>>>> On 9/30/19 10:52 AM, Hillf Danton wrote:<br> >>>>>>> On Mon, 30 Sep 2019 11:17:45 +0200 Neil Armstrong wrote:<br> >>>>>>>> Did a new run from 5.3:<br> >>>>>>>><br> >>>>>>>> [ 35.971972] Call trace:<br> >>>>>>>> [ 35.974391] drm_sched_increase_karma+0x5c/0xf0<br> >>>>>>>> ffff000010667f38 FFFF000010667F94<br> >>>>>>>> drivers/gpu/drm/scheduler/sched_main.c:335<br> >>>>>>>><br> >>>>>>>> The crashing line is :<br> >>>>>>>> if (bad->s_fence->scheduled.context ==<br> >>>>>>>> entity->fence_context) {<br> >>>>>>>><br> >>>>>>>> Doesn't seem related to guilty job.<br> >>>>>>> Bail out if s_fence is no longer fresh.<br> >>>>>>><br> >>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c<br> >>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c<br> >>>>>>> @@ -333,6 +333,10 @@ void drm_sched_increase_karma(struct drm<br> >>>>>>> <br> >>>>>>> spin_lock(&rq->lock);<br> >>>>>>> list_for_each_entry_safe(entity, tmp, &rq->entities, list) {<br> >>>>>>> + if (!smp_load_acquire(&bad->s_fence)) {<br> >>>>>>> + spin_unlock(&rq->lock);<br> >>>>>>> + return;<br> >>>>>>> + }<br> >>>>>>> if (bad->s_fence->scheduled.context ==<br> >>>>>>> entity->fence_context) {<br> >>>>>>> if (atomic_read(&bad->karma) ><br> >>>>>>> @@ -543,7 +547,7 @@ EXPORT_SYMBOL(drm_sched_job_init);<br> >>>>>>> void drm_sched_job_cleanup(struct drm_sched_job *job)<br> >>>>>>> {<br> >>>>>>> dma_fence_put(&job->s_fence->finished);<br> >>>>>>> - job->s_fence = NULL;<br> >>>>>>> + smp_store_release(&job->s_fence, 0);<br> >>>>>>> }<br> >>>>>>> EXPORT_SYMBOL(drm_sched_job_cleanup);<br> >>>>> This fixed the problem on the 10 CI runs.<br> >>>>><br> >>>>> Neil<br> >>>><br> >>>> These are good news but I still fail to see how this fixes the problem -<br> >>>> Hillf, do you mind explaining how you came up with this particular fix -<br> >>>> what was the bug you saw ?<br> >>> As Steven explained, seems the same job was submitted on both HW slots,<br> >>> and then when timeout occurs each thread calls panfrost_job_timedout<br> >>> which leads to drm_sched_stop() on the first call and on the<br> >>> second call the job was already freed.<br> >>><br> >>> Steven proposed a working fix, and this one does the same but on<br> >>> the drm_sched side. This one looks cleaner, but panfrost should<br> >>> not call drm_sched_stop() twice for the same job.<br> >> I'm not sure that Hillf's fix is sufficient. In particular in<br> >> drm_sched_increase_karma() I don't understand how the smp_load_acquire()<br> >> call prevents bad->s_fence becoming NULL immediately afterwards (but<br> >> admittedly the window is much smaller). But really this is just a<br> >> Panfrost bug (calling drm_sched_stop() twice on the same job).<br> >><br> >> The part of my change that I'd welcome feedback on is changing<br> >> cancel_delayed_work() to cancel_delayed_work_sync() in drm_sched_stop()<br> >> when called on different scheduler to the bad job. It's not clear to me<br> >> exactly what the semantics of the function should be, and I haven't<br> >> tested the effect of the change on drivers other than Panfrost.<br> > <br> > Yeah, at least of hand that change doesn't seem to make sense to me.<br> <br> We need to ensure that any other timeouts that might have started<br> processing are complete before actually resetting the hardware.<br> Otherwise after the reset another thread could come along and attempt to<br> reset the hardware again (and cause a double free of a job). </div> </span></font></div> </blockquote> </div> </div> </div> <div dir="auto"><br> </div> <div dir="auto">This is intentional behaviour. If you don't want the double reset in Panfrost you should probably call the cancel_sync yourself.</div> <div dir="auto"><br> </div> <div dir="auto">Regards,</div> <div dir="auto">Christian.</div> <div dir="auto"><br> </div> <div dir="auto"><br> </div> <div dir="auto"> <div class="gmail_extra"> <div class="gmail_quote"> <blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div><font size="2"><span style="font-size:11pt"> <div>My change<br> to use the _sync() variant prevents this happening.<br> <br> > Multiple timeout workers can perfectly run in parallel, Panfrost needs <br> > to make sure that they don't affect each other.<br> > <br> > The simplest way of doing this is to have a mutex you trylock when <br> > starting the reset.<br> > <br> > The first one grabbing it wins, all other just silently return.<br> <br> Ah that would simplify my change removing the need for the new variable.<br> I hadn't thought to use trylock. I'll give that a spin and post a new<br> simpler version.<br> <br> Thanks,<br> <br> Steve<br> <br> > Regards,<br> > Christian.<br> > <br> >><br> >> Steve<br> >><br> >>> Neil<br> >>><br> >>>> Andrey<br> >>>><br> >>>>>> Does this change help the problem ? Note that drm_sched_job_cleanup is<br> >>>>>> called from scheduler thread which is stopped at all times when work_tdr<br> >>>>>> thread is running and anyway the 'bad' job is still in the<br> >>>>>> ring_mirror_list while it's being accessed from<br> >>>>>> drm_sched_increase_karma so I don't think drm_sched_job_cleanup can be<br> >>>>>> called for it BEFORE or while drm_sched_increase_karma is executed.<br> >>>>>><br> >>>>>> Andrey<br> >>>>>><br> >>>>>><br> >>>>>>> <br> >>>>>>> --<br> >>>>>>><br> >>> _______________________________________________<br> >>> dri-devel mailing list<br> >>> dri-devel@lists.freedesktop.org<br> >>> <a href="https://lists.freedesktop.org/mailman/listinfo/dri-devel">https://lists.freedesktop.org/mailman/listinfo/dri-devel</a><br> >>><br> > <br> > _______________________________________________<br> > dri-devel mailing list<br> > dri-devel@lists.freedesktop.org<br> > <a href="https://lists.freedesktop.org/mailman/listinfo/dri-devel">https://lists.freedesktop.org/mailman/listinfo/dri-devel</a><br> > <br> <br> </div> </span></font></div> </blockquote> </div> <br> </div> </div> </div> </body> </html>