blocking ops in drm_sched_cleanup_jobs()

Mon Sep 16 14:24:46 UTC 2019

On Mon, Sep 16, 2019 at 10:11 AM Koenig, Christian
<Christian.Koenig at amd.com> wrote:
>
> Hi Steven,
>
> the problem seems to be that panfrost is trying to sleep while freeing a
> job. E.g. it tries to take a mutex.
>
> That is not allowed any more since we need to free the jobs from atomic
> and even interrupt context.
>
> Your suggestion wouldn't work because this way jobs are not freed when
> there isn't a new one to be scheduled.

One fix would be to make sure that any calls to
drm_sched_cleanup_jobs are atomic, by putting preempt_disable/enable
or local_irq_disable/enable in there, at least when lockdep or sleep
debugging is enabled. That should help catch these reliably, instead
of just once in a blue moon.
-Daniel
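
To illustrate why the forced-atomic check would fire every time, here
is a small userspace model. It is not kernel code and is only a sketch
under assumptions: every name in it (mock_might_sleep,
mock_cleanup_jobs, mock_cleanup_jobs_checked, run_wait) is a
hypothetical stand-in, and the wait loop is reduced to two condition
evaluations, one before and one after the task state changes.

```c
#include <stdbool.h>

/* Userspace mock of the state that the kernel's might_sleep() inspects. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE };

static enum task_state state = TASK_RUNNING;
static int preempt_count;   /* > 0 models "atomic context" */
static int warnings;        /* number of mock might_sleep() splats */

/* Warn when called with a non-running task state or in atomic context. */
static void mock_might_sleep(void)
{
	if (state != TASK_RUNNING || preempt_count > 0)
		warnings++;
}

/*
 * Stand-in for the cleanup call. When there is a job to free, the
 * (hypothetical) driver callback takes a mutex, which starts with a
 * might_sleep() check.
 */
static void mock_cleanup_jobs(bool has_job_to_free)
{
	if (has_job_to_free)
		mock_might_sleep();
}

/* The suggestion above: make the call atomic when debugging is on. */
static void mock_cleanup_jobs_checked(bool debug, bool has_job_to_free)
{
	if (debug)
		preempt_count++;    /* models preempt_disable() */
	mock_cleanup_jobs(has_job_to_free);
	if (debug)
		preempt_count--;    /* models preempt_enable() */
}

/*
 * Model of the wait: the condition is evaluated once while the task is
 * still TASK_RUNNING and once more after the state has been changed.
 * job_ready_on_pass (0 or 1) selects which evaluation finds a job.
 * Returns the number of warnings raised.
 */
static int run_wait(bool debug, int job_ready_on_pass)
{
	warnings = 0;

	state = TASK_RUNNING;
	mock_cleanup_jobs_checked(debug, job_ready_on_pass == 0);

	state = TASK_INTERRUPTIBLE;     /* models prepare_to_wait_event() */
	mock_cleanup_jobs_checked(debug, job_ready_on_pass == 1);

	state = TASK_RUNNING;           /* models finish_wait() */
	return warnings;
}
```

In this model, without the debug wrapper only the rare case
(job_ready_on_pass == 1) warns, which matches the splat having been
seen once. With the wrapper enabled, both cases warn, so a sleeping
free_job callback would be flagged on the common path as well.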

>
> Regards,
> Christian.
>
> Am 13.09.19 um 16:50 schrieb Steven Price:
> > Hi,
> >
> > I hit the below splat randomly with panfrost. From what I can tell this
> > is a more general issue which would affect other drivers.
> >
> > ----8<-----
> > [58604.913130] ------------[ cut here ]------------
> > [58604.918590] WARNING: CPU: 1 PID: 1758 at kernel/sched/core.c:6556 __might_sleep+0x74/0x98
> > [58604.927965] do not call blocking ops when !TASK_RUNNING; state=1 set at [<0c590494>] prepare_to_wait_event+0x104/0x164
> > [58604.940047] Modules linked in: panfrost gpu_sched
> > [58604.945370] CPU: 1 PID: 1758 Comm: pan_js Not tainted 5.3.0-rc1+ #13
> > [58604.952500] Hardware name: Rockchip (Device Tree)
> > [58604.957815] [<c0111150>] (unwind_backtrace) from [<c010c99c>] (show_stack+0x10/0x14)
> > [58604.966521] [<c010c99c>] (show_stack) from [<c07adbb4>] (dump_stack+0x9c/0xd4)
> > [58604.974639] [<c07adbb4>] (dump_stack) from [<c0121da8>] (__warn+0xe8/0x104)
> > [58604.982462] [<c0121da8>] (__warn) from [<c0121e08>] (warn_slowpath_fmt+0x44/0x6c)
> > [58604.990867] [<c0121e08>] (warn_slowpath_fmt) from [<c014eccc>] (__might_sleep+0x74/0x98)
> > [58604.999973] [<c014eccc>] (__might_sleep) from [<c07c73d8>] (__mutex_lock+0x38/0x948)
> > [58605.008690] [<c07c73d8>] (__mutex_lock) from [<c07c7d00>] (mutex_lock_nested+0x18/0x20)
> > [58605.017841] [<c07c7d00>] (mutex_lock_nested) from [<bf00b54c>] (panfrost_gem_free_object+0x60/0x10c [panfrost])
> > [58605.029430] [<bf00b54c>] (panfrost_gem_free_object [panfrost]) from [<bf00cecc>] (panfrost_job_put+0x138/0x150 [panfrost])
> > [58605.042076] [<bf00cecc>] (panfrost_job_put [panfrost]) from [<bf00121c>] (drm_sched_cleanup_jobs+0xc8/0xe0 [gpu_sched])
> > [58605.054417] [<bf00121c>] (drm_sched_cleanup_jobs [gpu_sched]) from [<bf001300>] (drm_sched_main+0xcc/0x26c [gpu_sched])
> > [58605.066620] [<bf001300>] (drm_sched_main [gpu_sched]) from [<c0146cfc>] (kthread+0x13c/0x154)
> > [58605.076226] [<c0146cfc>] (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
> > [58605.084346] Exception stack(0xe959bfb0 to 0xe959bff8)
> > [58605.090046] bfa0:                                     00000000 00000000 00000000 00000000
> > [58605.099250] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [58605.108480] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > [58605.116210] irq event stamp: 179
> > [58605.119955] hardirqs last  enabled at (187): [<c017f7e4>] console_unlock+0x564/0x5c4
> > [58605.128935] hardirqs last disabled at (202): [<c017f308>] console_unlock+0x88/0x5c4
> > [58605.137788] softirqs last  enabled at (216): [<c0102334>] __do_softirq+0x18c/0x548
> > [58605.146543] softirqs last disabled at (227): [<c0129528>] irq_exit+0xc4/0x10c
> > [58605.154618] ---[ end trace f65bdbd9ea9adfc0 ]---
> > ----8<-----
> >
> > The problem is that drm_sched_main() calls drm_sched_cleanup_jobs() as
> > part of the condition of wait_event_interruptible:
> >
> >>              wait_event_interruptible(sched->wake_up_worker,
> >>                                       (drm_sched_cleanup_jobs(sched),
> >>                                       (!drm_sched_blocked(sched) &&
> >>                                        (entity = drm_sched_select_entity(sched))) ||
> >>                                       kthread_should_stop()));
> > When drm_sched_cleanup_jobs() is called *after* a wait (i.e. after
> > prepare_to_wait_event() has been called), then any might_sleep() will
> > moan loudly about it. This doesn't seem to happen often (I've only
> > triggered it once) because usually drm_sched_cleanup_jobs() either
> > doesn't sleep or does the sleeping during the first call that
> > wait_event_interruptible() makes (which is before the task state is set).
> >
> > I don't really understand why drm_sched_cleanup_jobs() needs to be
> > called here; a simple change like the one below 'fixes' it. But I presume
> > there's some reason for the call being part of the
> > wait_event_interruptible condition. Can anyone shed light on this?
> >
> > The code was introduced in commit 5918045c4ed4 ("drm/scheduler: rework job destruction")
> >
> > Steve
> >
> > ----8<-----
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 9a0ee74d82dc..528f295e3a31 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -699,11 +699,12 @@ static int drm_sched_main(void *param)
> >               struct drm_sched_job *sched_job;
> >               struct dma_fence *fence;
> >
> > +             drm_sched_cleanup_jobs(sched);
> > +
> >               wait_event_interruptible(sched->wake_up_worker,
> > -                                      (drm_sched_cleanup_jobs(sched),
> >                                        (!drm_sched_blocked(sched) &&
> >                                         (entity = drm_sched_select_entity(sched))) ||
> > -                                      kthread_should_stop()));
> > +                                      kthread_should_stop());
> >
> >               if (!entity)
> >                       continue;


-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch