[PATCH] drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff

Thu Feb 18 18:20:06 UTC 2021

On Thu, Feb 18, 2021 at 6:16 PM Rob Herring <robh+dt at kernel.org> wrote:
>
> On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price at arm.com> wrote:
> >
> > On 18/02/2021 16:38, Rob Herring wrote:
> > > On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price at arm.com> wrote:
> > >>
> > >> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> > >>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> > >>>> panfrost to check this works correctly would be pretty good too. Since
> > >>>> if it took us over 1 year to notice this bug it's pretty clear that
> > >>>> normal testing doesn't catch this. So very likely we'll break this
> > >>>> again.
> > >>>
> > >>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> > >>> use (but not CI runs), some of which have never been fixed. I do know
> > >>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
> > >>
> > >> I doubt this will actually "fix" anything - if I understand correctly
> > >> then the sequence which is broken is:
> > >>
> > >>    * allocate BO, mmap to CPU
> > >>    * madvise(DONTNEED)
> > >>    * trigger purge
> > >>    * try to access the BO memory
> > >>
> > >> which is an invalid sequence for user space - the attempt to access
> > >> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> > >> unable to find the mappings there may still be page table entries
> > >> present which would provide access to memory the kernel has freed. Which
> > >> is of course a big security hole and so this fix is needed.
> > >>
> > >> In what way do you find the shrinker impl buggy? I'm aware there's some
> > >> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> > >> it's more deadlock territory rather than lacking in locks. Are there
> > >> correctness issues?
> > >
> > > What's there was largely a result of getting lockdep happy.
> > >
> > >>>> btw for testing shrinkers recommended way is to have a debugfs file
> > >>>> that just force-shrinks everything. That way you avoid all the trouble
> > >>>> that tend to happen when you drive a system close to OOM on linux, and
> > >>>> it's also much faster.
> > >>>
> > >>> 2nding this as a good idea.
> > >>>
> > >>
> > >> Sounds like a good idea to me too. But equally I'm wondering whether the
> > >> best (short term) solution is to actually disable the shrinker. I'm
> > >> somewhat surprised that nobody has got fed up with the "Purging xxx
> > >> bytes" message spam - which makes me think that most people are not
> > >> hitting memory pressure to trigger the shrinker.
> > >
> > > If the shrinker is dodgy, then it's probably good to have the messages
> > > to know if it ran.
> > >
> > >> The shrinker on kbase caused a lot of grief - and the only way I managed
> > >> to get that under control was by writing a static analysis tool for the
> > >> locking, and by upsetting people by enforcing the (rather dumb) rules of
> > >> the tool on the code base. I've been meaning to look at whether sparse
> > >> can do a similar check of locks.
> > >
> > > Lockdep doesn't cover it?
> >
> > Short answer: no ;)

It's pretty good actually, if you correctly annotate things up.

> > The problem with lockdep is that you have to trigger the locking
> > scenario to get a warning out of it. For example you obviously won't get
> > any warnings about the shrinker without triggering the shrinker (which
> > means memory pressure since we don't have the debugfs file to trigger it).
>
> Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
> will do it. Though maybe there's other code path scenarios that
> wouldn't cover.

Huh didn't know, but it's a bit a shotgun, plus it doesn't use
fs_reclaim shrinker annotations, which means you don't have lockdep
checks. I think at least, would need some deadlock and testing.
>
>
> > I have to admit I'm not 100% sure I've seen any lockdep warnings based
> > on buffer objects recently. I can trigger them based on jobs:
> >
> > -----8<------
> > [  265.764474] ======================================================
> > [  265.771380] WARNING: possible circular locking dependency detected
> > [  265.778294] 5.11.0-rc2+ #22 Tainted: G        W
> > [  265.784148] ------------------------------------------------------
> > [  265.791050] kworker/0:3/90 is trying to acquire lock:
> > [  265.796694] c0d982b0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x38
> > [  265.804994]
> > [  265.804994] but task is already holding lock:
> > [  265.811513] c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> > [  265.821375]
> > [  265.821375] which lock already depends on the new lock.
> > [  265.821375]
> > [  265.830524]
> > [  265.830524] the existing dependency chain (in reverse order) is:
> > [  265.838892]
> > [  265.838892] -> #2 (&js->queue[j].lock){+.+.}-{3:3}:
> > [  265.845996]        mutex_lock_nested+0x18/0x20
> > [  265.850961]        panfrost_scheduler_stop+0x1c/0x94 [panfrost]
> > [  265.857590]        panfrost_reset+0x54/0x1cc [panfrost]
> > [  265.863441]        process_one_work+0x238/0x51c
> > [  265.868503]        worker_thread+0x22c/0x2e0
> > [  265.873270]        kthread+0x128/0x138
> > [  265.877455]        ret_from_fork+0x14/0x38
> > [  265.882028]        0x0
> > [  265.884657]
> > [  265.884657] -> #1 (dma_fence_map){++++}-{0:0}:
> > [  265.891277]        dma_resv_lockdep+0x1b4/0x290
> > [  265.896339]        do_one_initcall+0x5c/0x2e8
> > [  265.901206]        kernel_init_freeable+0x184/0x1d4
> > [  265.906651]        kernel_init+0x8/0x11c
> > [  265.911029]        ret_from_fork+0x14/0x38
> > [  265.915610]        0x0
> > [  265.918247]
> > [  265.918247] -> #0 (fs_reclaim){+.+.}-{0:0}:
> > [  265.924579]        lock_acquire+0x3a4/0x45c
> > [  265.929260]        __fs_reclaim_acquire+0x28/0x38
> > [  265.934523]        slab_pre_alloc_hook.constprop.28+0x1c/0x64
> > [  265.940948]        kmem_cache_alloc_trace+0x38/0x114
> > [  265.946493]        panfrost_job_run+0x60/0x2b4 [panfrost]
> > [  265.952540]        drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched]
> > [  265.959256]        panfrost_reset+0x174/0x1cc [panfrost]
> > [  265.965196]        process_one_work+0x238/0x51c
> > [  265.970259]        worker_thread+0x22c/0x2e0
> > [  265.975025]        kthread+0x128/0x138
> > [  265.979210]        ret_from_fork+0x14/0x38
> > [  265.983784]        0x0
> > [  265.986412]
> > [  265.986412] other info that might help us debug this:
> > [  265.986412]
> > [  265.995353] Chain exists of:
> > [  265.995353]   fs_reclaim --> dma_fence_map --> &js->queue[j].lock
> > [  265.995353]
> > [  266.007028]  Possible unsafe locking scenario:
> > [  266.007028]
> > [  266.013638]        CPU0                    CPU1
> > [  266.018694]        ----                    ----
> > [  266.023750]   lock(&js->queue[j].lock);
> > [  266.028033]                                lock(dma_fence_map);
> > [  266.034648]                                lock(&js->queue[j].lock);
> > [  266.041746]   lock(fs_reclaim);
> > [  266.045252]
> > [  266.045252]  *** DEADLOCK ***
> > [  266.045252]
> > [  266.051861] 4 locks held by kworker/0:3/90:
> > [  266.056530]  #0: c18096a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> > [  266.066258]  #1: c46d5f28 ((work_completion)(&pfdev->reset.work)){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> > [  266.077546]  #2: c0dfc5a0 (dma_fence_map){++++}-{0:0}, at: panfrost_reset+0x10/0x1cc [panfrost]
> > [  266.087294]  #3: c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> > [  266.097624]
> > [  266.097624] stack backtrace:
> > [  266.102487] CPU: 0 PID: 90 Comm: kworker/0:3 Tainted: G        W         5.11.0-rc2+ #22
> > [  266.111529] Hardware name: Rockchip (Device Tree)
> > [  266.116773] Workqueue: events panfrost_reset [panfrost]
> > [  266.122628] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> > [  266.131288] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> > [  266.139363] [<c07c3c28>] (dump_stack) from [<c0168760>] (check_noncircular+0x6c/0x90)
> > [  266.148116] [<c0168760>] (check_noncircular) from [<c016a2c0>] (__lock_acquire+0xe90/0x16a0)
> > [  266.157549] [<c016a2c0>] (__lock_acquire) from [<c016b5f8>] (lock_acquire+0x3a4/0x45c)
> > [  266.166399] [<c016b5f8>] (lock_acquire) from [<c0247b98>] (__fs_reclaim_acquire+0x28/0x38)
> > [  266.175637] [<c0247b98>] (__fs_reclaim_acquire) from [<c025445c>] (slab_pre_alloc_hook.constprop.28+0x1c/0x64)
> > [  266.186820] [<c025445c>] (slab_pre_alloc_hook.constprop.28) from [<c0255714>] (kmem_cache_alloc_trace+0x38/0x114)
> > [  266.198292] [<c0255714>] (kmem_cache_alloc_trace) from [<bf00d28c>] (panfrost_job_run+0x60/0x2b4 [panfrost])
> > [  266.209295] [<bf00d28c>] (panfrost_job_run [panfrost]) from [<bf00102c>] (drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched])
> > [  266.221476] [<bf00102c>] (drm_sched_resubmit_jobs [gpu_sched]) from [<bf00d0a0>] (panfrost_reset+0x174/0x1cc [panfrost])
> > [  266.233649] [<bf00d0a0>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> > [  266.243974] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> > [  266.253115] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> > [  266.261382] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> > [  266.269456] Exception stack(0xc46d5fb0 to 0xc46d5ff8)
> > [  266.275098] 5fa0:                                     00000000 00000000 00000000 00000000
> > [  266.284236] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [  266.293373] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > -----8<------
> >
> > And indeed a sleeping in atomic bug:
> > -----8<-----
> > [  289.672380] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> > [  289.681210] rockchip-vop ff930000.vop: [drm:vop_crtc_atomic_flush] *ERROR* VOP vblank IRQ stuck for 10 ms
> > [  289.681880] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 31, name: kworker/0:1
> > [  289.701901] INFO: lockdep is turned off.
> > [  289.706339] irq event stamp: 9414
> > [  289.710095] hardirqs last  enabled at (9413): [<c07cd1cc>] _raw_spin_unlock_irq+0x20/0x40
> > [  289.719381] hardirqs last disabled at (9414): [<c07c9140>] __schedule+0x170/0x698
> > [  289.727855] softirqs last  enabled at (9112): [<c077379c>] reg_todo+0x64/0x51c
> > [  289.736044] softirqs last disabled at (9110): [<c077377c>] reg_todo+0x44/0x51c
> > [  289.744233] CPU: 0 PID: 31 Comm: kworker/0:1 Tainted: G        W         5.11.0-rc2+ #22
> > [  289.753375] Hardware name: Rockchip (Device Tree)
> > [  289.758711] Workqueue: events panfrost_reset [panfrost]
> > [  289.764886] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> > [  289.773721] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> > [  289.781948] [<c07c3c28>] (dump_stack) from [<c01490f8>] (___might_sleep+0x1bc/0x244)
> > [  289.790761] [<c01490f8>] (___might_sleep) from [<c07ca720>] (__mutex_lock+0x34/0x3c4)
> > [  289.799654] [<c07ca720>] (__mutex_lock) from [<c07caac8>] (mutex_lock_nested+0x18/0x20)
> > [  289.808732] [<c07caac8>] (mutex_lock_nested) from [<bf00b704>] (panfrost_gem_free_object+0x20/0x100 [panfrost])
> > [  289.820358] [<bf00b704>] (panfrost_gem_free_object [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> > [  289.832247] [<bf00ccb0>] (kref_put [panfrost]) from [<bf00ce0c>] (panfrost_job_cleanup+0x120/0x140 [panfrost])
> > [  289.843936] [<bf00ce0c>] (panfrost_job_cleanup [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> > [  289.855435] [<bf00ccb0>] (kref_put [panfrost]) from [<c054acb8>] (dma_fence_signal_timestamp_locked+0x1c0/0x1f8)
> > [  289.867163] [<c054acb8>] (dma_fence_signal_timestamp_locked) from [<c054b1f8>] (dma_fence_signal+0x38/0x58)
> > [  289.878207] [<c054b1f8>] (dma_fence_signal) from [<bf001388>] (drm_sched_job_done+0x58/0x148 [gpu_sched])
> > [  289.889237] [<bf001388>] (drm_sched_job_done [gpu_sched]) from [<bf001524>] (drm_sched_start+0xa4/0xd0 [gpu_sched])
> > [  289.901389] [<bf001524>] (drm_sched_start [gpu_sched]) from [<bf00d0ac>] (panfrost_reset+0x180/0x1cc [panfrost])
> > [  289.913286] [<bf00d0ac>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> > [  289.923970] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> > [  289.933257] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> > [  289.941661] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> > [  289.949867] Exception stack(0xc2243fb0 to 0xc2243ff8)
> > [  289.955604] 3fa0:                                     00000000 00000000 00000000 00000000
> > [  289.964840] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [  289.974066] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> > -----8<----
> >
> > Certainly here the mutex causing the problem is the shrinker_lock!
> >
> > The above is triggered by chucking a whole ton of jobs which
> > fault at the GPU.
> >
> > Sadly I haven't found time to work out how to untangle the locks.
>
> They are tricky because pretty much any memory allocation can trigger
> things as I recall.

The above should only be possible with my dma_fence annotations, and
yes the point to bugs in the drm/scheduler. They shouldn't matter for
panfrost, and those patches aren't in upstream yet.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch