[PATCH] drm/shmem-helper: Don't remove the offset in vm_area_struct pgoff

Thu Feb 18 17:15:53 UTC 2021

On Thu, Feb 18, 2021 at 10:51 AM Steven Price <steven.price at arm.com> wrote:
>
> On 18/02/2021 16:38, Rob Herring wrote:
> > On Thu, Feb 18, 2021 at 10:15 AM Steven Price <steven.price at arm.com> wrote:
> >>
> >> On 18/02/2021 15:45, Alyssa Rosenzweig wrote:
> >>>> Yeah plus Cc: stable for backporting and I think an igt or similar for
> >>>> panfrost to check this works correctly would be pretty good too. Since
> >>>> if it took us over 1 year to notice this bug it's pretty clear that
> >>>> normal testing doesn't catch this. So very likely we'll break this
> >>>> again.
> >>>
> >>> Unfortunately there are a lot of kernel bugs which are noticed during actual
> >>> use (but not CI runs), some of which have never been fixed. I do know
> >>> the shrinker impl is buggy for us, if this is the fix I'm very happy.
> >>
> >> I doubt this will actually "fix" anything - if I understand correctly
> >> then the sequence which is broken is:
> >>
> >>    * allocate BO, mmap to CPU
> >>    * madvise(DONTNEED)
> >>    * trigger purge
> >>    * try to access the BO memory
> >>
> >> which is an invalid sequence for user space - the attempt to access
> >> memory should cause a SIGSEGV. However because drm_vma_node_unmap() is
> >> unable to find the mappings there may still be page table entries
> >> present which would provide access to memory the kernel has freed. Which
> >> is of course a big security hole and so this fix is needed.
> >>
> >> In what way do you find the shrinker impl buggy? I'm aware there's some
> >> dodgy locking (although I haven't worked out how to fix it) - but AFAICT
> >> it's more deadlock territory rather than lacking in locks. Are there
> >> correctness issues?
> >
> > What's there was largely a result of getting lockdep happy.
> >
> >>>> btw for testing shrinkers recommended way is to have a debugfs file
> >>>> that just force-shrinks everything. That way you avoid all the trouble
> >>>> that tend to happen when you drive a system close to OOM on linux, and
> >>>> it's also much faster.
> >>>
> >>> 2nding this as a good idea.
> >>>
> >>
> >> Sounds like a good idea to me too. But equally I'm wondering whether the
> >> best (short term) solution is to actually disable the shrinker. I'm
> >> somewhat surprised that nobody has got fed up with the "Purging xxx
> >> bytes" message spam - which makes me think that most people are not
> >> hitting memory pressure to trigger the shrinker.
> >
> > If the shrinker is dodgy, then it's probably good to have the messages
> > to know if it ran.
> >
> >> The shrinker on kbase caused a lot of grief - and the only way I managed
> >> to get that under control was by writing a static analysis tool for the
> >> locking, and by upsetting people by enforcing the (rather dumb) rules of
> >> the tool on the code base. I've been meaning to look at whether sparse
> >> can do a similar check of locks.
> >
> > Lockdep doesn't cover it?
>
> Short answer: no ;)
>
> The problem with lockdep is that you have to trigger the locking
> scenario to get a warning out of it. For example you obviously won't get
> any warnings about the shrinker without triggering the shrinker (which
> means memory pressure since we don't have the debugfs file to trigger it).

Actually, you don't need debugfs. Writing to /proc/sys/vm/drop_caches
will do it. Though maybe there's other code path scenarios that
wouldn't cover.

> I have to admit I'm not 100% sure I've seen any lockdep warnings based
> on buffer objects recently. I can trigger them based on jobs:
>
> -----8<------
> [  265.764474] ======================================================
> [  265.771380] WARNING: possible circular locking dependency detected
> [  265.778294] 5.11.0-rc2+ #22 Tainted: G        W
> [  265.784148] ------------------------------------------------------
> [  265.791050] kworker/0:3/90 is trying to acquire lock:
> [  265.796694] c0d982b0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x38
> [  265.804994]
> [  265.804994] but task is already holding lock:
> [  265.811513] c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> [  265.821375]
> [  265.821375] which lock already depends on the new lock.
> [  265.821375]
> [  265.830524]
> [  265.830524] the existing dependency chain (in reverse order) is:
> [  265.838892]
> [  265.838892] -> #2 (&js->queue[j].lock){+.+.}-{3:3}:
> [  265.845996]        mutex_lock_nested+0x18/0x20
> [  265.850961]        panfrost_scheduler_stop+0x1c/0x94 [panfrost]
> [  265.857590]        panfrost_reset+0x54/0x1cc [panfrost]
> [  265.863441]        process_one_work+0x238/0x51c
> [  265.868503]        worker_thread+0x22c/0x2e0
> [  265.873270]        kthread+0x128/0x138
> [  265.877455]        ret_from_fork+0x14/0x38
> [  265.882028]        0x0
> [  265.884657]
> [  265.884657] -> #1 (dma_fence_map){++++}-{0:0}:
> [  265.891277]        dma_resv_lockdep+0x1b4/0x290
> [  265.896339]        do_one_initcall+0x5c/0x2e8
> [  265.901206]        kernel_init_freeable+0x184/0x1d4
> [  265.906651]        kernel_init+0x8/0x11c
> [  265.911029]        ret_from_fork+0x14/0x38
> [  265.915610]        0x0
> [  265.918247]
> [  265.918247] -> #0 (fs_reclaim){+.+.}-{0:0}:
> [  265.924579]        lock_acquire+0x3a4/0x45c
> [  265.929260]        __fs_reclaim_acquire+0x28/0x38
> [  265.934523]        slab_pre_alloc_hook.constprop.28+0x1c/0x64
> [  265.940948]        kmem_cache_alloc_trace+0x38/0x114
> [  265.946493]        panfrost_job_run+0x60/0x2b4 [panfrost]
> [  265.952540]        drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched]
> [  265.959256]        panfrost_reset+0x174/0x1cc [panfrost]
> [  265.965196]        process_one_work+0x238/0x51c
> [  265.970259]        worker_thread+0x22c/0x2e0
> [  265.975025]        kthread+0x128/0x138
> [  265.979210]        ret_from_fork+0x14/0x38
> [  265.983784]        0x0
> [  265.986412]
> [  265.986412] other info that might help us debug this:
> [  265.986412]
> [  265.995353] Chain exists of:
> [  265.995353]   fs_reclaim --> dma_fence_map --> &js->queue[j].lock
> [  265.995353]
> [  266.007028]  Possible unsafe locking scenario:
> [  266.007028]
> [  266.013638]        CPU0                    CPU1
> [  266.018694]        ----                    ----
> [  266.023750]   lock(&js->queue[j].lock);
> [  266.028033]                                lock(dma_fence_map);
> [  266.034648]                                lock(&js->queue[j].lock);
> [  266.041746]   lock(fs_reclaim);
> [  266.045252]
> [  266.045252]  *** DEADLOCK ***
> [  266.045252]
> [  266.051861] 4 locks held by kworker/0:3/90:
> [  266.056530]  #0: c18096a8 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> [  266.066258]  #1: c46d5f28 ((work_completion)(&pfdev->reset.work)){+.+.}-{0:0}, at: process_one_work+0x18c/0x51c
> [  266.077546]  #2: c0dfc5a0 (dma_fence_map){++++}-{0:0}, at: panfrost_reset+0x10/0x1cc [panfrost]
> [  266.087294]  #3: c49a348c (&js->queue[j].lock){+.+.}-{3:3}, at: panfrost_reset+0x124/0x1cc [panfrost]
> [  266.097624]
> [  266.097624] stack backtrace:
> [  266.102487] CPU: 0 PID: 90 Comm: kworker/0:3 Tainted: G        W         5.11.0-rc2+ #22
> [  266.111529] Hardware name: Rockchip (Device Tree)
> [  266.116773] Workqueue: events panfrost_reset [panfrost]
> [  266.122628] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> [  266.131288] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> [  266.139363] [<c07c3c28>] (dump_stack) from [<c0168760>] (check_noncircular+0x6c/0x90)
> [  266.148116] [<c0168760>] (check_noncircular) from [<c016a2c0>] (__lock_acquire+0xe90/0x16a0)
> [  266.157549] [<c016a2c0>] (__lock_acquire) from [<c016b5f8>] (lock_acquire+0x3a4/0x45c)
> [  266.166399] [<c016b5f8>] (lock_acquire) from [<c0247b98>] (__fs_reclaim_acquire+0x28/0x38)
> [  266.175637] [<c0247b98>] (__fs_reclaim_acquire) from [<c025445c>] (slab_pre_alloc_hook.constprop.28+0x1c/0x64)
> [  266.186820] [<c025445c>] (slab_pre_alloc_hook.constprop.28) from [<c0255714>] (kmem_cache_alloc_trace+0x38/0x114)
> [  266.198292] [<c0255714>] (kmem_cache_alloc_trace) from [<bf00d28c>] (panfrost_job_run+0x60/0x2b4 [panfrost])
> [  266.209295] [<bf00d28c>] (panfrost_job_run [panfrost]) from [<bf00102c>] (drm_sched_resubmit_jobs+0x88/0xc4 [gpu_sched])
> [  266.221476] [<bf00102c>] (drm_sched_resubmit_jobs [gpu_sched]) from [<bf00d0a0>] (panfrost_reset+0x174/0x1cc [panfrost])
> [  266.233649] [<bf00d0a0>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> [  266.243974] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> [  266.253115] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> [  266.261382] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> [  266.269456] Exception stack(0xc46d5fb0 to 0xc46d5ff8)
> [  266.275098] 5fa0:                                     00000000 00000000 00000000 00000000
> [  266.284236] 5fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  266.293373] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> -----8<------
>
> And indeed a sleeping in atomic bug:
> -----8<-----
> [  289.672380] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
> [  289.681210] rockchip-vop ff930000.vop: [drm:vop_crtc_atomic_flush] *ERROR* VOP vblank IRQ stuck for 10 ms
> [  289.681880] in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 31, name: kworker/0:1
> [  289.701901] INFO: lockdep is turned off.
> [  289.706339] irq event stamp: 9414
> [  289.710095] hardirqs last  enabled at (9413): [<c07cd1cc>] _raw_spin_unlock_irq+0x20/0x40
> [  289.719381] hardirqs last disabled at (9414): [<c07c9140>] __schedule+0x170/0x698
> [  289.727855] softirqs last  enabled at (9112): [<c077379c>] reg_todo+0x64/0x51c
> [  289.736044] softirqs last disabled at (9110): [<c077377c>] reg_todo+0x44/0x51c
> [  289.744233] CPU: 0 PID: 31 Comm: kworker/0:1 Tainted: G        W         5.11.0-rc2+ #22
> [  289.753375] Hardware name: Rockchip (Device Tree)
> [  289.758711] Workqueue: events panfrost_reset [panfrost]
> [  289.764886] [<c010f0c0>] (unwind_backtrace) from [<c010ad38>] (show_stack+0x10/0x14)
> [  289.773721] [<c010ad38>] (show_stack) from [<c07c3c28>] (dump_stack+0xa4/0xd0)
> [  289.781948] [<c07c3c28>] (dump_stack) from [<c01490f8>] (___might_sleep+0x1bc/0x244)
> [  289.790761] [<c01490f8>] (___might_sleep) from [<c07ca720>] (__mutex_lock+0x34/0x3c4)
> [  289.799654] [<c07ca720>] (__mutex_lock) from [<c07caac8>] (mutex_lock_nested+0x18/0x20)
> [  289.808732] [<c07caac8>] (mutex_lock_nested) from [<bf00b704>] (panfrost_gem_free_object+0x20/0x100 [panfrost])
> [  289.820358] [<bf00b704>] (panfrost_gem_free_object [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> [  289.832247] [<bf00ccb0>] (kref_put [panfrost]) from [<bf00ce0c>] (panfrost_job_cleanup+0x120/0x140 [panfrost])
> [  289.843936] [<bf00ce0c>] (panfrost_job_cleanup [panfrost]) from [<bf00ccb0>] (kref_put+0x38/0x5c [panfrost])
> [  289.855435] [<bf00ccb0>] (kref_put [panfrost]) from [<c054acb8>] (dma_fence_signal_timestamp_locked+0x1c0/0x1f8)
> [  289.867163] [<c054acb8>] (dma_fence_signal_timestamp_locked) from [<c054b1f8>] (dma_fence_signal+0x38/0x58)
> [  289.878207] [<c054b1f8>] (dma_fence_signal) from [<bf001388>] (drm_sched_job_done+0x58/0x148 [gpu_sched])
> [  289.889237] [<bf001388>] (drm_sched_job_done [gpu_sched]) from [<bf001524>] (drm_sched_start+0xa4/0xd0 [gpu_sched])
> [  289.901389] [<bf001524>] (drm_sched_start [gpu_sched]) from [<bf00d0ac>] (panfrost_reset+0x180/0x1cc [panfrost])
> [  289.913286] [<bf00d0ac>] (panfrost_reset [panfrost]) from [<c0139fd4>] (process_one_work+0x238/0x51c)
> [  289.923970] [<c0139fd4>] (process_one_work) from [<c013aaa0>] (worker_thread+0x22c/0x2e0)
> [  289.933257] [<c013aaa0>] (worker_thread) from [<c013fd64>] (kthread+0x128/0x138)
> [  289.941661] [<c013fd64>] (kthread) from [<c010011c>] (ret_from_fork+0x14/0x38)
> [  289.949867] Exception stack(0xc2243fb0 to 0xc2243ff8)
> [  289.955604] 3fa0:                                     00000000 00000000 00000000 00000000
> [  289.964840] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> [  289.974066] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> -----8<----
>
> Certainly here the mutex causing the problem is the shrinker_lock!
>
> The above is triggered by chucking a whole ton of jobs which
> fault at the GPU.
>
> Sadly I haven't found time to work out how to untangle the locks.

They are tricky because pretty much any memory allocation can trigger
things as I recall.

Rob