[RFC] dma-buf/fence: avoid holding lock while calling cb

Rob Clark robdclark at gmail.com
Sun Oct 16 17:56:55 UTC 2016


On Sun, Oct 16, 2016 at 1:38 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Sun, Oct 16, 2016 at 12:03:53PM -0400, Rob Clark wrote:
>> Currently with fence-array, we have a potential deadlock situation.  If we
>> fence_add_callback() on an array-fence, the array-fence's lock is aquired
>> first, and in it's ->enable_signaling() callback, it will install cb's on
>> it's array-member fences, so the array-member's lock is acquired second.
>>
>> But in the signal path, the array-member's lock is acquired first, and the
>> array-fence's lock acquired second.
>>
>> One approach to deal with this is avoid holding the fence's lock when
>> calling the cb.  It is a bit intrusive and I haven't fixed up all the
>> other drivers that call directly or indirectly fence_signal_locked().
>>
>> I guess the other option would be introduce a work-queue for array-fence?
>> Or??
>
> Not keen on any yet, still hoping for a gold ore.

yeah, that is why I sent these two approaches as RFC.. hopefully if
there is a better way, someone will speak up ;-)

I do think that, unless someone has a better idea, I like how the wq
in fence-array worked out better..

>> lockdep splat:
>>
>>  ======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 4.7.0-rc7+ #489 Not tainted
>> -------------------------------------------------------
>> surfaceflinger/2034 is trying to acquire lock:
>>  (&(&array->lock)->rlock){......}, at: [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>
>> but task is already holding lock:
>>  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&(&obj->child_list_lock)->rlock){......}:
>>        [<ffff000008108924>] __lock_acquire+0x173c/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858d05c>] fence_add_callback+0x3c/0x100
>>        [<ffff00000858f100>] fence_array_enable_signaling+0x80/0x170
>>        [<ffff00000858d0d8>] fence_add_callback+0xb8/0x100
>>        [<ffff00000858f504>] sync_file_poll+0xd4/0xf0
>>        [<ffff0000081fd3a0>] do_sys_poll+0x220/0x438
>>        [<ffff0000081fd8d0>] SyS_ppoll+0x1b0/0x1d8
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> -> #0 (&(&array->lock)->rlock){......}:
>>        [<ffff000008104768>] print_circular_bug+0x80/0x2e0
>>        [<ffff0000081089ac>] __lock_acquire+0x17c4/0x18d8
>>        [<ffff000008108e0c>] lock_acquire+0x4c/0x68
>>        [<ffff000008ac6a6c>] _raw_spin_lock_irqsave+0x54/0x70
>>        [<ffff00000858cddc>] fence_signal+0x5c/0xf8
>>        [<ffff00000858f268>] fence_array_cb_func+0x78/0x88
>>        [<ffff00000858cb28>] fence_signal_locked+0x80/0xe0
>>        [<ffff0000085903c8>] sw_sync_ioctl+0x2f8/0x3b0
>>        [<ffff0000081faf6c>] do_vfs_ioctl+0xa4/0x790
>>        [<ffff0000081fb6e4>] SyS_ioctl+0x8c/0xa0
>>        [<ffff000008084f30>] el0_svc_naked+0x24/0x28
>>
>> other info that might help us debug this:
>>
>>  Possible unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(&(&obj->child_list_lock)->rlock);
>>                                lock(&(&array->lock)->rlock);
>>                                lock(&(&obj->child_list_lock)->rlock);
>>   lock(&(&array->lock)->rlock);
>>
>>  *** DEADLOCK ***
>>
>> 1 lock held by surfaceflinger/2034:
>>  #0:  (&(&obj->child_list_lock)->rlock){......}, at: [<ffff0000085902f8>] sw_sync_ioctl+0x228/0x3b0
>> ---
>>  drivers/dma-buf/fence.c      | 22 ++++++++++++++--------
>>  drivers/dma-buf/sw_sync.c    |  2 +-
>>  drivers/dma-buf/sync_debug.c | 16 +++++++++-------
>>  include/linux/fence.h        |  6 +++---
>>  4 files changed, 27 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/dma-buf/fence.c b/drivers/dma-buf/fence.c
>> index cc05ddd..917f905 100644
>> --- a/drivers/dma-buf/fence.c
>> +++ b/drivers/dma-buf/fence.c
>> @@ -63,9 +63,9 @@ EXPORT_SYMBOL(fence_context_alloc);
>>   *
>>   * Unlike fence_signal, this function must be called with fence->lock held.
>>   */
>> -int fence_signal_locked(struct fence *fence)
>> +int fence_signal_locked(struct fence *fence, unsigned long flags)
>>  {
>> -     struct fence_cb *cur, *tmp;
>> +     struct fence_cb *cur;
>>       int ret = 0;
>>
>>       lockdep_assert_held(fence->lock);
>> @@ -88,9 +88,12 @@ int fence_signal_locked(struct fence *fence)
>>       } else
>>               trace_fence_signaled(fence);
>>
>> -     list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
>> +     while (!list_empty(&fence->cb_list)) {
>> +             cur = list_first_entry(&fence->cb_list, struct fence_cb, node);
>>               list_del_init(&cur->node);
>> +             spin_unlock_irqrestore(fence->lock, flags);
>
> We don't need to pass down flags here, as we don't have to change the
> context of the lock (i.e. we just leave irqs disabled if they have been
> by the caller). That spares us the issue of the flags being changed
> here, and the changed flags not being propagated back.
>
> We need only grab the list under the lock once, as once signaled no new
> elements will be added to the list (after add_callback takes the lock it
> checks for the signal).
>
> i.e.
>
> struct list_head cb_list;
>
> list_splice_init(&fence->cb_list, &cb_list;
> spin_unlock(fence->lock);
>
> list_for_each_entry_safe(cur, tmp, &cb_list, node)
>         cur->func(fence, cur);
>
> spin_lock(fence->lock);

hmm, yeah, not having to pass flags everywhere simplifies things..
although wasn't sure how much that would be abusing the spin_lock
interface..

> Though I'm not convinced that dropping the lock is going to be safe for
> everyone, since that lock may be communal and guarding move than just
> the cb_list, e.g. virtio_gpu_fence_event_process().

yeah, that is part of why I liked the wq in fence-array better.. if we
go with this first approach instead, I'll need to audit more closely
the other callers, which I haven't done yet, since what they are doing
might not be safe if we start dropping locks temporarily.. and it adds
an unexpected twist for future users of the API, not really following
the principle of least surprise.

BR,
-R

> -Chris
>
> --
> Chris Wilson, Intel Open Source Technology Centre


More information about the dri-devel mailing list