[PATCH] dma-buf/sw_sync: Fix timeline/pt overflow cases

Dominik Behr dbehr at chromium.org
Wed Jun 28 22:17:27 UTC 2017


I think the kernel's handling of Android fences has regressed: several
behaviors were gradually broken as the code was de-staged:

1. They allowed the fence/timeline value (seqno) to overflow and roll over;
this broke only if the delta between the timeline and an earlier unsignaled
fence exceeded 2^31 or so. Android relies on that behavior.
It was also possible to clean up a timeline by advancing it, for instance
by calling timeline_inc(0x7FFFFFFF) twice. This doesn't work anymore:
if no one is actively waiting on a fence, active_list is empty and
timeline_inc no longer signals any fences; it just increments the timeline
value and lets it roll over without any effect.
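
To make the rollover arithmetic in point 1 concrete, here is a minimal
userspace sketch. It is not the kernel code: signaled_naive() and
signaled_wrap() are hypothetical names, and signaled_wrap() is modeled on
the (int)(fence->seqno - parent->value) <= 0 test suggested in the quoted
thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical naive test: plain unsigned compare, ignores wraparound. */
bool signaled_naive(uint32_t value, uint32_t seqno)
{
	return value >= seqno;
}

/*
 * Hypothetical wrap-aware test: the fence counts as signaled when seqno
 * is at or behind value, within a window of 2^31. The cast relies on the
 * usual two's-complement conversion, as kernel code does.
 */
bool signaled_wrap(uint32_t value, uint32_t seqno)
{
	return (int32_t)(seqno - value) <= 0;
}

/* Advance a timeline value; uint32_t arithmetic wraps modulo 2^32. */
uint32_t timeline_advance(uint32_t value, uint32_t inc)
{
	return value + inc;
}
```

With value = 100 and an unsignaled fence at seqno 105, the first
timeline_advance(value, 0x7FFFFFFF) makes signaled_wrap() true. The second
one wraps value around to 98, and signaled_wrap() is false again. So the
double increment only cleans up if the fence state is latched at the first
step, which is what no longer happens when active_list is empty.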

2. They guaranteed that when a timeline was destroyed, all fences on that
timeline were signaled (or put into an error state). Thus, if a timeline
was destroyed, or if the process that owned it died, processes that
depended on it would not hang. Android also relies on that behavior for
cleanup, for example when a process crashes.
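
The guarantee in point 2 can be pictured with a small userspace model.
Everything here is hypothetical (struct toy_pt, toy_timeline_destroy(), and
the choice of -ENOENT as the error status are assumptions for illustration);
the point is only that no fence is left in the active state once its
timeline is gone.

```c
#include <errno.h>
#include <stddef.h>

/* Toy sync point: status 0 = active, 1 = signaled, negative = error. */
struct toy_pt {
	unsigned int seqno;
	int status;
};

/*
 * Toy model of timeline teardown: every still-active point is forced
 * into an error state so that a waiter would wake up instead of hanging.
 * Returns how many points had to be forced.
 */
size_t toy_timeline_destroy(struct toy_pt *pts, size_t n)
{
	size_t forced = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pts[i].status == 0) {
			pts[i].status = -ENOENT; /* assumed error code */
			forced++;
		}
	}
	return forced;
}
```

Points that had already signaled keep their status; only the ones that
could still block a waiter are touched.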

3. Judging from some stack traces I have seen, timeline_inc signals fences
while holding a spinlock, which can cause fence_array to call
fence_array_release, which in turn calls flush_work() and may sleep,
e.g.:
<4>[  140.762030]  [<ffffffff858939fb>] dump_stack+0x4d/0x63
<4>[  140.762035]  [<ffffffff8568b593>] ___might_sleep+0x149/0x14e
<4>[  140.762039]  [<ffffffff8568b637>] __might_sleep+0x9f/0xa6
<4>[  140.762045]  [<ffffffff8567ef2a>] flush_work+0x39/0x19a
<4>[  140.762049]  [<ffffffff8568eb5c>] ? try_to_wake_up+0x20b/0x21b
<4>[  140.762055]  [<ffffffff85a54cea>] fence_array_release+0x2e/0x63
<4>[  140.762058]  [<ffffffff85a53a65>] fence_release+0x82/0x8e
<4>[  140.762061]  [<ffffffff85a54cba>] fence_put+0x15/0x17
<4>[  140.762065]  [<ffffffff85a54e08>] fence_array_cb_func+0x1f/0x39
<4>[  140.762068]  [<ffffffff85a53881>] fence_signal_locked+0x8e/0xa3
<4>[  140.762072]  [<ffffffff85a55cda>] sync_timeline_signal+0xcd/0x10a
<4>[  140.762075]  [<ffffffff85a5613b>] sw_sync_ioctl+0x159/0x17f
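
One common way to avoid this class of problem is to decide what has
signaled under the lock and to drop references only after unlocking. The
userspace sketch below illustrates that pattern with hypothetical names
(toy_timeline, toy_fence, a plain bool standing in for the spinlock); it is
not a proposed patch and not necessarily how sw_sync should be fixed.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

struct toy_timeline;

struct toy_fence {
	struct toy_timeline *tl;
	unsigned int seqno;
	bool signaled;
	bool released_outside_lock; /* records where release ran */
};

struct toy_timeline {
	bool locked; /* stands in for the spinlock */
	unsigned int value;
	struct toy_fence *pending[TOY_MAX_PENDING];
	size_t npending;
};

/* Stands in for a release path that might sleep (e.g. flush_work()). */
void toy_fence_release(struct toy_fence *f)
{
	f->released_outside_lock = !f->tl->locked;
}

/* Advance the timeline; returns the number of fences signaled. */
size_t toy_timeline_inc(struct toy_timeline *tl, unsigned int inc)
{
	struct toy_fence *done[TOY_MAX_PENDING];
	size_t ndone = 0, keep = 0, i;

	tl->locked = true;
	tl->value += inc;
	for (i = 0; i < tl->npending; i++) {
		struct toy_fence *f = tl->pending[i];

		if ((int)(f->seqno - tl->value) <= 0) {
			f->signaled = true;
			done[ndone++] = f;       /* defer the release */
		} else {
			tl->pending[keep++] = f; /* still pending */
		}
	}
	tl->npending = keep;
	tl->locked = false;

	/* Potentially sleeping work happens only after the unlock. */
	for (i = 0; i < ndone; i++)
		toy_fence_release(done[i]);

	return ndone;
}
```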


On Wed, Jun 28, 2017 at 2:00 PM, Sean Paul <seanpaul at chromium.org> wrote:

> On Wed, Jun 28, 2017 at 08:45:55PM +0100, Chris Wilson wrote:
> > Quoting Sean Paul (2017-06-28 17:47:24)
> > > On Wed, Jun 28, 2017 at 05:00:20PM +0100, Chris Wilson wrote:
> > > > Quoting Sean Paul (2017-06-28 16:51:11)
> > > > > Protect against long-running processes from overflowing the timeline
> > > > > and creating fences that go back in time. While we're at it, avoid
> > > > > overflowing while we're incrementing the timeline.
> > > > >
> > > > > Signed-off-by: Sean Paul <seanpaul at chromium.org>
> > > > > ---
> > > > >  drivers/dma-buf/sw_sync.c | 7 ++++++-
> > > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> > > > > index 69c5ff36e2f9..40934619ed88 100644
> > > > > --- a/drivers/dma-buf/sw_sync.c
> > > > > +++ b/drivers/dma-buf/sw_sync.c
> > > > > @@ -142,7 +142,7 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
> > > > >
> > > > >         spin_lock_irqsave(&obj->child_list_lock, flags);
> > > > >
> > > > > -       obj->value += inc;
> > > > > +       obj->value += min(inc, ~0x0U - obj->value);
> > > >
> > > > The timeline uses u32 seqno, so just obj->value += min(inc, INT_MAX);
> > > >
> > > Hi Chris,
> > > Thanks for the review.
> > >
> > > I don't think that solves the same problem I was trying to solve. The issue is
> > > that android userspace increments value by 0x7fffffff twice in order to ensure
> > > all fences have signaled. This is causing value to overflow and is_signaled will
> > > never be true. With your snippet, the possibility of overflow still exists.
> > >
> > > > Better of course would be to report the error,
> > >
> > > AFAIK, it's not an error to jump the timeline, perhaps just bad taste. Capping
> > > value at UINT_MAX will ensure all fences are signaled, and the check below ensures
> > > that fences can't be created beyond that (returning an error at that point in
> > > time).
> >
> > UINT_MAX doesn't imply all fences will be signaled either, the timeline
> > is supposed to wrap.
> >
> > The issue is timeline_fence_signaled() is using the wrong test, it
> > should be return (int)(fence->seqno - parent->value) <= 0; If it helps
> > extract a little helper from dma_fence_is_later().
>
> Understood, thank you for clarifying. This still doesn't solve the issue of userspace
> jumping the timeline by INT_MAX multiple times. In that case, value will rollover and
> even the new signaled() will fail to report.
>
> Sean
>
> > -Chris
>
> --
> Sean Paul, Software Engineer, Google / Chromium OS