[Intel-gfx] [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged

Tue Oct 10 12:30:27 UTC 2017

On Tue, Oct 10, 2017 at 10:21:45AM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-10-09 17:44:01)
> > stop_machine is not really a locking primitive we should use, except
> > when the hw folks tell us the hw is broken and that's the only way to
> > work around it.
> > 
> > This patch tries to address the locking abuse of stop_machine() from
> > 
> > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > Date:   Tue Nov 22 14:41:21 2016 +0000
> > 
> >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > 
> > Chris said parts of the reasons for going with stop_machine() was that
> > it's no overhead for the fast-path. But these callbacks use irqsave
> > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> > 
> > To stay as close as possible to the stop_machine semantics we first
> > update all the submit function pointers to the nop handler, then call
> > synchronize_rcu() to make sure no new requests can be submitted. This
> > should give us exactly the huge barrier we want.
> > 
> > I pondered whether we should annotate engine->submit_request as __rcu
> > and use rcu_assign_pointer and rcu_dereference on it. But the reason
> > behind those is to make sure the compiler/cpu barriers are there for
> > when you have an actual data structure you point at, to make sure all
> > the writes are seen correctly on the read side. But we just have a
> > function pointer, and .text isn't changed, so no need for these
> > barriers and hence no need for annotations.
> > 
> > Unfortunately there's a complication with the call to
> > intel_engine_init_global_seqno:
> 
> This is still broken in the same way as nop_submit_request may execute
> while you sleep, breaking cancel_requests.

I guess then I didn't understand which race you mean, since I think the
one I've found should be fixed now. Can you pls explain in more detail
what exactly is racing against what else?

>From what I can tell legacy cancel_request is entirely under the spinlock,
so hard to race, same for lrc. And with the global seqno update untangled,
they shouldn't complete too early anymore, which I thought was the race
you pointed out on the previous thread. I did reply to that to check my
understanding, but didn't get a reply.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch