[Intel-gfx] [PATCH] drm/i915: Use rcu instead of stop_machine

Chris Wilson chris at chris-wilson.co.uk
Fri Oct 6 08:42:13 UTC 2017


Quoting Daniel Vetter (2017-10-05 17:12:35)
> On Thu, Oct 05, 2017 at 03:30:12PM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2017-10-05 15:09:48)
> > > stop_machine is not really a locking primitive we should use, except
> > > when the hw folks tell us the hw is broken and that's the only way to
> > > work around it.
> > > 
> > > This patch here is just a suggestion for how to fix it up, possible
> > > changes needed to make it actually work:
> > > 
> > > - Set the nop_submit_request first for _all_ engines, before
> > >   proceeding.
> > > 
> > > - Make sure engine->cancel_requests copes with the possibility that
> > >   not all tests have consistently used the new or old version. I dont
> > >   think this is a problem, since the same can happen really with the
> > >   stop_machine() locking - stop_machine also doesn't give you any kind
> > >   of global ordering against other cpu threads, it just makes them
> > >   stop.
> > > 
> > > This patch tries to address the locking snafu from
> > 
> > There's a locking snafu in the code?
> >  
> > > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > > Date:   Tue Nov 22 14:41:21 2016 +0000
> > > 
> > >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > > 
> > > Chris said parts of the reasons for going with stop_machine() was that
> > > it's no overhead for the fast-path.
> > 
> > More than that, you don't even have to think about it. It's a one off
> > event that changes execution paths. I actually never thought about
> > putting the lock mechanism around the caller (that does prevent the issue
> > I was dreading of being inside the callback as it changed), it is still
> > magic that has nothing to do with the code flow. What variable should we
> > document as being rcu protected, (*engine->submit_request)()?
> 
> Yeah, engine->submit_request would be the one, except it shouldn't use
> rcu_assign_pointer/rcu_dereference since we don't need those barriers,
> ever - the .text doesn't change after all. That's why I didn't splatter
> the sparse annotations all over it. In a way we don't need the data
> barriers of rcu, but really only the "everyone has passed the critical
> section now" part of it.
> 
> > I'm definitely not sold on having set-wedge dictate terms to the rest of
> > the code.
> 
> Well, it sounds like tglx would very much prefer we clean this locking
> inversion up on our side and not get them to break the cycle. Not what I
> wanted Thomas to look at, but since I botched it and attached the wrong
> splat we got the answer for this patch here, and it seems to be "don't do
> that".

The analogy is with kernel live-patching (which is what we are doing
here). Should every module in the kernel manually markup itself to
support the esoteric feature, or should the feature support the wider
kernel?
-Chris


More information about the Intel-gfx mailing list