[Intel-gfx] [PATCH] drm/i915: Use rcu instead of stop_machine

Fri Oct 6 08:56:35 UTC 2017

On Fri, Oct 6, 2017 at 10:42 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-10-05 17:12:35)
>> On Thu, Oct 05, 2017 at 03:30:12PM +0100, Chris Wilson wrote:
>> > Quoting Daniel Vetter (2017-10-05 15:09:48)
>> > > stop_machine is not really a locking primitive we should use, except
>> > > when the hw folks tell us the hw is broken and that's the only way to
>> > > work around it.
>> > >
>> > > This patch here is just a suggestion for how to fix it up, possible
>> > > changes needed to make it actually work:
>> > >
>> > > - Set the nop_submit_request first for _all_ engines, before
>> > >   proceeding.
>> > >
>> > > - Make sure engine->cancel_requests copes with the possibility that
>> > >   not all tests have consistently used the new or old version. I dont
>> > >   think this is a problem, since the same can happen really with the
>> > >   stop_machine() locking - stop_machine also doesn't give you any kind
>> > >   of global ordering against other cpu threads, it just makes them
>> > >   stop.
>> > >
>> > > This patch tries to address the locking snafu from
>> >
>> > There's a locking snafu in the code?
>> >
>> > > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
>> > > Author: Chris Wilson <chris at chris-wilson.co.uk>
>> > > Date:   Tue Nov 22 14:41:21 2016 +0000
>> > >
>> > >     drm/i915: Stop the machine as we install the wedged submit_request handler
>> > >
>> > > Chris said parts of the reasons for going with stop_machine() was that
>> > > it's no overhead for the fast-path.
>> >
>> > More than that, you don't even have to think about it. It's a one off
>> > event that changes execution paths. I actually never thought about
>> > putting the lock mechanism around the caller (that does prevent the issue
>> > I was dreading of being inside the callback as it changed), it is still
>> > magic that has nothing to do with the code flow. What variable should we
>> > document as being rcu protected, (*engine->submit_request)()?
>>
>> Yeah, engine->submit_request would be the one, except it shouldn't use
>> rcu_assign_pointer/rcu_dereference since we don't need those barriers,
>> ever - the .text doesn't change after all. That's why I didn't splatter
>> the sparse annotations all over it. In a way we don't need the data
>> barriers of rcu, but really only the "everyone has passed the critical
>> section now" part of it.
>>
>> > I'm definitely not sold on having set-wedge dictate terms to the rest of
>> > the code.
>>
>> Well, it sounds like tglx would very much prefer we clean this locking
>> inversion up on our side and not get them to break the cycle. Not what I
>> wanted Thomas to look at, but since I botched it and attached the wrong
>> splat we got the answer for this patch here, and it seems to be "don't do
>> that".
>
> The analogy is with kernel live-patching (which is what we are doing
> here). Should every module in the kernel manually markup itself to
> support the esoteric feature, or should the feature support the wider
> kernel?

This is completely different from kernel live patching, where the
actual instructions get changed while we execute them. You'd need to
annotate every single IP load, which is impossible. This here seems to
be much more a bog standard critical section, restrict to a very small
piece of code (the hw submit functions essentially), and if we can't
get that to work then our locking is seriously in bad shape.
stop_machine really isn't a valid locking mechanism, except in very
extreme cases where there's really no other solution.

If you disagree, then pls convince Thomas Gleixner that he needs to
fix this on his end.

The other bit is that lockdep is warn_once, so every week we have this
in our CI is a week with reduced test coverage. Even if Thomas fixes
this in the cpu hotplug code I expect it'll take longer than to get
some interim solution in here in i915, since it doesn't sound trivial
at all to fix this on the cpu hotplug side either.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch