[Intel-gfx] [PATCH 2/2] drm/i915: Use rcu instead of stop_machine in set_wedged

Chris Wilson chris at chris-wilson.co.uk
Tue Oct 10 13:29:02 UTC 2017


Quoting Chris Wilson (2017-10-10 10:21:45)
> Quoting Daniel Vetter (2017-10-09 17:44:01)
> > stop_machine is not really a locking primitive we should use, except
> > when the hw folks tell us the hw is broken and that's the only way to
> > work around it.
> > 
> > This patch tries to address the locking abuse of stop_machine() from
> > 
> > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > Author: Chris Wilson <chris at chris-wilson.co.uk>
> > Date:   Tue Nov 22 14:41:21 2016 +0000
> > 
> >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > 
> > Chris said parts of the reasons for going with stop_machine() was that
> > it's no overhead for the fast-path. But these callbacks use irqsave
> > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> > 
> > To stay as close as possible to the stop_machine semantics we first
> > update all the submit function pointers to the nop handler, then call
> > synchronize_rcu() to make sure no new requests can be submitted. This
> > should give us exactly the huge barrier we want.
> > 
> > I pondered whether we should annotate engine->submit_request as __rcu
> > and use rcu_assign_pointer and rcu_dereference on it. But the reason
> > behind those is to make sure the compiler/cpu barriers are there for
> > when you have an actual data structure you point at, to make sure all
> > the writes are seen correctly on the read side. But we just have a
> > function pointer, and .text isn't changed, so no need for these
> > barriers and hence no need for annotations.
> > 
> > Unfortunately there's a complication with the call to
> > intel_engine_init_global_seqno:
> 
> This is still broken in the same way as nop_submit_request may execute
> while you sleep, breaking cancel_requests.

My apologies, I misread the diff and didn't catch the removal of
init_seqno. From that pov, the execlists should be intact and I can't
see a way for a lack of -EIO being reported.

(That we have them is a topic for another day, as set-wedged is not
meant to be regarded as a normal operation, but user data loss.)

The residual panic I have is that we had to throw set-wedged vs unwedge
ordering out of the window; it used to be struct_mutex. Making
set-wedged prolonged only makes the window wider, it didn't create that
window, and at the moment it's carefully ordered inside the
i915_handle_error fallback. (We blissfully ignore debugfs.)

Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
-Chris


More information about the Intel-gfx mailing list