[Intel-gfx] [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture

Tue Oct 13 04:47:31 PDT 2015

On Tue, Oct 13, 2015 at 01:37:32PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote:
> > On 09/10/2015 09:27, Daniel Vetter wrote:
> > >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> > >>Using safe list iterators alleviates the problem of unsynchronized driver list
> > >>manipulations while error state capture is in the process of traversing lists.
> > >>
> > >>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> > >>---
> > >>  drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
> > >>  1 file changed, 19 insertions(+), 19 deletions(-)
> > >>
> > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>index 2f04e4f..32c1799 100644
> > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
> > >>  static u32 capture_active_bo(struct drm_i915_error_buffer *err,
> > >>  			     int count, struct list_head *head)
> > >>  {
> > >>-	struct i915_vma *vma;
> > >>+	struct i915_vma *vma, *tmpvma;
> > >>  	int i = 0;
> > >>
> > >>-	list_for_each_entry(vma, head, mm_list) {
> > >>+	list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
> > >
> > >_safe isn't safe against anything, it's only safe against removal of the
> > >current list item done with a list_del/list_move from this thread. I.e.
> > >the only thing it does is have a temporary pointer to the next list
> > >element so if you delete the current one it won't fall over.
> > >
> > >It doesn't protect against anything else, and especially not against other
> > >threads totally stomping the list.
> > 
> > Absolutely! But it's good enough to make the test not fall over within an
> > hour of testing and instead allow it to run for 12+ hours during continuous
> > long-duration testing, which is what I need for the TDR validation.
> 
> Well it looks really suspicious, since the only thing this protects
> against is against deleting the element we look at right now. The only
> sensible theory I can come up with is that this slightly helps when
> starting the list walk, in case someone comes around and deletes the first
> element in the list from the retire worker. Since the retire worker tends
> to do more work per list item than we do that's enough to make the race
> really unlikely. But it's still just duct-tape.
> 
> > >So we need proper protection for these lists, independent of
> > >dev->struct_mutex. The least invasive way to do that is probably rcu,
> > >which only requires us that we release the memory for any object sitting
> > >on these lists with kfree_rcu instead of normal kfree. Another option
> > >might be a spinlock, but that would be a lot more invasive, and Chris
> > >won't like the execbuf overhead this causes ;-)
> > >-Daniel
> > 
> > Awesome! Let's go with that then.
> 
> See our massive irc discussion ... it's more tricky :(
> 
> In short rcu works perfectly if you only have the following lifetime
> sequence:
> - kmalloc object
> - add to list
> - remove from list (eventually)
> - free object
> 
> If you readd an object to a list, or even worse, move it then the rcu list
> helpers won't work. What could happen on the read side is:
> 
> - you walk the rcu list, keeping track of the head element to know when to
>   stop
> - while looking at that list one of the yet untraversed items gets moved
>   to a new list
> 
> -> you'll traverse the new list forever since that one will never hit the
> head element.
> 
> Not a problem for requests/vmas, but a problem for some of the object lists.
> 
> Note that the problem isn't that we re-add the element (if that happens we
> might end up walking parts of the list twice or parts of it not at all),
> but moving to a different list where there's a different head element.
> 
> I haven't checked, but maybe we're lucky and all the object lists we're
> looking at will always have the same head element. Then I think it'll work
> out (albeit racy, but who cares about that for the error capture) and with
> the guarantee (when using rcu delayed freeing) that'll we'll never chase
> freed memory.

Ah, so you agree that stop_machine() is good enough.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre