[Intel-gfx] [PATCH 2/8] drm/i915: Migrate to safe iterators in error state capture
Chris Wilson
chris at chris-wilson.co.uk
Tue Oct 13 04:47:31 PDT 2015
On Tue, Oct 13, 2015 at 01:37:32PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 12:40:51PM +0100, Tomas Elf wrote:
> > On 09/10/2015 09:27, Daniel Vetter wrote:
> > >On Thu, Oct 08, 2015 at 07:31:34PM +0100, Tomas Elf wrote:
> > >>Using safe list iterators alleviates the problem of unsynchronized driver list
> > >>manipulations while error state capture is in the process of traversing lists.
> > >>
> > >>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> > >>---
> > >> drivers/gpu/drm/i915/i915_gpu_error.c | 38 +++++++++++++++++------------------
> > >> 1 file changed, 19 insertions(+), 19 deletions(-)
> > >>
> > >>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>index 2f04e4f..32c1799 100644
> > >>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > >>@@ -718,10 +718,10 @@ static void capture_bo(struct drm_i915_error_buffer *err,
> > >> static u32 capture_active_bo(struct drm_i915_error_buffer *err,
> > >> int count, struct list_head *head)
> > >> {
> > >>- struct i915_vma *vma;
> > >>+ struct i915_vma *vma, *tmpvma;
> > >> int i = 0;
> > >>
> > >>- list_for_each_entry(vma, head, mm_list) {
> > >>+ list_for_each_entry_safe(vma, tmpvma, head, mm_list) {
> > >
> > >_safe isn't safe against anything, it's only safe against removal of the
> > >current list item done with a list_del/list_move from this thread. I.e.
> > >the only thing it does is have a temporary pointer to the next list
> > >element so if you delete the current one it won't fall over.
> > >
> > >It doesn't protect against anything else, and especially not against other
> > >threads totally stomping the list.
> >
> > Absolutely! But it's good enough to make the test not fall over within an
> > hour of testing and instead allow it to run for 12+ hours during continuous
> > long-duration testing, which is what I need for the TDR validation.
>
> Well it looks really suspicious, since the only thing this protects
> against is against deleting the element we look at right now. The only
> sensible theory I can come up with is that this slightly helps when
> starting the list walk, in case someone comes around and deletes the first
> element in the list from the retire worker. Since the retire worker tends
> to do more work per list item than we do that's enough to make the race
> really unlikely. But it's still just duct-tape.
>
> > >So we need proper protection for these lists, independent of
> > >dev->struct_mutex. The least invasive way to do that is probably rcu,
> > >which only requires us that we release the memory for any object sitting
> > >on these lists with kfree_rcu instead of normal kfree. Another option
> > >might be a spinlock, but that would be a lot more invasive, and Chris
> > >won't like the execbuf overhead this causes ;-)
> > >-Daniel
> >
> > Awesome! Let's go with that then.
>
> See our massive irc discussion ... it's more tricky :(
>
> In short rcu works perfectly if you only have the following lifetime
> sequence:
> - kmalloc object
> - add to list
> - remove from list (eventually)
> - free object
>
> If you readd an object to a list, or even worse, move it then the rcu list
> helpers won't work. What could happen on the read side is:
>
> - you walk the rcu list, keeping track of the head element to know when to
> stop
> - while looking at that list one of the yet untraversed items gets moved
> to a new list
>
> -> you'll traverse the new list forever since that one will never hit the
> head element.
>
> Not a problem for requests/vmas, but a problem for some of the object lists.
>
> Note that the problem isn't that we re-add the element (if that happens we
> might end up walking parts of the list twice or parts of it not at all),
> but moving to a different list where there's a different head element.
>
> I haven't checked, but maybe we're lucky and all the object lists we're
> looking at will always have the same head element. Then I think it'll work
> out (albeit racy, but who cares about that for the error capture) and with
> the guarantee (when using rcu delayed freeing) that'll we'll never chase
> freed memory.
Ah, so you agree that stop_machine() is good enough.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre
More information about the Intel-gfx
mailing list