[Intel-gfx] [PATCH 2/2] RFC: drm/i915: Switch obj->mm.lock lockdep annotations on its head

Daniel Vetter daniel.vetter at ffwll.ch
Wed Aug 14 14:48:46 UTC 2019


On Wed, Aug 14, 2019 at 3:06 PM Chris Wilson <chris at chris-wilson.co.uk> wrote:
>
> Quoting Daniel Vetter (2019-08-14 13:49:33)
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > index d474c6ac4100..1ea3c3c96a5a 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
> > @@ -157,7 +157,15 @@ struct drm_i915_gem_object {
> >         unsigned int pin_global;
> >
> >         struct {
> > -               struct mutex lock; /* protects the pages and their use */
> > +               /*
> > +                * Protects the pages and their use.
> > +                *
> > +                * IMPORTANT: It is not allowed to allocate memory while holding
> > +                * this lock, because the shrinker might recurse on it, except
> > +                * when there are no pages allocated yet and the object isn't
> > +                * visible on any LRU.
>
> It's not meant to be a public free-for-all lock, just to guard the
> transition between 0<->1. Inside that transition we do page allocations.
>
> Everyone else takes a pin.

Well yeah the comment is missing a lot of things.
>
> > +                */
> > +               struct mutex lock;
> >                 atomic_t pages_pin_count;
> >
> >                 struct sg_table *pages;
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> > index 18f0ce0135c1..3b7ec6e6ea8b 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
> > @@ -101,7 +101,7 @@ int __i915_gem_object_get_pages(struct drm_i915_gem_object *obj)
>
> Fwiw, we have use cases (and people asking where those patches are) for
> nested allocations.

Yeah, and it's those patches I'm freaking out about. Imo the current
annotations for obj->mm.lock just annotate the call-chain nesting,
which trivially shuts up lockdep but also guarantees you're not going
to find real deadlocks before they happen (see the sketch after the
list below). There needs to be a very clear proof attached to why a
nesting annotation is correct. Random selection of examples:
- block partitions nesting in their overall device, where you never
take them the other way round
- the struct_mutex nesting, which works because there's only ever one
struct_mutex in the entire system. As soon as there are two, we need a
new reason (and that's the point of patch 1 here).
- obj->mm.lock, which nests both ways with fs_reclaim, but there's a
clear difference between pages_pin_count going 0->1 (allocation
allowed) and 1->0 (can be called from shrinker context, or anything
else that needs to free memory), which is what this series tries to
make a bit clearer.
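
To make the "trivially shuts up lockdep" point concrete, here's a
minimal sketch (hypothetical locks, not actual i915 code):
mutex_lock_nested() files the acquisition under a separate lockdep
subclass, so the two orderings below never form a cycle in lockdep's
dependency graph, even though at runtime it's still the same mutex and
the AB-BA inversion can still deadlock:

#include <linux/mutex.h>

static DEFINE_MUTEX(lock_a);
static DEFINE_MUTEX(lock_b);

static void thread_one(void)
{
	mutex_lock(&lock_a);
	/* lockdep now tracks this as lock_b, subclass 1, a different
	 * class from the plain mutex_lock(&lock_b) in thread_two() ... */
	mutex_lock_nested(&lock_b, SINGLE_DEPTH_NESTING);
	mutex_unlock(&lock_b);
	mutex_unlock(&lock_a);
}

static void thread_two(void)
{
	mutex_lock(&lock_b);
	mutex_lock(&lock_a); /* ... but this inversion still deadlocks. */
	mutex_unlock(&lock_a);
	mutex_unlock(&lock_b);
}

The annotation is a promise by the programmer that the inversion can't
happen; lockdep just believes it, hence the need for an actual proof.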

Just noting that the shrinker can generally nest within normal code,
and annotating that nesting like that doesn't really prove anything.
It also allows some fun stuff, like someone allocating memory from a
put_pages call, which I expect will lead to some fun later on. Just
the usual paranoia.
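
For the put_pages case specifically there's a way to get lockdep to
catch it early: prime the dependency once at init, same idea as the
existing i915_gem_shrinker_taints_mutex() we use for struct_mutex.
Rough sketch of the idea (hypothetical helper name):

#include <linux/gfp.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>

static void taint_mutex_with_reclaim(struct mutex *mutex)
{
	/* Pretend to be in direct reclaim and take the mutex once, so
	 * lockdep records fs_reclaim -> mutex. Any later GFP_KERNEL
	 * allocation while holding the mutex records the reverse edge
	 * and splats immediately, instead of only when the shrinker
	 * actually recurses under memory pressure. */
	fs_reclaim_acquire(GFP_KERNEL);
	mutex_lock(mutex);
	mutex_unlock(mutex);
	fs_reclaim_release(GFP_KERNEL);
}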

And I'm not sure at all whether the internal patches floating around,
which have a lot more nesting, actually work. There's no really solid
explanation attached to them, and I haven't figured one out either
(unlike for the two cases we already have upstream, with struct_mutex
and obj->mm.lock).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

