[Intel-gfx] [PATCH 5/7] drm/i915/gem/ttm: Respect the objection region in placement_from_obj

Fri Jul 16 16:00:15 UTC 2021

On Fri, 16 Jul 2021 at 16:52, Matthew Auld
<matthew.william.auld at gmail.com> wrote:
>
> On Fri, 16 Jul 2021 at 15:10, Jason Ekstrand <jason at jlekstrand.net> wrote:
> >
> > On Fri, Jul 16, 2021 at 8:54 AM Matthew Auld
> > <matthew.william.auld at gmail.com> wrote:
> > >
> > > On Thu, 15 Jul 2021 at 23:39, Jason Ekstrand <jason at jlekstrand.net> wrote:
> > > >
> > > > Whenever we had a user object (n_placements > 0), we were ignoring
> > > > obj->mm.region and always putting obj->placements[0] as the requested
> > > > region.  For LMEM+SMEM objects, this was causing them to get shoved into
> > > > LMEM on every i915_ttm_get_pages() even when SMEM was requested by, say,
> > > > i915_gem_object_migrate().
> > >
> > > i915_ttm_migrate calls i915_ttm_place_from_region() directly with the
> > > requested region, so there shouldn't be an issue with migration right?
> > > Do you have some more details?
> >
> > With i915_ttm_migrate directly, no.  But, in the last patch in the
> > series, we're trying to migrate LMEM+SMEM buffers into SMEM on
> > attach() and pin it there.  This blows up in a very unexpected (IMO)
> > way.  The flow goes something like this:
> >
> >  - Client attempts a dma-buf import from another device
> >  - In attach() we call i915_gem_object_migrate() which calls
> > i915_ttm_migrate() which migrates as requested.
> >  - Once the migration is complete, we call i915_gem_object_pin_pages()
> > which calls i915_ttm_get_pages() which depends on
> > i915_ttm_placement_from_obj() and so migrates it right back to LMEM.
>
> The mm.pages must be NULL here, otherwise it would just increment the
> pages_pin_count?
>
> >
> > Maybe the problem here is actually that our TTM code isn't respecting
> > obj->mm.pages_pin_count?
>
> I think if the resource is moved, we always nuke the mm.pages after
> being notified of the move. Also TTM is also not allowed to move
> pinned buffers.
>
> I guess if we are evicted/swapped, so assuming we are not holding the
> object lock, and it's not pinned, the future call to get_pages() will
> see mm.pages = NULL, even though the ttm_resource is still there, and
> because we prioritise the placements[0], instead of mm.region we end
> up moving it for no good reason. But in your case you are holding the
> lock, or it's pinned? Also is this just with the selftest, or
> something real?

Or at least in the selftest I see ____i915_gem_object_get_pages()
which doesn't even consider the mm.pages AFAIK.

>
> >
> > In case you can't tell, I really have no clue what I'm doing here.
> > I'm really stumbling around in the dark finding things that make my
> > bug go away.  I'm happy for the feedback.
> >
> > --Jason
> >
> > >
> > > >
> > > > Signed-off-by: Jason Ekstrand <jason at jlekstrand.net>
> > > > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > > > ---
> > > >  drivers/gpu/drm/i915/gem/i915_gem_ttm.c | 3 +--
> > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > index d30f274c329c7..5985e994d56cf 100644
> > > > --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
> > > > @@ -150,8 +150,7 @@ i915_ttm_placement_from_obj(const struct drm_i915_gem_object *obj,
> > > >         unsigned int i;
> > > >
> > > >         placement->num_placement = 1;
> > > > -       i915_ttm_place_from_region(num_allowed ? obj->mm.placements[0] :
> > > > -                                  obj->mm.region, requested, flags);
> > > > +       i915_ttm_place_from_region(obj->mm.region, requested, flags);
> > > >
> > > >         /* Cache this on object? */
> > > >         placement->num_busy_placement = num_allowed;
> > > > --
> > > > 2.31.1
> > > >