[Intel-xe] [PATCH] drm/xe: Use Xe BO pin / unpin functions for FB pin

Matthew Auld matthew.william.auld at gmail.com
Thu Apr 6 10:22:31 UTC 2023


On Thu, 6 Apr 2023 at 10:32, Christian König <christian.koenig at amd.com> wrote:
>
> Am 05.04.23 um 19:04 schrieb Matthew Auld:
> > On Wed, 5 Apr 2023 at 15:28, Matthew Brost <matthew.brost at intel.com> wrote:
> >> On Wed, Apr 05, 2023 at 02:14:22PM +0000, Matthew Brost wrote:
> >>> On Wed, Apr 05, 2023 at 02:19:35PM +0100, Matthew Auld wrote:
> >>>> On Wed, 5 Apr 2023 at 05:19, Matthew Brost <matthew.brost at intel.com> wrote:
> >>>>> Need to use the Xe BO pin / unpin functions for suspend / resume to work properly.
> >>>>> Part of this fix is to allow xe_bo_pin() to be called more than once.
> >>>>>
> >>>>> Should fix:
> >>>>> https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/244
> >>>>>
> >>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
> >>>> Change makes sense to me, but I think we also need to update xe_bo_unpin()?
> >>>>
> >>> Yep, I thought I had coded that correctly, but I was looking at xe_bo_unpin_external().
> >>>
> >>>> But I must be misunderstanding something here. AFAICT we can have
> >>>> various pinned VRAM objects, and xe_bo_pin() still adds them to
> >>>> pinned.kernel_bo_present, where during suspend we evict everything on
> >>>> that list to system memory. But as these are all still pinned, suspend
> >>>> is still going to fail with this patch, given the recent TTM changes.
> >>> Which recent TTM changes are you referring to? This all worked at one point in time.
> >>>
> >> Oh, I see:
> >> git format-patch -1 f87c1f0b7b79b
> >>
> >> I would've NAK'd that one as it broke us. What do you think is the best
> >> way to move forward on this one? I like the idea of using TTM to move
> >> the objects...
> >>
> >> I know we could just hack around this and set pin_count=0 before calling
> >> validate, then restore it afterwards. A horrid hack though?
> >>
> >> Revert the above patch + post upstream?
> > Christian, can you share your thoughts here?
> >
> > This is regarding the TTM change: f87c1f0b7b79 ("drm/ttm: prevent
> > moving of pinned BOs"). To quickly summarize, it looks like Xe was
> > relying on being able to move pinned BOs for suspend-resume, where the
> > driver has some special kernel-internal VRAM buffers that are
> > perma-pinned, but need to be moved to system memory during suspend and
> > then copied back when resuming. This was all handled by just doing
> > something like ttm_bo_validate(PL_SYSTEM), without needing any special
> > handling. But this of course no longer works.
>
> Yeah, exactly that's what this check here should prevent :)
>
> > So wondering if it makes any sense to move the decision back into the
> > driver, for whether to treat moving pinned objects as an error or not,
> > given the above use case with suspend-resume in Xe?
>
> Well pinning a BO means that it's fixed and can't move any more.
>
> The background is that the physical location (domain, address whatever)
> for this BO is used somewhere in the hardware and you need to notify the
> hardware if that ever changes. Typical applications are scanout,
> firmware buffers etc....

Yeah, it looks like Xe already makes sure to save the [fpfn, lpfn] to
ensure it gets moved back to the exact same spot in VRAM (all
perma-pinned kernel stuff is allocated CONTIG), in case something was
referencing the address. And since this is suspend-resume I guess
we're pretty sure nothing is actively using the object.
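
For illustration, roughly the bookkeeping I have in mind (all names here
are made up, and XE_PL_VRAM0 / the CONTIG flag are assumptions rather than
what Xe actually does):

struct xe_bo_vram_range {
	u32 fpfn, lpfn;	/* page-frame range the BO occupied before eviction */
};

static void save_vram_range(struct xe_bo *bo, struct xe_bo_vram_range *r)
{
	r->fpfn = bo->ttm.resource->start;
	r->lpfn = r->fpfn + (bo->ttm.base.size >> PAGE_SHIFT);
}

/* On restore, build a placement that forces validation back into exactly
 * that range, so anything holding the old VRAM address stays valid. */
static void fixed_vram_place(const struct xe_bo_vram_range *r,
			     struct ttm_place *place)
{
	*place = (struct ttm_place) {
		.fpfn = r->fpfn,
		.lpfn = r->lpfn,
		.mem_type = XE_PL_VRAM0,	/* assumed placement id */
		.flags = TTM_PL_FLAG_CONTIGUOUS,
	};
}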

>
> Now if you still want to evacuate those buffers during suspend/resume
> you basically have three options:
>
> 1. Manually make sure inside your driver that the location doesn't
> change during evacuation/moving it back in again.
>      You can easily allocate additional backing store for your BOs and
> then copy the content back and forth manually, similar to what
> ttm_bo_validate() would do.
>      This wouldn't trigger the warning above and should work as well,
> see functions ttm_bo_mem_space() and ttm_bo_move_null().

i915 is pretty much doing something similar. It just feels a bit like
re-implementing ttm_bo_validate(), so I was wondering if that could be
avoided.
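
Something like the following is roughly the copy half of option 1 (just a
sketch: the xe_bo_evac_* name is made up, it skips the
ttm_bo_mem_space()/ttm_bo_move_null() resource juggling as well as the
resume-side copy back, and the caller is assumed to hold the BO
reservation):

static int xe_bo_evac_to_shadow(struct xe_bo *bo, void **shadow)
{
	struct iosys_map map;
	size_t size = bo->ttm.base.size;
	int ret;

	*shadow = kvmalloc(size, GFP_KERNEL);
	if (!*shadow)
		return -ENOMEM;

	ret = ttm_bo_vmap(&bo->ttm, &map);
	if (ret) {
		kvfree(*shadow);
		return ret;
	}

	/* VRAM is normally an io mapping, hence the _fromio variant. */
	memcpy_fromio(*shadow, map.vaddr_iomem, size);
	ttm_bo_vunmap(&bo->ttm, &map);
	return 0;
}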

>
> 2. Unpin them on suspend, re-pin them on resume.
>      This is what we do for example for scanout. Re-pinning then usually
> means that you also re-program the hw with the new location information.

Yeah, something along these lines is probably the simplest. Otherwise I
guess we just need to do the proper solution with option 1.
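
For option 2 I imagine something like this, mirroring what the patch
already does in __xe_pin_fb_vma() (the xe_fb_suspend/resume_vma() names
are just for illustration):

static int xe_fb_suspend_vma(struct xe_bo *bo)
{
	int ret;

	ret = ttm_bo_reserve(&bo->ttm, false, false, NULL);
	if (ret)
		return ret;
	xe_bo_unpin(bo);	/* BO can now be evicted like any other */
	ttm_bo_unreserve(&bo->ttm);
	return 0;
}

static int xe_fb_resume_vma(struct xe_bo *bo)
{
	int ret;

	ret = ttm_bo_reserve(&bo->ttm, false, false, NULL);
	if (ret)
		return ret;
	ret = xe_bo_validate(bo, NULL, true);
	if (!ret)
		ret = xe_bo_pin(bo);
	ttm_bo_unreserve(&bo->ttm);

	/* Caller would then re-program the hw (e.g. the scanout address)
	 * with the BO's possibly new location. */
	return ret;
}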

>
> 3. Stop caring about evacuation during suspend.
>      We actually started to do this for a bunch of the read-only
> firmware buffers and just re-load the firmware after resume.
>      Has some pitfalls as well (if for example the fw changes), but
> generally seems to work quite fine for a lot of use cases.
>
> If you want to keep moving the BOs just using approach #1 should work
> for you, but approach #2 or #3 is probably cleaner from a high level
> perspective.

Thanks for taking a look Christian.

>
> Regards,
> Christian.
>
> >
> >> BTW, this patch is needed regardless, along with the fix in xe_bo_unpin().
> >>
> >> Matt
> >>
> >>> Matt
> >>>
> >>>> What am I missing?
> >>>>
> >>>>> ---
> >>>>>   drivers/gpu/drm/xe/display/xe_fb_pin.c | 4 ++--
> >>>>>   drivers/gpu/drm/xe/xe_bo.c             | 5 +++--
> >>>>>   2 files changed, 5 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/display/xe_fb_pin.c b/drivers/gpu/drm/xe/display/xe_fb_pin.c
> >>>>> index 65c0bc28a3d1..c7c849be288f 100644
> >>>>> --- a/drivers/gpu/drm/xe/display/xe_fb_pin.c
> >>>>> +++ b/drivers/gpu/drm/xe/display/xe_fb_pin.c
> >>>>> @@ -205,7 +205,7 @@ static struct i915_vma *__xe_pin_fb_vma(struct intel_framebuffer *fb,
> >>>>>
> >>>>>          ret = xe_bo_validate(bo, NULL, true);
> >>>>>          if (!ret)
> >>>>> -               ttm_bo_pin(&bo->ttm);
> >>>>> +               xe_bo_pin(bo);
> >>>>>          ttm_bo_unreserve(&bo->ttm);
> >>>>>          if (ret)
> >>>>>                  goto err;
> >>>>> @@ -222,7 +222,7 @@ static struct i915_vma *__xe_pin_fb_vma(struct intel_framebuffer *fb,
> >>>>>
> >>>>>   err_unpin:
> >>>>>          ttm_bo_reserve(&bo->ttm, false, false, NULL);
> >>>>> -       ttm_bo_unpin(&bo->ttm);
> >>>>> +       xe_bo_unpin(bo);
> >>>>>          ttm_bo_unreserve(&bo->ttm);
> >>>>>   err:
> >>>>>          kfree(vma);
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> >>>>> index 3a482c61c3ec..25c696bd380c 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
> >>>>> @@ -1278,8 +1278,8 @@ int xe_bo_pin(struct xe_bo *bo)
> >>>>>           */
> >>>>>          XE_BUG_ON(bo->ttm.base.import_attach);
> >>>>>
> >>>>> -       /* We only expect at most 1 pin */
> >>>>> -       XE_BUG_ON(xe_bo_is_pinned(bo));
> >>>>> +       if (xe_bo_is_pinned(bo))
> >>>>> +               goto pin;
> >>>>>
> >>>>>          err = xe_bo_validate(bo, NULL, false);
> >>>>>          if (err)
> >>>>> @@ -1308,6 +1308,7 @@ int xe_bo_pin(struct xe_bo *bo)
> >>>>>                  }
> >>>>>          }
> >>>>>
> >>>>> +pin:
> >>>>>          ttm_bo_pin(&bo->ttm);
> >>>>>
> >>>>>          /*
> >>>>> --
> >>>>> 2.34.1
> >>>>>
>

