[Intel-xe] [PATCH] drm/xe/vm: uncombine the combined_links.destroy
Matthew Brost
matthew.brost at intel.com
Mon Aug 7 18:29:28 UTC 2023
On Mon, Aug 07, 2023 at 05:51:22PM +0100, Matthew Auld wrote:
> On 07/08/2023 17:16, Matthew Brost wrote:
> > On Mon, Aug 07, 2023 at 04:19:46PM +0100, Matthew Auld wrote:
> > > On 07/08/2023 15:49, Matthew Brost wrote:
> > > > On Mon, Aug 07, 2023 at 03:41:30PM +0100, Matthew Auld wrote:
> > > > > On 07/08/2023 14:28, Matthew Brost wrote:
> > > > > > On Mon, Aug 07, 2023 at 02:02:27PM +0100, Matthew Auld wrote:
> > > > > > > It looks like it's plausible for eviction to be happening as we are
> > > > > > > closing the vm, however the rebind_link and destroy links are in this
> > > > > > > case not mutually exclusive, since eviction only needs the vm dma-resv
> > > > > > > lock which the vm close drops when accessing the contested list. If we
> > > > > > > race with eviction here it can run rampant and corrupt the list since
> > > > > > > both the destroy and rebind links are the same list entry underneath.
> > > > > > > Simplest is to just split these entries into separate links.
> > > > > > >
> > > > > >
> > > > > > I think there is another problem here too, I think
> > > > > > vm->notifier.rebind_list could be corrupted too.
> > > > >
> > > > > AFAICT that one always has vm->lock held, and close_and_put() doesn't
> > > > > release it until nuking the vma, so I think it's fine?
> > > > >
> > > >
> > > > External BOs do not hold the VM->lock in xe_bo_trigger_rebind so I do
> > > > think there is a potential issue.
> > >
> > > Maybe I'm misunderstanding something as I only see it being added to
> > > vm->notifier.rebind_list, which doesn't involve the combined_links. Once it
> > > can fully lock the vm later, when for example doing xe_vm_lock_dma_resv(),
> > > it moves it to the real vm->rebind_list but at that point we are holding the
> > > vm->lock, and if the vma was since nuked we never actually transfer it.
> > >
> >
> > It is not related to the combined list, rather a race between
> > close_and_put & trigger_rebind.
> >
> > close_and_put can remove from the vm->notifier.rebind_list, then
> > trigger_rebind can add it back in, VMA is destroyed, and now we have
> > destroyed (corrupt memory) on the vm->notifier.rebind_list.
> >
> > Make sense?
>
> I'm still not seeing it tbh. AFAICT trigger_rebind/eviction needs the bo
> lock and so does destroying the vma (if applicable). If vma destruction runs
> first then the vma will be gone from the list of vmas for that object, so
> eviction never sees it. If the eviction ran first then vma destruction will
> simply remove it from the list. So the actual vma destroy vs trigger_rebind
> looks to be serialised to me.
>
Yea I guess upon further inspection I think the vm->notifier.rebind_list
is safe. That being said, protecting against it being destroyed in the
eviction path is not a terrible idea either.
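Such a guard could follow the shape suggested earlier in the thread, i.e. check XE_VMA_DESTROYED in xe_bo_trigger_rebind before adding to the list. Untested pseudocode sketch; the lock name is illustrative, not verified against the tree:

	/* pseudocode: skip VMAs already marked for destruction */
	spin_lock(&vm->notifier.list_lock);
	if (!(vma->gpuva.flags & XE_VMA_DESTROYED) &&
	    list_empty(&vma->notifier.rebind_link))
		list_add_tail(&vma->notifier.rebind_link,
			      &vm->notifier.rebind_list);
	spin_unlock(&vm->notifier.list_lock);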
Matt
> >
> > Matt
> >
> > > >
> > > > Matt
> > > >
> > > > > >
> > > > > > I think a better solution is to check the XE_VMA_DESTROYED bit in
> > > > > > xe_bo_trigger_rebind before adding to either the
> > > > > > vm->notifier.rebind_list or vm->rebind_list. Both checks should be
> > > > > > done under the list locks.
> > > > >
> > > > > Ok, let me try that. Thanks for reviewing.
> > > > >
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/514
> > > > > > > Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> > > > > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > > > > ---
> > > > > > > drivers/gpu/drm/xe/xe_vm.c | 10 +++++-----
> > > > > > > drivers/gpu/drm/xe/xe_vm_types.h | 11 ++++++-----
> > > > > > > 2 files changed, 11 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > index cb28dbc2bdbb..aa13b2bcda86 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > > > > > > @@ -891,6 +891,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
> > > > > > > }
> > > > > > > INIT_LIST_HEAD(&vma->combined_links.rebind);
> > > > > > > + INIT_LIST_HEAD(&vma->destroy_link);
> > > > > > > INIT_LIST_HEAD(&vma->notifier.rebind_link);
> > > > > > > INIT_LIST_HEAD(&vma->extobj.link);
> > > > > > > @@ -1063,7 +1064,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
> > > > > > > struct xe_vm *vm = xe_vma_vm(vma);
> > > > > > > lockdep_assert_held_write(&vm->lock);
> > > > > > > - XE_WARN_ON(!list_empty(&vma->combined_links.destroy));
> > > > > > > + XE_WARN_ON(!list_empty(&vma->destroy_link));
> > > > > > > if (xe_vma_is_userptr(vma)) {
> > > > > > > XE_WARN_ON(!(vma->gpuva.flags & XE_VMA_DESTROYED));
> > > > > > > @@ -1449,7 +1450,7 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> > > > > > > continue;
> > > > > > > }
> > > > > > > - list_move_tail(&vma->combined_links.destroy, &contested);
> > > > > > > + list_add_tail(&vma->destroy_link, &contested);
> > > > > > > }
> > > > > > > /*
> > > > > > > @@ -1477,9 +1478,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
> > > > > > > * Since we hold a refcount to the bo, we can remove and free
> > > > > > > * the members safely without locking.
> > > > > > > */
> > > > > > > - list_for_each_entry_safe(vma, next_vma, &contested,
> > > > > > > - combined_links.destroy) {
> > > > > > > - list_del_init(&vma->combined_links.destroy);
> > > > > > > + list_for_each_entry_safe(vma, next_vma, &contested, destroy_link) {
> > > > > > > + list_del_init(&vma->destroy_link);
> > > > > > > xe_vma_destroy_unlocked(vma);
> > > > > > > }
> > > > > > > diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > > > > index 3681a5ff588b..a01dde418803 100644
> > > > > > > --- a/drivers/gpu/drm/xe/xe_vm_types.h
> > > > > > > +++ b/drivers/gpu/drm/xe/xe_vm_types.h
> > > > > > > @@ -79,13 +79,14 @@ struct xe_vma {
> > > > > > > * mutually exclusive execution with userptr.
> > > > > > > */
> > > > > > > struct list_head rebind;
> > > > > > > - /**
> > > > > > > - * @destroy: link to contested list when VM is being closed.
> > > > > > > - * Protected by vm->lock in write mode and vm's resv lock.
> > > > > > > - */
> > > > > > > - struct list_head destroy;
> > > > > > > } combined_links;
> > > > > > > + /**
> > > > > > > + * @destroy_link: link to contested list when VM is being closed.
> > > > > > > + * Protected by vm->lock in write mode and vm's resv lock.
> > > > > > > + */
> > > > > > > + struct list_head destroy_link;
> > > > > > > +
> > > > > > > union {
> > > > > > > /** @destroy_cb: callback to destroy VMA when unbind job is done */
> > > > > > > struct dma_fence_cb destroy_cb;
> > > > > > > --
> > > > > > > 2.41.0
> > > > > > >