[Intel-xe] [PATCH] drm/xe/vm: uncombine the combined_links.destroy

Matthew Auld matthew.auld at intel.com
Mon Aug 7 16:51:22 UTC 2023


On 07/08/2023 17:16, Matthew Brost wrote:
> On Mon, Aug 07, 2023 at 04:19:46PM +0100, Matthew Auld wrote:
>> On 07/08/2023 15:49, Matthew Brost wrote:
>>> On Mon, Aug 07, 2023 at 03:41:30PM +0100, Matthew Auld wrote:
>>>> On 07/08/2023 14:28, Matthew Brost wrote:
>>>>> On Mon, Aug 07, 2023 at 02:02:27PM +0100, Matthew Auld wrote:
>>>>>> It looks like it's plausible for eviction to be happening as we are
>>>>>> closing the vm. However, the rebind and destroy links are in this
>>>>>> case not mutually exclusive, since eviction only needs the vm dma-resv
>>>>>> lock, which the vm close drops when accessing the contested list. If we
>>>>>> race with eviction here it can run rampant and corrupt the list, since
>>>>>> both the destroy and rebind links are the same list entry underneath.
>>>>>> Simplest is to just split these entries into separate links.
>>>>>>
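>>>>>> To make that failure mode concrete, here is a stand-alone userspace
>>>>>> sketch (not the driver code; it only assumes, as "same list entry
>>>>>> underneath" implies, that the two links share storage like a union):
>>>>>>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> struct list_head { struct list_head *prev, *next; };
>>>>>>
>>>>>> /* insert n at the tail of the list headed by h */
>>>>>> static void list_add_tail(struct list_head *n, struct list_head *h)
>>>>>> {
>>>>>> 	n->prev = h->prev;
>>>>>> 	n->next = h;
>>>>>> 	h->prev->next = n;
>>>>>> 	h->prev = n;
>>>>>> }
>>>>>>
>>>>>> struct fake_vma {
>>>>>> 	union {			/* same storage for both links */
>>>>>> 		struct list_head rebind;
>>>>>> 		struct list_head destroy;
>>>>>> 	} combined_links;
>>>>>> };
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>> 	struct list_head contested = { &contested, &contested };
>>>>>> 	struct list_head rebind_list = { &rebind_list, &rebind_list };
>>>>>> 	struct fake_vma vma;
>>>>>>
>>>>>> 	/* vm close: put the vma on the contested (destroy) list */
>>>>>> 	list_add_tail(&vma.combined_links.destroy, &contested);
>>>>>>
>>>>>> 	/* racing eviction: link the *same* storage onto the rebind list */
>>>>>> 	list_add_tail(&vma.combined_links.rebind, &rebind_list);
>>>>>>
>>>>>> 	/* contested still points at the vma, but the vma's prev/next now
>>>>>> 	 * belong to rebind_list, so walking contested wanders off into
>>>>>> 	 * the wrong list: this prints 0 instead of 1 */
>>>>>> 	printf("%d\n", contested.next->next == &contested);
>>>>>> 	return 0;
>>>>>> }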
>>>>>
>>>>> I think there is another problem here too: I think
>>>>> vm->notifier.rebind_list could also be corrupted.
>>>>
>>>> AFAICT that one always has vm->lock held, and close_and_put() doesn't
>>>> release it until nuking the vma, so I think it's fine?
>>>>
>>>
>>> External BOs do not hold the vm->lock in xe_bo_trigger_rebind, so I do
>>> think there is a potential issue.
>>
>> Maybe I'm misunderstanding something, as I only see it being added to
>> vm->notifier.rebind_list, which doesn't involve the combined_links. Once we
>> can fully lock the vm later, for example when doing xe_vm_lock_dma_resv(),
>> it moves to the real vm->rebind_list, but at that point we are holding the
>> vm->lock, and if the vma has since been nuked we never actually transfer it.
>>
> 
> It is not related to the combined list, rather a race between
> close_and_put & trigger_rebind.
> 
> close_and_put can remove the vma from the vm->notifier.rebind_list, then
> trigger_rebind can add it back in, the VMA is destroyed, and now we have a
> destroyed VMA (corrupt memory) on the vm->notifier.rebind_list.
> 
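> A toy version of that scenario (plain userspace C, not the driver code):
>
> #include <stdlib.h>
>
> struct node { struct node *next; };
>
> int main(void)
> {
> 	struct node rebind_list = { .next = NULL };
> 	struct node *vma = malloc(sizeof(*vma));
>
> 	rebind_list.next = vma;	/* trigger_rebind re-adds the vma ... */
> 	free(vma);		/* ... then the vma is destroyed */
> 	/* rebind_list.next now dangles: the next walk of the list
> 	 * dereferences freed memory */
> 	return 0;
> }
>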
> Make sense?

I'm still not seeing it tbh. AFAICT trigger_rebind/eviction needs the bo 
lock and so does destroying the vma (if applicable). If vma destruction 
runs first then the vma will be gone from the list of vmas for that 
object, so eviction never sees it. If eviction runs first then vma 
destruction will simply remove it from the list. So the actual vma 
destroy vs trigger_rebind looks to be serialised to me.
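
To spell out that ordering, a toy model (nothing here is the real driver 
API; it only assumes both paths run entirely under the bo's lock, so 
whole-path ordering is all that matters):

#include <assert.h>
#include <stdbool.h>

struct toy_bo {
	bool vma_linked;	/* vma still on the bo's list of vmas */
	bool on_rebind_list;	/* vma sits on vm->notifier.rebind_list */
};

/* destroy path: runs under the bo lock */
static void toy_vma_destroy(struct toy_bo *bo)
{
	bo->vma_linked = false;		/* eviction can no longer find it */
	bo->on_rebind_list = false;	/* and any list entry is removed */
}

/* eviction path: runs under the bo lock, only touches linked vmas */
static void toy_trigger_rebind(struct toy_bo *bo)
{
	if (bo->vma_linked)
		bo->on_rebind_list = true;
}

int main(void)
{
	/* ordering 1: destroy first, eviction never sees the vma */
	struct toy_bo a = { .vma_linked = true };
	toy_vma_destroy(&a);
	toy_trigger_rebind(&a);
	assert(!a.on_rebind_list);

	/* ordering 2: eviction first, destroy removes it again */
	struct toy_bo b = { .vma_linked = true };
	toy_trigger_rebind(&b);
	toy_vma_destroy(&b);
	assert(!b.on_rebind_list);

	return 0;
}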

> 
> Matt
> 
>>>
>>> Matt
>>>
>>>>>
>>>>> I think a better solution is to check the XE_VMA_DESTROYED bit in
>>>>> xe_bo_trigger_rebind before adding to either the
>>>>> vm->notifier.rebind_list or vm->rebind_list. Both checks should be done
>>>>> under the list locks.
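>>>>>
>>>>> Roughly something of this shape inside xe_bo_trigger_rebind (just a
>>>>> sketch; the lock name below is an assumption, not the real code),
>>>>> and similarly before moving anything onto vm->rebind_list:
>>>>>
>>>>> 	spin_lock(&vm->notifier.list_lock);	/* assumed lock name */
>>>>> 	if (!(vma->gpuva.flags & XE_VMA_DESTROYED))
>>>>> 		list_move_tail(&vma->notifier.rebind_link,
>>>>> 			       &vm->notifier.rebind_list);
>>>>> 	spin_unlock(&vm->notifier.list_lock);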
>>>>
>>>> Ok, let me try that. Thanks for reviewing.
>>>>
>>>>>
>>>>> Matt
>>>>>
>>>>>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/514
>>>>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/xe/xe_vm.c       | 10 +++++-----
>>>>>>     drivers/gpu/drm/xe/xe_vm_types.h | 11 ++++++-----
>>>>>>     2 files changed, 11 insertions(+), 10 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> index cb28dbc2bdbb..aa13b2bcda86 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm.c
>>>>>> @@ -891,6 +891,7 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
>>>>>>     	}
>>>>>>     	INIT_LIST_HEAD(&vma->combined_links.rebind);
>>>>>> +	INIT_LIST_HEAD(&vma->destroy_link);
>>>>>>     	INIT_LIST_HEAD(&vma->notifier.rebind_link);
>>>>>>     	INIT_LIST_HEAD(&vma->extobj.link);
>>>>>> @@ -1063,7 +1064,7 @@ static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
>>>>>>     	struct xe_vm *vm = xe_vma_vm(vma);
>>>>>>     	lockdep_assert_held_write(&vm->lock);
>>>>>> -	XE_WARN_ON(!list_empty(&vma->combined_links.destroy));
>>>>>> +	XE_WARN_ON(!list_empty(&vma->destroy_link));
>>>>>>     	if (xe_vma_is_userptr(vma)) {
>>>>>>     		XE_WARN_ON(!(vma->gpuva.flags & XE_VMA_DESTROYED));
>>>>>> @@ -1449,7 +1450,7 @@ void xe_vm_close_and_put(struct xe_vm *vm)
>>>>>>     			continue;
>>>>>>     		}
>>>>>> -		list_move_tail(&vma->combined_links.destroy, &contested);
>>>>>> +		list_add_tail(&vma->destroy_link, &contested);
>>>>>>     	}
>>>>>>     	/*
>>>>>> @@ -1477,9 +1478,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
>>>>>>     	 * Since we hold a refcount to the bo, we can remove and free
>>>>>>     	 * the members safely without locking.
>>>>>>     	 */
>>>>>> -	list_for_each_entry_safe(vma, next_vma, &contested,
>>>>>> -				 combined_links.destroy) {
>>>>>> -		list_del_init(&vma->combined_links.destroy);
>>>>>> +	list_for_each_entry_safe(vma, next_vma, &contested, destroy_link) {
>>>>>> +		list_del_init(&vma->destroy_link);
>>>>>>     		xe_vma_destroy_unlocked(vma);
>>>>>>     	}
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> index 3681a5ff588b..a01dde418803 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> +++ b/drivers/gpu/drm/xe/xe_vm_types.h
>>>>>> @@ -79,13 +79,14 @@ struct xe_vma {
>>>>>>     		 * mutually exclusive execution with userptr.
>>>>>>     		 */
>>>>>>     		struct list_head rebind;
>>>>>> -		/**
>>>>>> -		 * @destroy: link to contested list when VM is being closed.
>>>>>> -		 * Protected by vm->lock in write mode and vm's resv lock.
>>>>>> -		 */
>>>>>> -		struct list_head destroy;
>>>>>>     	} combined_links;
>>>>>> +	/**
>>>>>> +	 * @destroy_link: link to contested list when VM is being closed.
>>>>>> +	 * Protected by vm->lock in write mode and vm's resv lock.
>>>>>> +	 */
>>>>>> +	struct list_head destroy_link;
>>>>>> +
>>>>>>     	union {
>>>>>>     		/** @destroy_cb: callback to destroy VMA when unbind job is done */
>>>>>>     		struct dma_fence_cb destroy_cb;
>>>>>> -- 
>>>>>> 2.41.0
>>>>>>

