[Intel-xe] [PATCH v2 08/31] drm/xe: VM LRU bulk move

Fri May 12 09:03:30 UTC 2023

On 5/11/23 16:11, Matthew Brost wrote:
> On Thu, May 11, 2023 at 09:24:05AM +0200, Thomas Hellström wrote:
>> On 5/10/23 20:40, Matthew Brost wrote:
>>> On Wed, May 10, 2023 at 10:14:12AM +0200, Thomas Hellström wrote:
>>>> On 5/10/23 00:05, Matthew Brost wrote:
>>>>> On Tue, May 09, 2023 at 02:47:54PM +0200, Thomas Hellström wrote:
>>>>>> On 5/2/23 02:17, Matthew Brost wrote:
>>>>>>> Use the TTM LRU bulk move for BOs tied to a VM. Update the bulk move's
>>>>>>> LRU position on every exec.
>>>>>>>
>>>>>>> Signed-off-by: Matthew Brost <matthew.brost at intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/xe/xe_bo.c       | 32 ++++++++++++++++++++++++++++----
>>>>>>>      drivers/gpu/drm/xe/xe_bo.h       |  4 ++--
>>>>>>>      drivers/gpu/drm/xe/xe_dma_buf.c  |  2 +-
>>>>>>>      drivers/gpu/drm/xe/xe_exec.c     |  6 ++++++
>>>>>>>      drivers/gpu/drm/xe/xe_vm_types.h |  3 +++
>>>>>>>      5 files changed, 40 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> index 3ab404e33fae..da99ee53e7d7 100644
>>>>>>> --- a/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> +++ b/drivers/gpu/drm/xe/xe_bo.c
>>>>>>> @@ -985,6 +985,23 @@ static void xe_gem_object_free(struct drm_gem_object *obj)
>>>>>>>      	ttm_bo_put(container_of(obj, struct ttm_buffer_object, base));
>>>>>>>      }
>>>>>>> +static void xe_gem_object_close(struct drm_gem_object *obj,
>>>>>>> +				struct drm_file *file_priv)
>>>>>>> +{
>>>>>>> +	struct xe_bo *bo = gem_to_xe_bo(obj);
>>>>>>> +
>>>>>>> +	if (bo->vm && !xe_vm_no_dma_fences(bo->vm)) {
>>>>>> Is there a reason we don't use bulk moves for LR vms? Admittedly bumping LRU
>>>>>> doesn't make much sense when we support user-space command buffer chaining,
>>>>>> but I think we should be doing it on exec at least, no?
>>>>> Maybe you could make the argument for compute VMs, the preempt worker in
>>>>> that case should probably do a bulk move. I can change this if desired.
>>>> Yes, please.
>>>>> For a fault VM it makes no sense as the fault handler updates the LRU
>>>>> for individual BOs.
>>>> Yes that makes sense.
>>>>>>> +		struct ww_acquire_ctx ww;
>>>>>>> +
>>>>>>> +		XE_BUG_ON(!xe_bo_is_user(bo));
>>>>>> Also why can't we use this for kernel objects as well? At some point we want
>>>>>> to get to evictable page-table objects? Could we do this in the
>>>>>> release_notify() callback to cover all potential bos?
>>>>>>
>>>>> xe_gem_object_close is a user call, right? We can't call this on kernel
>>>>> BOs. This also could be outside the if statement.
>>>> Hmm, yes the question was can we stop doing this in xe_gem_object_close()
>>>> and instead do it in release_notify() to cover also kernel objects. Since
>>>> release_notify() is called just after individualizing dma_resv, it makes
>>>> sense to individualize also LRU at that point?
>>>>
>>> If we ever support moving kernel BOs, then yes. We need to do a lot of
>>> work to get there, so I'd rather leave this where it is, but I'll add a
>>> comment indicating that if we want to support kernel BO eviction, this
>>> should be updated.
>>>
>>> Sound good?
>> Well, I can't see the motivation to have it in gem close? Are other drivers
>> doing that? Whether the object should be bulk moved or not is tied to
>> whether it's a vm private object or not and that is closely tied to whether
>> the reservation object is the vm resv or the object resv?
>>
> AMDGPU does via amdgpu_gem_object_close -> amdgpu_vm_bo_del, so yes.
>
> I also think I moved it here because, before release_notify() is called,
> I think there is an assert in TTM for the bulk move being NULL, let me
> find that.
>
> static void ttm_bo_release(struct kref *kref)
> {
>         struct ttm_buffer_object *bo =
>             container_of(kref, struct ttm_buffer_object, kref);
>         struct ttm_device *bdev = bo->bdev;
>         int ret;
>
>         WARN_ON_ONCE(bo->pin_count);
>         WARN_ON_ONCE(bo->bulk_move);

Ugh, that's unfortunate.

In any case, it looks like if a client has multiple handles to the
object, the close() callback will be called multiple times, and the
object removed from the bulk move on the first call, right?
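
To illustrate the concern, here is a minimal userspace sketch, not kernel
code. All toy_* names and fields are hypothetical stand-ins invented for
this example; they are not the DRM, TTM or xe structures. It assumes close()
runs once per handle and compares dropping the bulk move there with dropping
it once at free time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these are NOT the real DRM/TTM/xe types. */
struct toy_bulk_move {
	int nr_bos;			/* BOs currently tracked by this bulk move */
};

struct toy_bo {
	int handle_count;		/* open handles to this BO (assumed field) */
	struct toy_bulk_move *bulk;	/* non-NULL while part of a bulk move */
};

/* Attach the BO to a bulk move, or detach it when bulk is NULL. */
static void toy_bo_set_bulk(struct toy_bo *bo, struct toy_bulk_move *bulk)
{
	if (bo->bulk)
		bo->bulk->nr_bos--;
	bo->bulk = bulk;
	if (bulk)
		bulk->nr_bos++;
}

static bool toy_bo_in_bulk(const struct toy_bo *bo)
{
	return bo->bulk != NULL;
}

/* Models detaching in close(): runs once per handle being closed. */
static void toy_close_detach(struct toy_bo *bo)
{
	assert(bo->handle_count > 0);
	bo->handle_count--;
	toy_bo_set_bulk(bo, NULL);	/* detached on the very first close */
}

/* Models a close() that only drops the handle. */
static void toy_close_plain(struct toy_bo *bo)
{
	assert(bo->handle_count > 0);
	bo->handle_count--;
}

/* Models detaching at free time: runs once, after all handles are gone. */
static void toy_free_detach(struct toy_bo *bo)
{
	assert(bo->handle_count == 0);
	toy_bo_set_bulk(bo, NULL);
}
```

With two handles open, one call to toy_close_detach() already leaves the BO
outside the bulk move while a handle is still live, whereas the
toy_close_plain() plus toy_free_detach() path keeps it attached until the
end.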

The second-best option would then, I guess, be to have it in
xe_gem_object_free(), but the problem with that is the potentially
sleeping, uninterruptible object lock :(.

If we could have it in release_notify() we already have the object lock.

So can we have it in xe_gem_object_free() for now and later perhaps ping 
Christian about moving that WARN_ON_ONCE?

/Thomas

> Matt
>
>> /Thomas
>>
>>> Matt
>>>
>>>> /Thomas
>>>>
>>>>
>>>>> Matt
>>>>>
>>>>>> /Thomas
>>>>>>
>>>>>>