[PATCH v7 4/4] drm/amdgpu: track bo memory stats at runtime

Wed Nov 13 13:53:14 UTC 2024

[Public]

> From: Koenig, Christian <Christian.Koenig at amd.com>
> Sent: Wednesday, November 13, 2024 6:39
> Am 13.11.24 um 11:25 schrieb Tvrtko Ursulin:
> > On 13/11/2024 08:49, Christian König wrote:
> >> Am 12.11.24 um 19:16 schrieb Li, Yunxiang (Teddy):
> >>> [SNIP]
> >>>>> +   size = sign * amdgpu_bo_size(bo);
> >>>>> +   res = bo->tbo.resource;
> >>>>> +   type = res ? res->mem_type :
> >>>>> amdgpu_bo_get_preferred_placement(bo);
> >>>> Again, it's a clear NAK from my side to do stuff like that.
> >>>>
> >>>> When there isn't any backing store the BO should *not* be accounted
> >>>> to anything.
> >>> I don't have a preference either way, but I think it should be a
> >>> separate discussion to properly define what drm-total- means.
> >
> > Total must show the total size of all BOs which exist even if they
> > don't currently have a backing store. That's how drm-usage-stats.rst
> > defines the field and that is how all the other drivers work.
>
> In that case we should only look at the preferred placement and not the backing
> store at all.
>
> But that makes the total identical to the requested value, doesn't it?

Yes, the issue is not which BO needs to be counted but where they should be counted. This gets more complicated if we consider BOs to prefer multiple placements.

IMO it makes sense to have drm-total- to work like the legacy amd-requested- where we look at BO's preferred placement. For multiple preferred placements we say that the implementation needs to pick one of them to avoid double counting, but which one is up to the implementation as long as it's done in a consistent manner. Does that sound reasonable?

> >
> >>>>> +   type = res ? res->mem_type :
> >>>>> amdgpu_bo_get_preferred_placement(bo);
> >>>>> +   shared =
> >>>>> drm_gem_object_is_shared_for_memory_stats(&bo->tbo.base);
> >>>>> +
> >>>>> +   if (type >= __AMDGPU_PL_LAST)
> >>>>> +           return;
> >>>>> +
> >>>>> +   spin_lock(&vm->status_lock);
> >>>>> +
> >>>>> +   if (shared)
> >>>>> +           vm->stats[type].drm.shared += size;
> >>>>> +   else
> >>>>> +           vm->stats[type].drm.private += size;
> >>>>> +   if (res)
> >>>>> +           vm->stats[type].drm.resident += size;
> >>>>> +   if (bo->flags & AMDGPU_GEM_CREATE_DISCARDABLE)
> >>>>> +           vm->stats[type].drm.purgeable += size;
> >>>>> +
> >>>>> +   if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> >>>>> +           vm->stats[TTM_PL_VRAM].requested += size;
> >>>>> +           if (type != TTM_PL_VRAM)
> >>>>> +                   vm->stats[TTM_PL_VRAM].evicted += size;
> >>>> Again that is incorrect. BOs can be created with VRAM|GTT as their
> >>>> placement.
> >>>>
> >>>> If such a BO is placed into GTT that doesn't mean it is evicted.
> >>> In that case, do we count BO with VRAM|GTT in both VRAM and GTT's
> >>> .requested field? and if they are not in either, they go in both
> >>> .evicted field?
> >>
> >> Oh, good question depends on the definition of the requested field.
> >>
> >> Accounting it to VRAM.evicted while GTT placement is desirable as
> >> well is certainly not correct.
> >>
> >>  From my understanding they should go into the VRAM request, but not
> >> account to evicted. But Tvrtko might see that differently.
> >
> > Semantics of requested and evicted are kind of amdgpu 'legacy' thing.
> > So the question is whether or not they should keep matching.
> > Originally they were like this (I will edit out parts which deal with
> > CPU visible for ease of comparison, and which have since been removed
> > anyway):
> >
> >        if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> >                stats->requested_vram += size;
> >                if (res->mem_type != TTM_PL_VRAM)
> >                         stats->evicted_vram += size;
> >         } else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
> >                stats->requested_gtt += size;
> >         }
> >
> > So the part about accounting as evicted with dual preferred placement
> > was there from the start.
> >
> > Then after my changes:
> >
> >         if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> >                 stats[TTM_PL_VRAM].requested += size;
> >                 if (type != TTM_PL_VRAM) {
> >                         stats[TTM_PL_VRAM].evicted += size;
> >                 }
> >         } else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
> >                 stats[TTM_PL_TT].requested += size;
> >         }
> >
> > I mostly kept the same semantics.
> >
> > Teddy's version keeps it the same:
> >
> >     if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> >         vm->stats[TTM_PL_VRAM].requested += size;
> >         if (type != TTM_PL_VRAM)
> >             vm->stats[TTM_PL_VRAM].evicted += size;
> >     } else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
> >         vm->stats[TTM_PL_TT].requested += size;
> >     }
> >
> > If no AMD tools depend on the legacy semantics for evicted/requested
> > we can change them. There is some overlap with the standard keys
> > anyway and the fact preferred mask is unordered made the original
> > behaviour a bit presumptuous to begin with. In summary I think it
> > depends on whether we need to keep the legacy semantics, or even the
> > keys themselves.
>
> As far as I know we don't have any dependency on the amdgpu specific keys.
>
> But I need to double check what drm-usage-stats.rst actually defines.
> Maybe that doesn't really match what we have in amdgpu and other TTM drivers as
> information.
>
> Thanks,
> Christian.

If we go with the above definition for drm-total-, I can change the amd-requested- fields to just mirror drm-total-, and have amd-evicted-vram check "preferred_domains & amdgpu_mem_type_to_domain(memtype) == 0", that should cover it?

Teddy

> >
> >
> > Regards,
> >
> > Tvrtko
> >
> >>>>> @@ -2612,7 +2707,6 @@ void amdgpu_vm_fini(struct amdgpu_device
> >>>>> *adev, struct amdgpu_vm *vm)
> >>>>>
> >>>>>      root = amdgpu_bo_ref(vm->root.bo);
> >>>>>      amdgpu_bo_reserve(root, true);
> >>>>> -   amdgpu_vm_put_task_info(vm->task_info);
> >>>>>      amdgpu_vm_set_pasid(adev, vm, 0);
> >>>>>      dma_fence_wait(vm->last_unlocked, false);
> >>>>>      dma_fence_put(vm->last_unlocked); @@ -2660,6 +2754,15 @@ void
> >>>>> amdgpu_vm_fini(struct amdgpu_device *adev,
> >>>> struct amdgpu_vm *vm)
> >>>>>              }
> >>>>>      }
> >>>>>
> >>>>> +   if (!amdgpu_vm_stats_is_zero(vm)) {
> >>>>> +           struct amdgpu_task_info *ti = vm->task_info;
> >>>>> +
> >>>>> +           dev_warn(adev->dev,
> >>>>> +                    "VM memory stats for proc %s(%d) task %s(%d)
> >>>>> is non-zero
> >>>> when fini\n",
> >>>>> + ti->process_name, ti->pid, ti->task_name, ti->tgid);
> >>>>> +   }
> >>>>> +
> >>>>> +   amdgpu_vm_put_task_info(vm->task_info);
> >>>> Please don't move the call to amdgpu_vm_put_task_info().
> >>> Is keeping the task_info alive a hazard here? I could copy out the
> >>> info, it just seemed a bit wasteful.
> >>
> >> Ah, now I see why you have moved that.
> >>
> >> IIRC we need to free up the task info before releasing the PASID, but
> >> that info might be outdated. Need to check the code.
> >>
> >> Does it work if you move the message further up or does the root PD
> >> then break your neck because it isn't released yet?
> >>
> >> Thanks,
> >> Christian.
> >>
> >>>
> >>> Regards,
> >>> Teddy
> >>