[PATCH v5 4/4] drm/amdgpu: track bo memory stats at runtime

Wed Oct 23 12:24:42 UTC 2024

On 23/10/2024 13:12, Christian König wrote:
> Am 23.10.24 um 13:37 schrieb Tvrtko Ursulin:
>>
>> On 23/10/2024 10:14, Christian König wrote:
>>> Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin:
>>>>
>>>> On 22/10/2024 17:24, Christian König wrote:
>>>>> Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
>>>>>> [Public]
>>>>>>
>>>>>>>> +static uint32_t fold_memtype(uint32_t memtype) {
>>>>>>> In general please add prefixes to even static functions, e.g. 
>>>>>>> amdgpu_vm_ or
>>>>>>> amdgpu_bo_.
>>>>>>>
>>>>>>>> +   /* Squash private placements into 'cpu' to keep the legacy 
>>>>>>>> userspace view.
>>>>>>> */
>>>>>>>> +   switch (mem_type) {
>>>>>>>> +   case TTM_PL_VRAM:
>>>>>>>> +   case TTM_PL_TT:
>>>>>>>> +           return memtype
>>>>>>>> +   default:
>>>>>>>> +           return TTM_PL_SYSTEM;
>>>>>>>> +   }
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static uint32_t bo_get_memtype(struct amdgpu_bo *bo) {
>>>>>>> That whole function belongs into amdgpu_bo.c
>>>>>> Do you mean bo_get_memtype or fold_memtype? I debated whether 
>>>>>> bo_get_memtype should go into amdgpu_vm.c or amdgpu_bo.c, and 
>>>>>> since it's using fold_memtype and only useful for memory stats 
>>>>>> because of folding the private placements I just left them here 
>>>>>> together with the other mem stats code.
>>>>>>
>>>>>> I can move it to amdgpu_bo.c make it return the memtype verbatim 
>>>>>> and just fold it when I do the accounting.
>>>>>
>>>>> I think that folding GDS, GWS and OA into system is also a bug. We 
>>>>> should really not doing that.
>>>>>
>>>>> Just wanted to point out for this round that the code to query the 
>>>>> current placement from a BO should probably go into amdgpu_bo.c and 
>>>>> not amdgpu_vm.c
>>>>>
>>>>>>
>>>>>>>> +   struct ttm_resource *res = bo->tbo.resource;
>>>>>>>> +   const uint32_t domain_to_pl[] = {
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_CPU)]      = TTM_PL_SYSTEM,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_GTT)]      = TTM_PL_TT,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_VRAM)]     = TTM_PL_VRAM,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_GDS)]      = AMDGPU_PL_GDS,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_GWS)]      = AMDGPU_PL_GWS,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_OA)]       = AMDGPU_PL_OA,
>>>>>>>> +           [ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] =
>>>>>>> AMDGPU_PL_DOORBELL,
>>>>>>>> +   };
>>>>>>>> +   uint32_t domain;
>>>>>>>> +
>>>>>>>> +   if (res)
>>>>>>>> +           return fold_memtype(res->mem_type);
>>>>>>>> +
>>>>>>>> +   /*
>>>>>>>> +    * If no backing store use one of the preferred domain for 
>>>>>>>> basic
>>>>>>>> +    * stats. We take the MSB since that should give a reasonable
>>>>>>>> +    * view.
>>>>>>>> +    */
>>>>>>>> +   BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT || TTM_PL_VRAM <
>>>>>>> TTM_PL_SYSTEM);
>>>>>>>> +   domain = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
>>>>>>>> +   if (drm_WARN_ON_ONCE(&adev->ddev,
>>>>>>>> +                        domain == 0 || --domain >= 
>>>>>>>> ARRAY_SIZE(domain_to_pl)))
>>>>>>> It's perfectly legal to create a BO without a placement. That one 
>>>>>>> just won't have a
>>>>>>> backing store.
>>>>>>>
>>>>>> This is lifted from the previous change I'm rebasing onto. I think 
>>>>>> what it’s trying to do is if the BO doesn't have a placement, use 
>>>>>> the "biggest" (VRAM > TT > SYSTEM) preferred placement for the 
>>>>>> purpose of accounting. Previously we just ignore BOs that doesn't 
>>>>>> have a placement. I guess there's argument for going with either 
>>>>>> approaches.
>>>>>
>>>>> I was not arguing, I'm simply pointing out a bug. It's perfectly 
>>>>> valid for bo->preferred_domains to be 0.
>>>>>
>>>>> So the following WARN_ON() that no bit is set is incorrect.
>>>>>
>>>>>>
>>>>>>>> +           return 0;
>>>>>>>> +   return fold_memtype(domain_to_pl[domain])
>>>>>>> That would need specular execution mitigation if I'm not 
>>>>>>> completely mistaken.
>>>>>>>
>>>>>>> Better use a switch/case statement.
>>>>>>>
>>>>>> Do you mean change the array indexing to a switch statement?
>>>>>
>>>>> Yes.
>>>>
>>>> Did you mean array_index_nospec?
>>>
>>> Yes.
>>>
>>>> Domain is not a direct userspace input and is calculated from the 
>>>> mask which sanitized to allowed values prior to this call. So I 
>>>> *think* switch is an overkill but don't mind it either. Just 
>>>> commenting FWIW.
>>>
>>> I missed that the mask is applied.
>>>
>>> Thinking more about it I'm not sure if we should do this conversion 
>>> in the first place. IIRC Tvrtko you once suggested a patch which 
>>> switched a bunch of code to use the TTM placement instead of the UAPI 
>>> flags.
>>
>> Maybe 8fb0efb10184 ("drm/amdgpu: Reduce mem_type to domain double 
>> indirection") is what are you thinking of?
> 
> Yes, exactly that one.
> 
>>
>>> Going more into this direction I think when we want to look at the 
>>> current placement we should probably also use the TTM PL enumeration 
>>> directly.
>>
>> It does this already. The placement flags are just to "invent" a TTM 
>> PL enum when bo->tbo.resource == NULL.
> 
> Ah, good point! I though we would do the mapping the other way around.
> 
> In this case that is even more something we should probably not do at all.
> 
> When bo->tbo.resource is NULL then this BO isn't resident at all, so it 
> should not account to resident memory.

It doesn't, only for total. I should have pasted more context..:

	struct ttm_resource *res = bo->tbo.resource;
...
         /* DRM stats common fields: */

         stats[type].total += size;
         if (drm_gem_object_is_shared_for_memory_stats(obj))
                 stats[type].drm.shared += size;
         else
                 stats[type].drm.private += size;

         if (res) {
                 stats[type].drm.resident += size

So if no current placement it does not count towards drm-resident-, only 
drm-total- (which is drm.private + drm.resident). Total and resident 
intend to be analogue to for instance VIRT and RES in top(1), or VZS and 
RSS in ps(1).

>> Again, based of the same enum. Not sure if you have something other in 
>> mind or you are happy with that?
> 
> I think that for drm-total-* we should use the GEM flags and for 
> drm-resident-* we should use the TTM placement.

Agreed! :)

>>
>> Then what Teddy does is IMO only tangential, he just changes when 
>> stats are collected and not this aspect.
> 
> Yeah, right but we should probably fix it up in the right way while on it.

Okay, we just need to align on is there a problem and how to fix it.

>> To fold or not the special placements (GWS, GDS & co) is also 
>> tangential. In my patch I just preserved the legacy behaviour so it 
>> can easily be tweaked on top.
> 
> Yeah, but again the original behavior is completely broken.
> 
> GWS, GDS and OA are counted in blocks of HW units (multiplied by 
> PAGE_SIZE IIRC to avoid some GEM&TTM warnings).
> 
> When you accumulate that anywhere in the memory stats then that is just 
> completely off.

Ooops. :) Are they backed by some memory though, be it system or VRAM?

Regards,

Tvrtko