[RFC 0/5] Add capacity key to fdinfo

Tvrtko Ursulin tvrtko.ursulin at igalia.com
Wed May 1 13:27:18 UTC 2024


Hi Alex,

On 30/04/2024 19:32, Alex Deucher wrote:
> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin at igalia.com> wrote:
>>
>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> I have noticed AMD GPUs can have more than one "engine" (ring?) of the same type
>> but amdgpu is not reporting that in fdinfo using the capacity engine tag.
>>
>> This series is therefore an attempt to improve that, but only an RFC since it is
>> quite likely I got stuff wrong on the first attempt. Or, if not wrong, it may not
>> be very beneficial in AMD's case.
>>
>> So I tried to figure out how to count and store the number of instances of an
>> "engine" type and spotted that could perhaps be used in more than one place in
>> the driver. I was more than a little bit confused by the ip_instance and uapi
>> rings, then how rings are selected to context entities internally. Anyway..
>> hopefully it is a simple enough series to easily spot any such large misses.
>>
>> End result should be that, assuming two "engine" instances with one fully loaded
>> and one idle, a client fully loading the busy one will only be reported as using
>> 50% of that engine type.
> 
> That would only be true if there are multiple instantiations of the IP
> on the chip which in most cases is not true.  In most cases there is
> one instance of the IP that can be fed from multiple rings.  E.g. for
> graphics and compute, all of the rings ultimately feed into the same
> compute units on the chip.  So if you have a gfx ring and a compute
> ring, you can schedule work to them asynchronously, but ultimately
> whether they execute serially or in parallel depends on the actual
> shader code in the command buffers and the extent to which it can
> utilize the available compute units in the shader cores.

This is the same as with Intel/i915. Fdinfo is not intended to provide 
utilisation of EUs and such, just how busy the "entities" the kernel 
submits to are. So doing something like in this series would make the 
reporting more similar between the two drivers.
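
For reference, the per-client keys involved would look roughly like the 
below. The key names are the ones documented in 
Documentation/gpu/drm-usage-stats.rst, while the values, the client id 
and the choice of the compute engine are made up for illustration:

  drm-driver:           amdgpu
  drm-client-id:        42
  drm-engine-compute:   123456789 ns
  drm-capacity-compute: 2

A tool reading this is expected to normalise the drm-engine-compute 
delta against the capacity of two, so one fully loaded ring plus one 
idle ring would show as 50%.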

I think both the 0-800% and the 0-100% range (taking 8-ring compute as 
an example) can be misleading for different workloads. Neither does 
<800% in the former mean one can send more work, nor does <100% in the 
latter.
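
To sketch what I mean by the 0-100% interpretation, a monitoring tool 
could normalise roughly like below. This is a hypothetical userspace 
helper, not part of the series; only the fdinfo keys it would sample 
(drm-engine-<name> and drm-capacity-<name>) are real:

  #include <stdint.h>

  /*
   * Normalise the busy time delta sampled from drm-engine-<name>
   * against the elapsed wall clock time and the drm-capacity-<name>
   * value, so that 1.0 means the whole group of rings is saturated
   * rather than a single ring.
   */
  static double engine_utilisation(uint64_t busy_ns_delta,
                                   uint64_t wall_ns_delta,
                                   uint32_t capacity)
  {
          if (!wall_ns_delta || !capacity)
                  return 0.0;

          return (double)busy_ns_delta /
                 ((double)wall_ns_delta * capacity);
  }

With a capacity of eight and one ring fully busy over the sampling 
period this reads 12.5%, whereas the unnormalised 0-800% view would 
show 100%.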

There is also a parallel with the CPU world here, hyper-threading if 
not wider, where "What does 100% actually mean?" is also wishy-washy.

Also note that the reporting of actual time based values in fdinfo 
would not change with this series.

Or, if you can guide me towards how to distinguish real vs fake 
parallelism in HW IP blocks, I could modify the series to only add 
capacity tags where there are truly independent blocks. That would be 
different from i915 though, where I did not bother with that 
distinction. (The reason was that assignment of, for instance, EUs to 
compute "rings" (command streamers in i915) was supposed to be 
re-configurable on the fly, so it did not make sense to try to be super 
smart in fdinfo.)

> As for the UAPI portion of this, we generally expose a limited number
> of rings to user space and then we use the GPU scheduler to load
> balance between all of the available rings of a type to try and
> extract as much parallelism as we can.

The part I do not understand is the purpose of the ring argument in, 
for instance, drm_amdgpu_cs_chunk_ib. It appears userspace can create 
up to N scheduling entities using different ring ids, but internally 
they can all map onto the same scheduler instances (depending on IP 
type it can be that each userspace ring maps to the same N hw rings, 
or, for rings with no drm sched load balancing, the userspace ring also 
does not appear to have a relation to the picked drm sched instance).
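
For context, this is the uapi struct I am referring to (paraphrased 
from include/uapi/drm/amdgpu_drm.h from memory, so please check the 
exact layout there):

  struct drm_amdgpu_cs_chunk_ib {
          __u32 _pad;
          /** AMDGPU_IB_FLAG_* */
          __u32 flags;
          /** Virtual address to begin IB execution */
          __u64 va_start;
          /** Size of submission */
          __u32 ib_bytes;
          /** HW IP to submit to */
          __u32 ip_type;
          /** HW IP index of the same type to submit to */
          __u32 ip_instance;
          __u32 ring;
  };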

So I neither understand how this ring is useful, nor how it does not 
create a problem for IP types which use drm_sched_pick_best. It appears 
that even if userspace created two scheduling entities with different 
ring ids they could randomly map to the same drm sched, aka the same hw 
ring, no?
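
To make the concern concrete, here is a toy model of pure "least 
loaded" placement. Nothing below is driver code and all names are 
invented; it only illustrates that the uapi ring id plays no part in 
the choice:

  #include <stdio.h>

  #define NUM_HW_RINGS 2

  /* Pending jobs per hw ring in this toy model. */
  static unsigned int pending[NUM_HW_RINGS];

  static int pick_least_loaded(void)
  {
          int best = 0;

          for (int i = 1; i < NUM_HW_RINGS; i++)
                  if (pending[i] < pending[best])
                          best = i;

          return best;
  }

  int main(void)
  {
          /* Userspace asked for uapi rings 0 and 1 at entity creation. */
          for (int uapi_ring = 0; uapi_ring < 2; uapi_ring++)
                  printf("uapi ring %d -> hw ring %d\n",
                         uapi_ring, pick_least_loaded());

          return 0;
  }

With equal load at entity creation time both entities end up on hw 
ring 0 in this model, regardless of the ring id userspace passed in.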

Regards,

Tvrtko

> Alex
> 
> 
>>
>> Tvrtko Ursulin (5):
>>    drm/amdgpu: Cache number of rings per hw ip type
>>    drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
>>      ioctl
>>    drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>    drm/amdgpu: Show engine capacity in fdinfo
>>    drm/amdgpu: Only show VRAM in fdinfo if it exists
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 +++-------------------
>>   5 files changed, 49 insertions(+), 70 deletions(-)
>>
>> --
>> 2.44.0

