[RFC 0/5] Add capacity key to fdinfo
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Thu May 2 14:43:09 UTC 2024
On 02/05/2024 14:07, Christian König wrote:
> Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
>>
>> Hi Alex,
>>
>> On 30/04/2024 19:32, Alex Deucher wrote:
>>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin at igalia.com>
>>> wrote:
>>>>
>>>> From: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>>
>>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of the
>>>> same type, but amdgpu is not reporting that in fdinfo using the engine
>>>> capacity tag.
>>>>
>>>> This series is therefore an attempt to improve that, but it is only an
>>>> RFC since it is quite likely I got something wrong on the first attempt,
>>>> or, even if not wrong, it may not be very beneficial in AMD's case.
>>>>
>>>> So I tried to figure out how to count and store the number of instances
>>>> of an "engine" type, and spotted that it could perhaps be used in more
>>>> than one place in the driver. I was more than a little bit confused by
>>>> the ip_instance and uapi rings, and by how rings are mapped to context
>>>> entities internally. Anyway, hopefully it is a simple enough series that
>>>> any such large misses are easy to spot.
>>>>
>>>> The end result should be that, assuming two "engine" instances, a client
>>>> fully loading one while the other sits idle will be reported as using
>>>> only 50% of that engine type.
>>>
>>> That would only be true if there are multiple instantiations of the IP
>>> on the chip which in most cases is not true. In most cases there is
>>> one instance of the IP that can be fed from multiple rings. E.g. for
>>> graphics and compute, all of the rings ultimately feed into the same
>>> compute units on the chip. So if you have a gfx ring and a compute
>>> ring, you can schedule work to them asynchronously, but ultimately
>>> whether they execute serially or in parallel depends on the actual
>>> shader code in the command buffers and the extent to which it can
>>> utilize the available compute units in the shader cores.
>>
>> This is the same as with Intel/i915. Fdinfo is not intended to provide
>> utilisation of EUs and such, just how busy the "entities" the kernel
>> submits to are. So doing something like in this series would make the
>> reporting more similar between the two drivers.
>>
>> I think both the 0-800% and the 0-100% ranges (taking 8-ring compute as
>> an example) can be misleading for different workloads. <800% in the
>> former does not mean one can send more work, and the same goes for
>> <100% in the latter.
>
> Yeah, I think that's what Alex is trying to describe. With 8 compute
> rings in use, an 800% load figure is actually incorrect and quite
> misleading.
>
> The background is that those 8 compute rings won't all be active at the
> same time, but will rather be waiting on each other for resources.
>
> But this "waiting" is unfortunately counted as execution time, since the
> approach used is not really capable of separating waiting from execution
> time.

Right, so 800% is what gputop could be suggesting today, by virtue of the
fact that 8 contexts/clients can each show 100% if they only use a subset
of the compute units. I was proposing to expose the capacity in fdinfo so
that the figure can be scaled down, and then discussing how both
approaches have their pros and cons.
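
To make the scaled-down variant concrete, a minimal consumer-side sketch,
assuming the drm-engine-<name> busy-time counters and the
drm-engine-capacity-<name> key from the drm-usage-stats documentation; the
fdinfo parsing and sampling loop are omitted and the helper names are made
up:

/*
 * Sketch of consumer-side scaling, e.g. for a gputop-like tool. Inputs
 * are two samples of the "drm-engine-<name>" value (cumulative busy time
 * in nanoseconds) plus the "drm-engine-capacity-<name>" value. Helper and
 * variable names are invented for this example.
 */
#include <stdint.h>
#include <stdio.h>

/* Returns 0-100% utilisation of the whole engine *type*. */
static double engine_type_util(uint64_t busy_prev_ns, uint64_t busy_now_ns,
                               uint64_t period_ns, unsigned int capacity)
{
    if (!capacity)
        capacity = 1;   /* the capacity key is optional, default to 1 */

    return 100.0 * (double)(busy_now_ns - busy_prev_ns) /
           ((double)period_ns * capacity);
}

int main(void)
{
    /*
     * Invented numbers: over a one second sample one of two compute
     * rings was fully busy while the other was idle.
     */
    printf("compute: %.0f%%\n",
           engine_type_util(0, 1000000000ULL, 1000000000ULL, 2));

    return 0;   /* prints "compute: 50%" */
}
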
>> There is also a parallel with the CPU world here and hyper threading,
>> if not wider, where "What does 100% actually mean?" is also wishy-washy.
>>
>> Also note that the reporting of actual time-based values in fdinfo
>> would not change with this series.
>>
>> Or, if you can guide me towards how to distinguish real vs fake
>> parallelism in HW IP blocks, I could modify the series to only add
>> capacity tags where there are truly independent blocks. That would be
>> different from i915 though, where I did not bother with that
>> distinction. (The reason being that assignment of, for instance, EUs to
>> compute "rings" (command streamers in i915) was supposed to be
>> re-configurable on the fly, so it did not make sense to try to be super
>> smart in fdinfo.)
>
> Well, exactly, that's the point: we don't really have truly independent
> blocks on AMD hardware.
>
> There are things like independent SDMA instances, but those are meant to
> be used with, e.g., the first instance for uploads and the second for
> downloads etc. When you use both instances for the same job they will
> pretty much limit each other because they share a single resource.

So _never_ multiple instances of the same IP block? No video decode,
encode, anything?
>>> As for the UAPI portion of this, we generally expose a limited number
>>> of rings to user space and then we use the GPU scheduler to load
>>> balance between all of the available rings of a type to try and
>>> extract as much parallelism as we can.
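
Purely to illustrate what load balancing across the rings of one type
means, a deliberately simplified userspace model is sketched below; it is
not the DRM scheduler implementation and all names in it are invented:

/*
 * Deliberately simplified model of "load balance between the available
 * rings of a type": each new job goes to the ring with the fewest jobs
 * still outstanding. Only an illustration of the idea; the DRM scheduler
 * does its balancing per entity inside the kernel.
 */
#include <stdio.h>

#define NUM_RINGS 4u

static unsigned int outstanding[NUM_RINGS]; /* jobs queued per ring */

static unsigned int pick_least_loaded(void)
{
    unsigned int best = 0;

    for (unsigned int i = 1; i < NUM_RINGS; i++)
        if (outstanding[i] < outstanding[best])
            best = i;

    return best;
}

int main(void)
{
    for (unsigned int job = 0; job < 8; job++) {
        unsigned int ring = pick_least_loaded();

        outstanding[ring]++;
        printf("job %u -> ring %u\n", job, ring);
    }

    return 0;
}
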
>>
>> The part I do not understand is the purpose of the ring argument in,
>> for instance, drm_amdgpu_cs_chunk_ib. It appears userspace can create
>> up to N scheduling entities using different ring ids, but internally
>> they can map onto the same scheduler instances (depending on the IP
>> type, it can be that each userspace ring maps to the same N hw rings,
>> or, for IP types without drm sched load balancing, the userspace ring
>> does not appear to have any relation to the picked drm sched instance).
>>
>> So I neither understand how this ring is useful, nor how it does not
>> create a problem for IP types which use drm_sched_pick_best. It
>> appears that even if userspace created two scheduling entities with
>> different ring ids, they could randomly map to the same drm sched, aka
>> the same hw ring, no?
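
For reference, the uapi chunk in question, filled with purely illustrative
values to show how userspace picks an IP type, instance and ring for a
submission; the surrounding drm_amdgpu_cs_chunk / CS ioctl plumbing is
omitted:

/*
 * The chunk discussed above, as declared in amdgpu_drm.h (shipped with
 * libdrm; the include path may differ). Values are illustrative only.
 */
#include <stdio.h>
#include <amdgpu_drm.h>

int main(void)
{
    struct drm_amdgpu_cs_chunk_ib ib = {
        .ip_type     = AMDGPU_HW_IP_COMPUTE, /* which IP block */
        .ip_instance = 0,                    /* instance of that IP type */
        .ring        = 1,                    /* the uapi "ring" id in question */
        .va_start    = 0,                    /* GPU VA of the IB (dummy here) */
        .ib_bytes    = 0,                    /* IB size in bytes (dummy here) */
        .flags       = 0,
    };

    printf("ip_type=%u ip_instance=%u ring=%u\n",
           ib.ip_type, ib.ip_instance, ib.ring);

    return 0;
}
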
>
> Yeah, that is correct. The multimedia instances have to use a "fixed"
> load balancing because of lack of firmware support. That should have
> been fixed by now but we never found time to actually validate it.

Gotcha.
> Regarding the "ring" parameter in CS, that is basically just there for
> backward compatibility with older userspace. E.g. so that we don't map
> all SDMA jobs to the same instance when only one context is used.

I see. In that sense, are the "limits" for compute in
amdgpu_ctx_num_entities arbitrary, or related to some old userspace
expectation?

Regards,

Tvrtko
> Regards,
> Christian.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Alex
>>>
>>>
>>>>
>>>> Tvrtko Ursulin (5):
>>>>   drm/amdgpu: Cache number of rings per hw ip type
>>>>   drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO ioctl
>>>>   drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>>>   drm/amdgpu: Show engine capacity in fdinfo
>>>>   drm/amdgpu: Only show VRAM in fdinfo if it exists
>>>>
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 +++-------------------
>>>>  5 files changed, 49 insertions(+), 70 deletions(-)
>>>>
>>>> --
>>>> 2.44.0
>