[PATCH 2/2] drm/amdgpu: add AMDGPU_INFO_VM_STAT to return GPU VM

Christian König ckoenig.leichtzumerken at gmail.com
Tue Jan 24 08:29:04 UTC 2023


Am 24.01.23 um 09:27 schrieb Marek Olšák:
> A new Gallium HUD "value producer" could be added that reads fdinfo 
> without calling the driver.

That sounds good. To be honest I would have plenty of use for that.

> I still think there is merit in having this in amdgpu_drm.h too.

Why?

Christian.

>
> Marek
>
> On Tue, Jan 24, 2023 at 3:13 AM Marek Olšák <maraeo at gmail.com> wrote:
>
>     The table of exposed driver-specific counters:
>     https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.c#L1751
>
>     Counter enums. They use the same interface as e.g. occlusion
>     queries, except that begin_query and end_query save the results in
>     the driver/CPU.
>     https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.h#L45
>
>     Counters exposed by the winsys:
>     https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/include/winsys/radeon_winsys.h#L126
>
>     I just need to query the counters in the winsys and return them.
>
>     Marek
>
>     On Tue, Jan 24, 2023 at 2:58 AM Christian König
>     <ckoenig.leichtzumerken at gmail.com> wrote:
>
>         How are the counters which the HUD consumes declared?
>
>         See what I want to avoid is a) to nail down the interface with
>         the kernel on specific values and b) make it possible to
>         easily expose new values.
>
>         In other words what we could do with fdinfo is to have
>         something like this:
>
>         GALLIUM_FDINFO_HUD=drm-memory-vram,amd-evicted-vram,amd-mclk
>         glxgears
>
>         And the HUD just displays the values the kernel provides
>         without the need to re-compile mesa when we want to add some
>         more values nor have the values as part of the UAPI.
>
>         Christian.
>
>         Am 24.01.23 um 08:37 schrieb Marek Olšák:
>>         The Gallium HUD doesn't consume strings. It only consumes
>>         values that are exposed as counters from the driver. In this
>>         case, we need the driver to expose evicted stats as counters.
>>         Each counter can set whether the value is absolute (e.g.
>>         memory usage) or monotonic (e.g. perf counter). Parsing
>>         fdinfo to get the values is undesirable.
>>
>>         Marek
>>
>>         On Mon, Jan 23, 2023 at 4:31 AM Christian König
>>         <ckoenig.leichtzumerken at gmail.com> wrote:
>>
>>             Let's do this as valid in fdinfo.
>>
>>             This way we can easily extend whatever the kernel wants
>>             to display as statistics in the userspace HUD.
>>
>>             Regards,
>>             Christian.
>>
>>             Am 21.01.23 um 01:45 schrieb Marek Olšák:
>>>             We badly need a way to query evicted memory usage. It's
>>>             essential for investigating performance problems and it
>>>             uncovered the buddy allocator disaster. Please either
>>>             suggest an alternative, suggest changes, or review. We
>>>             need it ASAP.
>>>
>>>             Thanks,
>>>             Marek
>>>
>>>             On Tue, Jan 10, 2023 at 11:55 AM Marek Olšák
>>>             <maraeo at gmail.com> wrote:
>>>
>>>                 On Tue, Jan 10, 2023 at 11:23 AM Christian König
>>>                 <ckoenig.leichtzumerken at gmail.com> wrote:
>>>
>>>                     Am 10.01.23 um 16:28 schrieb Marek Olšák:
>>>>                     On Wed, Jan 4, 2023 at 9:51 AM Christian König
>>>>                     <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>
>>>>                         Am 04.01.23 um 00:08 schrieb Marek Olšák:
>>>>>                         I see about the access now, but did you
>>>>>                         even look at the patch?
>>>>
>>>>                         I did look at the patch, but I haven't
>>>>                         fully understood yet what you are trying to
>>>>                         do here.
>>>>
>>>>
>>>>                     First and foremost, it returns the evicted size
>>>>                     of VRAM and visible VRAM, and returns visible
>>>>                     VRAM usage. It should be obvious which stat
>>>>                     includes the size of another.
>>>>
>>>>
>>>>>                         Because what the patch does isn't even
>>>>>                         exposed to common drm code, such as the
>>>>>                         preferred domain and visible VRAM
>>>>>                         placement, so it can't be in fdinfo right now.
>>>>>
>>>>>                         Or do you even know what fdinfo contains?
>>>>>                         Because it contains nothing useful. It
>>>>>                         only has VRAM and GTT usage, which we
>>>>>                         already have in the INFO ioctl, so it has
>>>>>                         nothing that we need. We mainly need the
>>>>>                         eviction information and visible VRAM
>>>>>                         information now. Everything else is a bonus.
>>>>
>>>>                         Well the main question is what are you
>>>>                         trying to get from that information? The
>>>>                         eviction list for example is completely
>>>>                         meaningless to userspace, that stuff is
>>>>                         only temporary and will be cleared on the
>>>>                         next CS again.
>>>>
>>>>
>>>>                     I don't know what you mean. The returned
>>>>                     eviction stats look correct and are stable
>>>>                     (they don't change much). You can suggest
>>>>                     changes if you think some numbers are not
>>>>                     reported correctly.
>>>>
>>>>
>>>>                         What we could expose is the VRAM
>>>>                         over-commit value, e.g. how much BOs which
>>>>                         where supposed to be in VRAM are in GTT
>>>>                         now. I think that's what you are looking
>>>>                         for here, right?
>>>>
>>>>
>>>>                     The VRAM overcommit value is "evicted_vram".
>>>>
>>>>
>>>>>                         Also, it's undesirable to open and parse a
>>>>>                         text file if we can just call an ioctl.
>>>>
>>>>                         Well I see the reasoning for that, but I
>>>>                         also see why other drivers do a lot of the
>>>>                         stuff we have as IOCTL as separate files in
>>>>                         sysfs, fdinfo or debugfs.
>>>>
>>>>                         Especially repeating all the static
>>>>                         information which were already available
>>>>                         under sysfs in the INFO IOCTL was a design
>>>>                         mistake as far as I can see. Just compare
>>>>                         what AMDGPU and the KFD code is doing to
>>>>                         what for example i915 is doing.
>>>>
>>>>                         Same for things like debug information
>>>>                         about a process. The fdinfo stuff can be
>>>>                         queried from external tools (gdb, gputop,
>>>>                         umr etc...) as well which makes that
>>>>                         interface more preferred.
>>>>
>>>>
>>>>                     Nothing uses fdinfo in Mesa. No driver uses
>>>>                     sysfs in Mesa except drm shims, noop drivers,
>>>>                     and Intel for perf metrics. sysfs itself is an
>>>>                     unusable mess for the PCIe query and is missing
>>>>                     information.
>>>>
>>>>                     I'm not against exposing more stuff through
>>>>                     sysfs and fdinfo for tools, but I don't see any
>>>>                     reason why drivers should use it (other than
>>>>                     for slowing down queries and initialization).
>>>
>>>                     That's what I'm asking: Is this for some tool or
>>>                     to make some driver decision based on it?
>>>
>>>                     If you just want the numbers for over displaying
>>>                     then I think it would be better to put this into
>>>                     fdinfo together with the other existing stuff there.
>>>
>>>
>>>                     If you want to make allocation decisions based
>>>                     on this then we should have that as IOCTL or
>>>                     even better as mmap() page between kernel and
>>>                     userspace. But in this case I would also
>>>                     calculation the numbers completely different as
>>>                     well.
>>>
>>>                     See we have at least the following things in the
>>>                     kernel:
>>>                     1. The eviction list in the VM.
>>>                         Those are the BOs which are currently
>>>                     evicted and tried to moved back in on the next CS.
>>>
>>>                     2. The VRAM over commit value.
>>>                         In other words how much more VRAM than
>>>                     available has the application tried to allocate?
>>>
>>>                     3. The visible VRAM usage by this application.
>>>
>>>                     The end goal is that the eviction list will go
>>>                     away, e.g. we will always have stable
>>>                     allocations based on allocations of other
>>>                     applications and not constantly swap things in
>>>                     and out.
>>>
>>>                     When you now expose the eviction list to
>>>                     userspace we will be stuck with this interface
>>>                     forever.
>>>
>>>
>>>                 It's for the GALLIUM HUD.
>>>
>>>                 The only missing thing is the size of all evicted
>>>                 VRAM allocations, and the size of all evicted
>>>                 visible VRAM allocations.
>>>
>>>                 1. No list is exposed. Only sums of buffer sizes are
>>>                 exposed. Also, the eviction list has no meaning
>>>                 here. All lists are treated equally, and mem_type is
>>>                 compared with preferred_domains to determine where
>>>                 buffers are and where they should be.
>>>
>>>                 2. I'm not interested in the overcommit value. I'm
>>>                 only interested in knowing the number of bytes of
>>>                 evicted VRAM right now. It can be as variable as the
>>>                 CPU load, but in practice it shouldn't be because
>>>                 PCIe doesn't have the bandwidth to move things quickly.
>>>
>>>                 3. Yes, that's true.
>>>
>>>                 Marek
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230124/846b4d61/attachment-0001.htm>


More information about the amd-gfx mailing list