Making drm_gpuvm work across gpu devices
Felix Kuehling
felix.kuehling at amd.com
Mon Jan 29 17:52:27 UTC 2024
On 2024-01-29 11:28, Christian König wrote:
> Am 29.01.24 um 17:24 schrieb Felix Kuehling:
>> On 2024-01-29 10:33, Christian König wrote:
>>> Am 29.01.24 um 16:03 schrieb Felix Kuehling:
>>>> On 2024-01-25 13:32, Daniel Vetter wrote:
>>>>> On Wed, Jan 24, 2024 at 09:33:12AM +0100, Christian König wrote:
>>>>>> Am 23.01.24 um 20:37 schrieb Zeng, Oak:
>>>>>>> [SNIP]
>>>>>>> Yes, most APIs are per-device based.
>>>>>>>
>>>>>>> One exception I know of is actually the kfd SVM API. If you look at
>>>>>>> the svm_ioctl function, it is per-process based. Each
>>>>>>> kfd_process represents a process across N gpu devices.
>>>>>> Yeah and that was a big mistake in my opinion. We should really
>>>>>> not do that
>>>>>> ever again.
>>>>>>
>>>>>>> It needs to be said that kfd SVM represents a shared virtual
>>>>>>> address space across the CPU and all GPU devices on the system.
>>>>>>> This follows from the definition of SVM (shared virtual memory).
>>>>>>> This is very different from our legacy gpu *device* drivers,
>>>>>>> which work for only one device (i.e., if you want one device to
>>>>>>> access another device's memory, you have to use dma-buf
>>>>>>> export/import etc.).
>>>>>> Exactly that thinking is what we have currently found to be a
>>>>>> blocker for virtualization projects. Having SVM as a
>>>>>> device-independent feature which somehow ties to the process
>>>>>> address space turned out to be an extremely bad idea.
>>>>>>
>>>>>> The background is that this only works for some use cases but not
>>>>>> all of
>>>>>> them.
>>>>>>
>>>>>> What's working much better is to just have a mirror functionality
>>>>>> which says
>>>>>> that a range A..B of the process address space is mapped into a
>>>>>> range C..D
>>>>>> of the GPU address space.
>>>>>>
>>>>>> Those ranges can then be used to implement the SVM feature
>>>>>> required for
>>>>>> higher level APIs and not something you need at the UAPI or even
>>>>>> inside the
>>>>>> low level kernel memory management.
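For illustration only, a minimal sketch of what such a per-device mirror
interface could look like. The struct, flags and ioctl number below are
hypothetical, not an existing UAPI; SVM-like behaviour would simply be the
special case where the CPU and GPU ranges are identical and span the whole
usable address space.

/* Hypothetical per-device interface: mirror a CPU VA range A..B into the
 * GPU VA space of a single device at C..D, where D = C + (B - A).
 */
#include "drm.h"

struct drm_mirror_range {
	__u64 cpu_va_start;	/* A: start of the CPU range             */
	__u64 cpu_va_end;	/* B: end of the CPU range (exclusive)   */
	__u64 gpu_va_start;	/* C: where the range appears on the GPU */
	__u32 flags;		/* e.g. read-only, fault-on-demand       */
	__u32 pad;
};

/* SVM-style semantics: cpu_va_start == gpu_va_start == 0 and the range
 * covering the whole usable address space.
 */
#define DRM_IOCTL_MIRROR_RANGE	DRM_IOWR(0x40, struct drm_mirror_range)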
>>>>>>
>>>>>> When you talk about migrating memory to a device you also do this
>>>>>> on a per
>>>>>> device basis and *not* tied to the process address space. If you
>>>>>> then get
>>>>>> crappy performance because userspace gave contradictory
>>>>>> information about where to migrate memory, then that's a bug in
>>>>>> userspace and not something
>>>>>> the kernel
>>>>>> should try to prevent somehow.
>>>>>>
>>>>>> [SNIP]
>>>>>>>> I think if you start using the same drm_gpuvm for multiple
>>>>>>>> devices you
>>>>>>>> will sooner or later start to run into the same mess we have
>>>>>>>> seen with
>>>>>>>> KFD, where we moved more and more functionality from the KFD to
>>>>>>>> the DRM
>>>>>>>> render node because we found that a lot of the stuff simply
>>>>>>>> doesn't work
>>>>>>>> correctly with a single object to maintain the state.
>>>>>>> As I understand it, KFD is designed to work across devices. A
>>>>>>> single pseudo /dev/kfd device represents all hardware gpu
>>>>>>> devices. That is why during kfd open, many pdds (process device
>>>>>>> data) are created, one for each hardware device for this process.
>>>>>> Yes, I'm perfectly aware of that. And I can only repeat myself
>>>>>> that I see
>>>>>> this design as a rather extreme failure. And I think it's one of
>>>>>> the reasons
>>>>>> why NVidia is so dominant with Cuda.
>>>>>>
>>>>>> This whole approach KFD takes was designed with the idea of
>>>>>> extending the
>>>>>> CPU process into the GPUs, but this idea only works for a few use
>>>>>> cases and
>>>>>> is not something we should apply to drivers in general.
>>>>>>
>>>>>> A very good example are virtualization use cases where you end up
>>>>>> with CPU
>>>>>> address != GPU address because the VAs are actually coming from
>>>>>> the guest VM
>>>>>> and not the host process.
>>>>>>
>>>>>> SVM is a high level concept of OpenCL, Cuda, ROCm, etc. This
>>>>>> should not have
>>>>>> any influence on the design of the kernel UAPI.
>>>>>>
>>>>>> If you want to do something similar as KFD for Xe I think you
>>>>>> need to get
>>>>>> explicit permission to do this from Dave and Daniel and maybe
>>>>>> even Linus.
>>>>> I think the one and only exception where an SVM uapi like in
>>>>> kfd makes
>>>>> sense is if the _hardware_ itself, not the software stack defined
>>>>> semantics that you've happened to build on top of that hw,
>>>>> enforces a 1:1
>>>>> mapping with the cpu process address space.
>>>>>
>>>>> Which means your hardware is using PASID, IOMMU based translation,
>>>>> PCI-ATS
>>>>> (address translation services) or whatever your hw calls it and
>>>>> has _no_
>>>>> device-side pagetables on top. Which from what I've seen all
>>>>> devices with
>>>>> device-memory have, simply because they need some place to store
>>>>> whether
>>>>> that memory is currently in device memory or should be translated
>>>>> using
>>>>> PASID. Currently there's no gpu that works with PASID only, but
>>>>> there are
>>>>> some on-cpu-die accelerator things that do work like that.
>>>>>
>>>>> Maybe in the future there will be some accelerators that are fully
>>>>> cpu
>>>>> cache coherent (including atomics) with something like CXL, and the
>>>>> on-device memory is managed as normal system memory with struct
>>>>> page as
>>>>> ZONE_DEVICE and accelerator va -> physical address translation is
>>>>> only
>>>>> done with PASID ... but for now I haven't seen that, definitely
>>>>> not in
>>>>> upstream drivers.
>>>>>
>>>>> And the moment you have some per-device pagetables or per-device
>>>>> memory
>>>>> management of some sort (like using gpuva mgr) then I'm 100%
>>>>> agreeing with
>>>>> Christian that the kfd SVM model is too strict and not a great idea.
>>>>
>>>> That basically means, without ATS/PRI+PASID you cannot implement a
>>>> unified memory programming model, where GPUs or accelerators access
>>>> virtual addresses without pre-registering them with an SVM API call.
>>>>
>>>> Unified memory is a feature implemented by the KFD SVM API and used
>>>> by ROCm. This is used e.g. to implement OpenMP USM (unified shared
>>>> memory). It's implemented with recoverable GPU page faults. If the
>>>> page fault interrupt handler cannot assume a shared virtual address
>>>> space, then implementing this feature isn't possible.
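For illustration, a stripped-down sketch of what such a recoverable-fault
handler looks like under the shared-address-space assumption. This is not
the actual amdgpu/KFD code; the function name is made up and invalidation
handling is reduced to a comment.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/sched/mm.h>

/* Resolve a recoverable GPU fault at 'addr'.  Because the GPU VA is assumed
 * to be identical to the CPU VA, the faulting address can be looked up
 * directly in the process mm; no per-device translation is needed first.
 */
static int svm_handle_gpu_fault(struct mmu_interval_notifier *notifier,
				unsigned long addr)
{
	struct mm_struct *mm = notifier->mm;
	unsigned long pfn;
	struct hmm_range range = {
		.notifier      = notifier,
		.start         = addr & PAGE_MASK,
		.end           = (addr & PAGE_MASK) + PAGE_SIZE,
		.hmm_pfns      = &pfn,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	if (!mmget_not_zero(mm))
		return -ESRCH;

	do {
		range.notifier_seq = mmu_interval_read_begin(notifier);
		mmap_read_lock(mm);
		ret = hmm_range_fault(&range);	/* fault the page in */
		mmap_read_unlock(mm);
	} while (ret == -EBUSY);

	mmput(mm);
	if (ret)
		return ret;

	/* Program the device page table: GPU VA == CPU VA == addr, backed by
	 * the page behind 'pfn'.  A real driver rechecks
	 * mmu_interval_read_retry() under its page table lock before
	 * committing and handles invalidations in the notifier callback.
	 */
	return 0;
}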
>>>
>>> Why not? As far as I can see the OpenMP USM is just another funky
>>> way of userptr handling.
>>>
>>> The difference is that with a userptr we assume that we always need
>>> to request the whole block A..B from a mapping while for page fault
>>> based handling it can be just any page in between A and B which is
>>> requested and made available to the GPU address space.
>>>
>>> As far as I can see there is absolutely no need for any special SVM
>>> handling.
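Again only a sketch, using the same hmm_range_fault() mechanism as above to
make the contrast concrete: a userptr-style mapping requests the complete
block A..B in one call, while a fault-based path brings in single pages on
demand. The function name and the simplified locking are my own.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Request the whole registered block [a, b) at once, one pfn slot per page,
 * instead of only the page that happened to fault.
 */
static int userptr_get_whole_block(struct mmu_interval_notifier *notifier,
				   unsigned long a, unsigned long b,
				   unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier      = notifier,
		.notifier_seq  = mmu_interval_read_begin(notifier),
		.start         = a,
		.end           = b,	/* pfns[] has (b - a) >> PAGE_SHIFT slots */
		.hmm_pfns      = pfns,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	mmap_read_lock(notifier->mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(notifier->mm);
	return ret;
}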
>>
>> It does assume a shared virtual address space between CPU and GPUs.
>> There are no API calls to tell the driver that address A on the CPU
>> maps to address B on the GPU1 and address C on GPU2. The KFD SVM API
>> was designed to work with this programming model, by augmenting the
>> shared virtual address mappings with virtual address range attributes
>> that can modify the migration policy and indicate prefetching,
>> prefaulting, etc. You could think of it as madvise on steroids.
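To make the "madvise on steroids" comparison concrete, here is a rough
userspace sketch against the AMDKFD_IOC_SVM ioctl. The attribute chosen and
the error handling are simplified; see include/uapi/linux/kfd_ioctl.h for
the full list of ops, attribute types and flags.

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

/* Set a prefetch hint on a shared virtual address range through the
 * per-process /dev/kfd node.  The start address is the same VA the CPU
 * uses; there is no separate GPU address to pass in.
 */
static int svm_set_prefetch_hint(int kfd_fd, void *addr, size_t size,
				 unsigned int location)
{
	size_t sz = sizeof(struct kfd_ioctl_svm_args) +
		    sizeof(struct kfd_ioctl_svm_attribute);
	struct kfd_ioctl_svm_args *args = calloc(1, sz);
	int ret;

	if (!args)
		return -1;

	args->start_addr = (unsigned long)addr;
	args->size       = size;
	args->op         = KFD_IOCTL_SVM_OP_SET_ATTR;
	args->nattr      = 1;
	args->attrs[0].type  = KFD_IOCTL_SVM_ATTR_PREFETCH_LOC;
	args->attrs[0].value = location;	/* GPU id, or 0 for system memory */

	ret = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
	free(args);
	return ret;
}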
>
> Yeah, so what? In this case you just say through an IOCTL that CPU
> range A..B should map to GPU range C..D and for A/B and C/D you use
> the maximum of the address space.
What I want is that address range A..B on the CPU matches A..B on the
GPU, because I'm sharing pointers between CPU and GPU. I can't think of
any sane user mode using a unified memory programming model that would
ever ask KFD to map unified memory mappings to a different address range
on the GPU. Adding such an ioctl is a complete waste of time, and can
only serve to add unnecessary complexity.
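To spell out the pointer-sharing point with a purely illustrative host-side
snippet: pointer-carrying data structures built with plain malloc() are
handed to the GPU by value, and the GPU dereferences the embedded pointers
as-is, so an address has to mean the same thing on both sides.

#include <stdlib.h>

struct node {
	struct node *next;	/* raw CPU virtual address, also followed by the GPU */
	int payload;
};

/* Build an ordinary malloc'd linked list.  Under unified memory 'head' is
 * later passed unchanged to a GPU kernel that walks e->next, which only
 * works if the GPU sees exactly the same addresses as the CPU.
 */
static struct node *build_list(int n)
{
	struct node *head = NULL;

	while (n--) {
		struct node *e = malloc(sizeof(*e));

		if (!e)
			break;	/* keep the sketch simple */
		e->payload = n;
		e->next = head;
		head = e;
	}
	return head;
}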
Regards,
Felix
>
> There is no restriction that this needs to be accurate in any way. It's
> just that it can be accurate to be more efficient and eventually use
> only a fraction of the address space instead of all of it for some use
> cases.
>
> So this isn't a blocker, it's just one special use case.
>
> Regards,
> Christian.
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>>
>>>>> Cheers, Sima
>>>
>