[RFC] GPU driver with separate "core" and "DRM" modules

Ben Skeggs bskeggs at nvidia.com
Mon Jun 17 19:01:09 UTC 2024


On 18/6/24 03:30, Danilo Krummrich wrote:

> On Fri, Jun 14, 2024 at 03:02:09AM +1000, Ben Skeggs wrote:
>> NVIDIA has been exploring ways to better support the effort for an
>> upstream kernel mode driver for GPUs that are capable of running GSP-RM
>> firmware, since the introduction[1] to Nova.
>>
>> Use cases have been identified for which separating the core GPU
>> programming out of the full DRM driver stack is a strong requirement
>> from our key customers.
>>
>> An upstreamed NVIDIA GPU driver should be able to support current and
>> emerging customer use cases for vGPU hosts.  NVIDIA's vGPU deployments
>> to date do not support compute or graphics functionality within the
>> hypervisor host, and have no dependency on the Linux graphics subsystem,
>> instead implementing the minimal functionality required to run vGPU
>> guest VMs.
>>
>> For security-sensitive environments such as cloud infrastructure, it's
>> important to continue support for running a minimal footprint vGPU host
>> driver in a stripped-down / barebones kernel environment.
>>
>> This can be achieved by supporting both VFIO and DRM drivers as clients
>> of a core driver, without requiring a full-fledged DRM driver (or the
>> DRM subsystem itself) to be built into the host kernel.
>>
>> A core driver would be responsible for booting and communicating with
>> GSP-RM, enumeration of HW configuration, shared/partitioned resource
>> management, exception handling, and event dispatch.
>>
>> The DRM driver would do all the standard things a DRM driver does, and
>> implement GPU memory management (TTM/HMM), KMS, command submission etc,
>> as well as providing UAPI for userspace clients.  These features would
>> be implemented using HW resources allocated from a core driver, rather
>> than the DRM driver being directly responsible for HW programming.
>>
>> As Nouveau's KMD is already split (in the logical sense) along similar
>> lines, we're using it here for the purposes of this RFC to demonstrate
>> the feasibility of such an architecture, and open it up for discussion.
> Generally, I think that approach is reasonable and I like it. There are only a
> few concerns I have for now.
>
> We've already had (and still have) quite a few difficulties due to this split in
> Nouveau, especially when it comes to VMM and handling page tables. There are
> cases where the locking architecture must be closely aligned with the upper
> layers, i.e. the (VM_BIND) uAPI.
>
> Having a separate (local) locking architecture doesn't work out well in this
> case due to the implications of dealing with dma_fences and their signalling
> paths.
>
> Unfortunately, we can't even argue that we solved this problem in Nouveau. I
> think it's fair to say that we found ways (without rewriting / restructuring a
> lot of the VMM code to use a more global locking architecture) to make it work
> in practice, but surely there are still conditions that (at least theoretically)
> can lock things up.
>
> I'm not saying that it's impossible to work this out, but having a strong
> separation is likely to make those things quite a bit more difficult.

Yeah, I think there's a bit of work ahead to determine where exactly all
the pieces should live.  For VMM specifically, I'd be looking more at an
architecture where the DRM driver "owns" all the memory remaining after
GSP has booted (or whatever was allocated to its VFIO partition), and
manages it however it sees fit.  And, rather than calling into the core
driver for mapping into a VMM, the DRM driver would allocate its own PDB,
inform the core driver of its location, and manage its own page tables
directly from there.

This is similar to how the interface between NVKM and GSP-RM works now 
at least.
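
A minimal user-space sketch of that ownership split, assuming
hypothetical names throughout (core_client, drm_vmm, core_set_pdb and
the rest are illustrations only, not nouveau, NVKM or GSP-RM
interfaces).  The core side only hands over a memory region and records
where the PDB lives; the DRM side allocates the PDB and any further
page-table pages from memory it owns, without calling back into the
core:

```c
#include <assert.h>
#include <stdint.h>

/* All names here are hypothetical, for illustration only. */
#define PAGE_SIZE 4096ull

/* Core-driver view of one client: the memory it was given, plus the
 * location of the page directory the client registered. */
struct core_client {
	uint64_t base;      /* start of memory left after GSP boot */
	uint64_t size;      /* size of that region in bytes */
	uint64_t pdb_addr;  /* PDB location reported by the client */
	int pdb_valid;
};

/* DRM-driver view: a trivial bump allocator over the region it owns. */
struct drm_vmm {
	struct core_client *core;
	uint64_t next_free;
	uint64_t pdb_addr;
};

/* Core driver: hand the whole remaining region to the client. */
static void core_client_init(struct core_client *c, uint64_t base,
			     uint64_t size)
{
	assert(base % PAGE_SIZE == 0);
	c->base = base;
	c->size = size;
	c->pdb_addr = 0;
	c->pdb_valid = 0;
}

/* Core driver: the only VMM-related call in this sketch.  It records
 * the PDB location after checking alignment and bounds. */
static int core_set_pdb(struct core_client *c, uint64_t addr)
{
	uint64_t off;

	if (addr % PAGE_SIZE || addr < c->base)
		return -1;
	off = addr - c->base;
	if (off >= c->size || c->size - off < PAGE_SIZE)
		return -1;
	c->pdb_addr = addr;
	c->pdb_valid = 1;
	return 0;
}

/* DRM driver: allocate one page from memory it owns.  The core driver
 * is not involved. */
static int drm_alloc_page(struct drm_vmm *vmm, uint64_t *addr)
{
	uint64_t end = vmm->core->base + vmm->core->size;

	if (end - vmm->next_free < PAGE_SIZE)
		return -1;
	*addr = vmm->next_free;
	vmm->next_free += PAGE_SIZE;
	return 0;
}

/* DRM driver: allocate its own PDB, then tell the core where it is. */
static int drm_vmm_init(struct drm_vmm *vmm, struct core_client *c)
{
	vmm->core = c;
	vmm->next_free = c->base;
	if (drm_alloc_page(vmm, &vmm->pdb_addr))
		return -1;
	return core_set_pdb(c, vmm->pdb_addr);
}
```

After drm_vmm_init(), further page-table pages would come from
drm_alloc_page() alone, which is the point of the split: in this sketch
the locking around page-table updates can live entirely on the DRM
side.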

I've looked a little bit recently into how to approach fixing this in 
nouveau, but don't have a solid plan yet.

>
> On the other hand, this is a problem we might have to deal with either way, so
> it shouldn't matter too much having separate modules for VFIO and the GPU core.
>
> Besides that, do we expect semantic changes in the firmware that can
> potentially propagate up in the following sense?
>
> [GSP firmware -> Host GPU core driver -> VFIO driver -> Guest GPU core driver]
>
> If so, how do we deal with those? In the context of ensuring compatibility, can
> we ensure this can't lead to increasing maintenance and testing effort over
> time?

That's a very good question.  I suspect it's inevitable that those types
of changes could flow through from FW updates, and we'll need to come up
with a plan for how to deal with them.  In general, it seems like the
same kind of problem as with maintaining UAPI compatibility, but I'm not
sure if there are any additional considerations for virt.

Ben.

>
> - Danilo
>
>> A link[2] to a tree containing the patches is below.
>>
>> [1] https://lore.kernel.org/all/3ed356488c9b0ca93845501425d427309f4cf616.camel@redhat.com/
>> [2] https://gitlab.freedesktop.org/bskeggs/nouveau/-/tree/00.03-module
>>
>> Ben Skeggs (2):
>>    drm/nouveau/nvkm: export symbols needed by the drm driver
>>    drm/nouveau/nvkm: separate out into nvkm.ko
>>
>>   drivers/gpu/drm/nouveau/Kbuild                      |  4 ++--
>>   drivers/gpu/drm/nouveau/include/nvkm/core/module.h  |  3 ---
>>   drivers/gpu/drm/nouveau/nouveau_drm.c               | 10 +---------
>>   drivers/gpu/drm/nouveau/nvkm/core/driver.c          |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/core/gpuobj.c          |  2 ++
>>   drivers/gpu/drm/nouveau/nvkm/core/mm.c              |  4 ++++
>>   drivers/gpu/drm/nouveau/nvkm/device/acpi.c          |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/engine/gr/base.c       |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/module.c               |  8 ++++++--
>>   drivers/gpu/drm/nouveau/nvkm/subdev/bios/init.c     |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/bios/pll.c      |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/fb/base.c       |  3 +++
>>   drivers/gpu/drm/nouveau/nvkm/subdev/gpio/base.c     |  3 +++
>>   drivers/gpu/drm/nouveau/nvkm/subdev/i2c/base.c      |  2 ++
>>   drivers/gpu/drm/nouveau/nvkm/subdev/i2c/bus.c       |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/iccsense/base.c |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/therm/base.c    |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/therm/fan.c     |  1 +
>>   drivers/gpu/drm/nouveau/nvkm/subdev/volt/base.c     |  1 +
>>   19 files changed, 33 insertions(+), 16 deletions(-)
>>
>> -- 
>> 2.44.0
>>

