[RFC 0/7] drm/virtio: Import scanout buffers from other devices
Kasireddy, Vivek
vivek.kasireddy at intel.com
Tue Jun 18 07:49:18 UTC 2024
Hi Gurchetan,
>
> On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek
> <vivek.kasireddy at intel.com> wrote:
>
>
> Hi Gurchetan,
>
> >
> > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> > <vivek.kasireddy at intel.com> wrote:
> >
> >
> > Hi,
> >
> > Sorry, my previous reply got messed up as a result of HTML
> > formatting. This is
> > a plain text version of the same reply.
> >
> > >
> > >
> > > Having virtio-gpu import scanout buffers (via prime) from other
> > > devices means that we'd be adding a head to headless GPUs assigned
> > > to a Guest VM or additional heads to regular GPU devices that are
> > > passthrough'd to the Guest. In these cases, the Guest compositor
> > > can render into the scanout buffer using a primary GPU and has the
> > > secondary GPU (virtio-gpu) import it for display purposes.
> > >
> > > The main advantage with this is that the imported scanout buffer can
> > > either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> > > or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> > > Note that since Qemu uses the udmabuf driver, there would be no
> > > copies made of the scanout buffer as it is displayed. This should
> > > be possible even when it might reside in device memory such as VRAM.
> > >
> > > The specific use-case that can be supported with this series is
> > > when running Weston or other guest compositors with the
> > > "additional-devices" feature
> > > (./weston --drm-device=card1 --additional-devices=card0).
> > > More info about this feature can be found at:
> > > https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
> > >
> > > In the above scenario, card1 could be a dGPU or an iGPU and card0
> > > would be virtio-gpu in KMS-only mode. However, the case where this
> > > patch series could be particularly useful is when card1 is a GPU VF
> > > that needs to share its scanout buffer (in a zero-copy way) with
> > > the GPU PF on the Host. Or, it can also be useful when the scanout
> > > buffer needs to be shared between any two GPU devices (assuming one
> > > of them is assigned to a Guest VM) as long as they are P2P DMA
> > > compatible.
> > >
> > >
> > >
> > > Is passthrough iGPU-only or passthrough dGPU-only something you
> > > intend to use?
> > Our main use-case involves passing through a headless dGPU VF device
> > and sharing the Guest compositor's scanout buffer with the dGPU PF
> > device on the Host. Same goal for headless iGPU VF to iGPU PF device
> > as well.
> >
> >
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> > partitioned into the VF and PF, but the PF handles host-side display
> > integration and rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned into one PF device and multiple VF devices. Only the PF
> device has access to the display hardware and can do KMS (on the Host).
> The VF devices are headless with no access to display hardware (they
> cannot do KMS but can do render/encode/decode) and are generally
> assigned (or passthrough'd) to the Guest VMs.
> Some more details about this model can be found here:
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdeczko at intel.com/
>
> >
> >
> > However, using a combination of iGPU and dGPU where either of them
> > can be passthrough'd to the Guest is something I think can be
> > supported with this patch series as well.
> >
> > >
> > > If it's a dGPU + iGPU setup, then the way other people seem to do
> > > it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and
> > > pass-through the dGPU.
> > >
> > > For example, AMD seems to use virgl to allocate and import into
> > > the dGPU.
> > >
> > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> > > https://lore.kernel.org/all/20231221100016.4022353-1-julia.zhang at amd.com/
> > >
> > > ChromeOS also uses that method (see crrev.com/c/3764931)
> > > [cc: dGPU architect +Dominik Behr <dbehr at google.com>]
> > >
> > > So if iGPU + dGPU is the primary use case, you should be able to
> > > use these methods as well. The model would be "virtualized iGPU" +
> > > passthrough dGPU, not split SoCs.
> > In our use-case, the goal is to have only one primary GPU
> > (passthrough'd iGPU/dGPU) do all the rendering (using native DRI
> > drivers) for clients/compositor and all the outputs, and share the
> > scanout buffers with the secondary GPU (virtio-gpu). Since this is
> > mostly how Mutter (and also Weston) work in a multi-GPU setup, I am
> > not sure if virgl is needed.
> >
> >
> >
> > I think you can probably use virgl with the PF and others probably
> > will, but supporting multiple methods in Linux is not unheard of.
> In our case, we have an alternative SR-IOV based GPU
> virtualization/partitioning model (as described above) where a Guest
> VM will have access to a hardware-accelerated GPU VF device for its
> rendering/encode/decode needs. So, in this situation, using virgl
> will become redundant and unnecessary.
>
> And, in this model, we intend to use virtio-gpu for KMS in the Guest
> VM (since the GPU VF device cannot do KMS) with the addition of this
> patchset. However, note that, since not all GPU SKUs/versions have the
> SR-IOV capability, we plan on using virgl in those cases where it
> becomes necessary.
>
> >
> > Does your patchset need the Mesa kmsro patchset to function
> > correctly?
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
> > This patchset is an alternative proposal. So, KMSRO would not be
> > needed. AFAICS, the above MR is mainly stalled because KMSRO uses
> > dumb buffers, which are not suitable for hardware-based rendering in
> > all cases. And, KMSRO is not really helpful performance-wise with
> > dGPUs, as it forces most buffers to be allocated from system memory.
>
> Previously, it was recommended when exploring VDMABUF: "the
> /important/ thing is that the driver which exports the dma-buf (and
> thus handles the mappings) must be aware of the virtualization so it
> can properly coordinate things with the host side."
>
> https://patchwork.kernel.org/project/linux-media/patch/20210203073517.1908882-3-vivek.kasireddy at intel.com/#23975915
Yeah, I remember that. It would be nice to have Gerd weigh in again because
this design is slightly different: it uses DMA addresses (and not pages) to
make it possible to share buffers allocated by the Guest in both system memory
and VRAM. And, as mentioned earlier, with the Qemu GTK UI (since it uses EGL),
we blit the dmabuf to a Host-allocated buffer managed by EGL.
But I do recognize that if we were to share the dmabuf directly with the Host
compositor (like you do with CrosVM + Wayland UI), then yes, the Guest
compositor (or probably the exporter) needs to be aware of this virtualization.
>
> So that's why the KMSRO approach was tried (virtio-gpu dumb
> allocations, not i915). But as you point out, nobody uses dumb buffers
> for hardware-based rendering.
Right, we have been using KMSRO + dumb buffers all this while, but as we
started working more with dGPUs we realized that they are not great
performance-wise. So, one of my goals with this series (and also the VFIO
patch series) is to ensure that (with one dGPU VF assigned to the Guest) the
Guest-allocated framebuffer stays in VRAM (and does not get migrated to the
Guest's system memory) as it is shared with the dGPU PF on the Host.
>
> So, if you are going with i915 allocates + virtio-gpu imports, it should be fine
> if you fixed all the issues with i915 allocates + VDMABUF imports. It seems
> your fixes add complexity in VFIO and other places,
For iGPUs, only this series is needed; the VFIO patch series is needed if
we are doing dGPU to dGPU buffer sharing (P2P DMA). And, it appears, the
VFIO patches are needed for RDMA/SPDK use-cases anyway.
> but having virtio-gpu 3d +
> virgl allocate adds complexity to Mesa-based allocation paths (i915, amdgpu
> would all have to open virtio-gpu render node, and pick a context type etc.).
>
> I would just do virtio-gpu allocates, since it only requires user-space patches
> and no extra ioctls, but that reflects my preferences.
This patch series is also not adding any extra ioctls; and no uapi changes are needed.
The only big change that needs to happen is on the Guest compositor side but both
Mutter and Weston already have the concept of primary and secondary GPUs when
dealing with multi-GPU scenarios, so they are mostly ok.
> If the mm/VFIO/QEMU
> people are fine with your approach, I see nothing wrong with merging it.
Sima has said they are OK with the design, but they specifically asked people working
on virtio-gpu to weigh in.
>
> The one caveat is if someone uses a non-GTK/EGL host path, we'll have to pin
> memory for the lifetime of the import, since knowing RESOURCE_FLUSH is
Yeah, as discussed earlier, non-EGL host paths would probably need additional
signaling mechanism to achieve full frame-rate, when sharing the dmabuf directly
with the Host compositor.
> done is not sufficient. But if you're only using it, it shouldn't be an issue right
> now.
To ensure that I am not regressing virgl or other use-cases, I am currently limiting
this feature to only cases where blob=true and virgl=false for now. I'll send a v2 soon
with some fixes related to cleanup of imported BOs.
Thanks,
Vivek
>
>
>
> >
> >
> > If so, I would try to get that reviewed first to meet DRM requirements
> > (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements).
> > You might explicitly call out the design decision you're making: ("We
> > can probably use virgl as the virtualized iGPU via PF, but that adds
> > unnecessary complexity b/c ______").
> As I described above, what we have is an alternative GPU virtualization
> scheme where virgl is not necessary if the SR-IOV capability is
> available. And, as mentioned earlier, I have tested this series with
> Mutter/Gnome-shell (upstream master) (plus one small patch:
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745)
> and no other changes to any other userspace components on Host and Guest.
>
> >
> >
> > And, doing it this way means that no other userspace components
> > need to be modified on both the Guest and the Host.
> >
> > >
> > >
> > >
> > > As part of the import, the virtio-gpu driver shares the dma
> > > addresses and lengths with Qemu, which then determines whether the
> > > memory region they belong to is owned by a PCI device or whether
> > > it is part of the Guest's system ram. If it is the former, it
> > > identifies the devid (or bdf) and bar and provides this info
> > > (along with offsets and sizes) to the udmabuf driver. In the
> > > latter case, instead of the devid and bar it provides the memfd.
> > > The udmabuf driver then creates a dmabuf using this info that
> > > Qemu shares with Spice for encode via Gstreamer.
> > >
> > > Note that the virtio-gpu driver registers a move_notify() callback
> > > to track location changes associated with the scanout buffer and
> > > sends attach/detach backing cmds to Qemu when appropriate. And,
> > > synchronization (that is, ensuring that Guest and Host are not
> > > using the scanout buffer at the same time) is ensured by
> > > pinning/unpinning the dmabuf as part of plane update and using a
> > > fence in the resource_flush cmd.
> > >
> > >
> > > I'm not sure how QEMU's display paths work, but with crosvm, if
> > > you share the guest-created dmabuf with the display and the guest
> > > moves the backing pages, the only recourse is to destroy the
> > > surface and show a black screen to the user: not the best
> > > experience.
> > Since the Qemu GTK UI uses EGL, there is a blit done from the
> > Guest's scanout buffer onto an EGL-backed buffer on the Host. So,
> > this problem would not happen as of now.
> >
> >
> >
> > The guest kernel doesn't know you're using the QEMU GTK UI + EGL
> > host-side.
> So, with blob=true, there is a dma fence in resource_flush() that gets
> associated with the Blit/Encode on the Host. This guest dma fence
> should eventually be signalled, but only when the Host is done using
> the Guest's scanout buffer.
>
> >
> > If somebody wants to use the virtio-gpu import mechanism with
> > lower-level Wayland-based display integration, then the problem
> > would occur.
> Right, one way to address this issue is to prevent the Guest
> compositor from reusing the scanout buffer (until the Host is done)
> and forcing it to pick a new buffer (since Mesa GBM allows 4
> backbuffers).
> I have tried this experiment with KMSRO and a Wayland-based Qemu UI
> previously on iGPUs (and Weston) and noticed that the Guest FPS was
> getting halved:
> https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-vivek.kasireddy at intel.com/
>
> and also discussed and proposed a solution which did not go anywhere:
> https://lore.kernel.org/dri-devel/20210913233529.3194401-1-vivek.kasireddy at intel.com/
>
> >
> > Perhaps, do that just to be safe unless you have performance
> > concerns.
> If you meant pinning the imported scanout buffer in the Guest, then
> yes, that is something I am already doing in this patchset.
>
> >
> >
> > >
> > > Only amdgpu calls dma_buf_move_notify(..), and you're probably
> > > testing on Intel only, so you may not be hitting that code path
> > > anyways.
> > I have tested with the Xe driver in the Guest, which also calls
> > dma_buf_move_notify(). But note that for dGPUs, both Xe and amdgpu
> > migrate the scanout buffer from vram to system memory as part of
> > export, because virtio-gpu is not P2P compatible. However, I am
> > hoping to relax this (the P2P check against virtio-gpu) in the Xe
> > driver if it detects that it is running in VF mode, once the
> > following patch series is merged:
> > https://lore.kernel.org/dri-devel/20240422063602.3690124-1-vivek.kasireddy at intel.com/
> >
> > > I forgot the exact reason, but apparently udmabuf may not work
> > > with amdgpu displays, and it seems the virtualized iGPU + dGPU is
> > > the way to go for amdgpu anyways.
> > I would really like to know why udmabuf would not work with amdgpu?
> >
> >
> >
> > It's just a rumor I heard, but the idea is udmabuf would be imported
> > into AMDGPU_GEM_DOMAIN_CPU only.
> >
> > https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> >
> > "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible.
> > Memory in this pool could be swapped out to disk if there is
> > pressure."
> >
> > https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> >
> >
> > Perhaps that limitation is artificial and unnecessary, and it may
> > indeed work. I don't think anybody has tried...
> Since the udmabuf driver properly pins the backing pages (from the
> memfd) for DMA, I don't see any reason why amdgpu would not be able
> to import.
>
> Thanks,
> Vivek
>
> >
> >
> >
> >
> > > So I recommend just pinning the buffer for the lifetime of the
> > > import for simplicity and correctness.
> > Yeah, in this patch series, the dmabuf is indeed pinned, but only
> > for a short duration in the Guest, just until the Host is done
> > using it (blit or encode).
> >
> > Thanks,
> > Vivek
> >
> > >
> > >
> > > This series is available at:
> > > https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc
> > >
> > > along with additional patches for Qemu and Spice here:
> > > https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
> > > https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4
> > >
> > > Patchset overview:
> > >
> > > Patch 1: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > > Patch 2-3: Helpers to initialize, import, free imported object
> > > Patch 4-5: Import and use buffers from other devices for scanout
> > > Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA
> > >
> > > This series is tested using the following method:
> > > - Run Qemu with the following relevant options:
> > >   qemu-system-x86_64 -m 4096m ....
> > >     -device vfio-pci,host=0000:03:00.0
> > >     -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
> > >     -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
> > >     -object memory-backend-memfd,id=mem1,size=4096M
> > >     -machine memory-backend=mem1 ...
> > > - Run upstream Weston with the following options in the Guest VM:
> > >   ./weston --drm-device=card1 --additional-devices=card0
> > >
> > > where card1 is a DG2 dGPU (passthrough'd and using the xe driver
> > > in the Guest VM), card0 is virtio-gpu, and the Host is using an
> > > RPL iGPU.
> > >
> > > Cc: Gerd Hoffmann <kraxel at redhat.com>
> > > Cc: Dongwon Kim <dongwon.kim at intel.com>
> > > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > > Cc: Christian Koenig <christian.koenig at amd.com>
> > > Cc: Dmitry Osipenko <dmitry.osipenko at collabora.com>
> > > Cc: Rob Clark <robdclark at chromium.org>
> > > Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
> > > Cc: Oded Gabbay <ogabbay at kernel.org>
> > > Cc: Michal Wajdeczko <michal.wajdeczko at intel.com>
> > > Cc: Michael Tretter <m.tretter at pengutronix.de>
> > >
> > > Vivek Kasireddy (7):
> > >   drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> > >   drm/virtio: Add a helper to map and note the dma addrs and lengths
> > >   drm/virtio: Add helpers to initialize and free the imported object
> > >   drm/virtio: Import prime buffers from other devices as guest blobs
> > >   drm/virtio: Ensure that bo's backing store is valid while updating plane
> > >   udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar regions
> > >   udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
> > >
> > >  drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> > >  drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> > >  drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> > >  drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
> > >  drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> > >  include/uapi/linux/udmabuf.h           |  11 +-
> > >  6 files changed, 368 insertions(+), 11 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> > >
> >
> >
>
>