[virglrenderer-devel] Fwd: coherent memory access for virgl

Gurchetan Singh gurchetansingh at chromium.org
Fri Oct 12 03:52:42 UTC 2018


Oops, sent with the wrong email account.

---------- Forwarded message ---------
From: Gurchetan Singh <gurchetansingh at google.com>
Date: Thu, Oct 11, 2018 at 8:50 PM
Subject: Re: [virglrenderer-devel] coherent memory access for virgl
To: Gerd Hoffmann <kraxel at redhat.com>
Cc: <virglrenderer-devel at lists.freedesktop.org>, Tomeu Vizoso
<tomeu.vizoso at collabora.com>, <pbonzini at redhat.com>, Dave Airlie
<airlied at gmail.com>


On Thu, Oct 11, 2018 at 3:38 AM Gerd Hoffmann <kraxel at redhat.com> wrote:
>
>   Hi,
>
> > > In case of coherent buffers the bo will be mapped directly and the unmap
> > > call is not needed to make the changes visible to the gpu.
> > >
> > > Correct?
> >
> > Yes, though gbm doesn't have a way to signal coherency.  Maybe we
> > could add gbm_bo_is_coherent(bo) on the host.
>
> Hmm, how does mesa create coherent buffers if that isn't exposed by
> libgbm?

Usually, drivers have some coherent pool of memory to allocate from.
For example, most of our Intel devices have a HW mechanism (last level
cache -- see I915_PARAM_HAS_LLC) that ensures GPU/CPU coherency:

https://cgit.freedesktop.org/mesa/mesa/tree/src/mesa/drivers/dri/i965/brw_bufmgr.c#n640

gbm doesn't expose this at the API level, but any driver that exposes
the Vulkan / GL coherent bits should have a mechanism for this.  If
the virtio-gpu protocol has this type of explicitness (like Vulkan),
there's definitely room for optimization...
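
For reference, that LLC check in the brw_bufmgr code above boils down to a
single GETPARAM ioctl; roughly (untested sketch, error handling trimmed):

    #include <stdbool.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Ask the i915 kernel driver whether the device has a shared
     * last-level cache, i.e. whether CPU and GPU see each other's
     * writes without explicit flushing. */
    static bool intel_has_llc(int drm_fd)
    {
        int value = 0;
        struct drm_i915_getparam gp = {
            .param = I915_PARAM_HAS_LLC,
            .value = &value,
        };

        if (ioctl(drm_fd, DRM_IOCTL_I915_GETPARAM, &gp) != 0)
            return false;   /* be conservative on failure */
        return value != 0;
    }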

> > > So, yes, we could create a gbm_bo_map/unmap like interface at virtio
> > > protocol level, so the guest would ...
> > >
> > >    (1) MAP
> > >    (2) write to mapping
> > >    (3) UNMAP
> >
> > Yes.
> >
> > > .. instead of ...
> > >
> > >    (1) ATTACH_BACKING
> > >    (2) TRANSFER_TO_HOST
> > >    (3) DETACH_BACKING
> > >
> > > I think the guest doesn't need to know which modifiers are used on the
> > > host side then, because the host-side gbm_bo_map/gbm_bo_unmap calls will
> > > tile/detile/compress/uncompress so it'll be transparent to the guest.
> >
> > Depends on what guest userspace does -- if
> > gbm_bo_create_with_modifiers is called and the wayland guest proxy
> > needs modifiers, then we'll need to know modifiers.
>
> The guest can create a linear resource, and it will if the virtio-gpu
> kms driver doesn't advertise modifiers (besides LINEAR), even if the
> guest calls gbm_bo_create_with_modifiers(), right?


Yes.

>
>
> That doesn't prevent the host from using modifiers for the bo's
> nevertheless, correct?  That will of course need support for modifiers
> in qemu, so a scanout resource with modifiers will be displayed
> correctly.
>
> But I still don't see why the guest needs to know the modifiers.

It depends on where you want to inject KMS knowledge.  Your approach
for allocating host-optimized external memory would be:

1) gbm_bo_create() [using flags] on the guest
2) gbm_bo_create_with_modifiers() on the host.  We can convert
   GBM_BO_USE_RENDERING to the list of modifiers supported by EGL, and
   GBM_BO_USE_SCANOUT to the list of modifiers supported by KMS (rough
   sketch below).

We'll have to do something like this for Android since userspace has
no concept of modifiers.  That means host wayland will have to talk to
virglrenderer.  I'm fine with both approaches, as long as we allocate
the optimal buffer for a given scenario.
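
On the host that would look roughly like this (sketch only --
host_alloc_for_guest() is a made-up name, and the modifier list would
come from EGL for GBM_BO_USE_RENDERING or the KMS IN_FORMATS property
for GBM_BO_USE_SCANOUT):

    #include <stdint.h>
    #include <gbm.h>

    /* Allocate with explicit modifiers when we have a list for the
     * requested usage; otherwise fall back to flag-based allocation. */
    static struct gbm_bo *
    host_alloc_for_guest(struct gbm_device *gbm, uint32_t w, uint32_t h,
                         uint32_t format, uint32_t use_flags,
                         const uint64_t *mods, unsigned mod_count)
    {
        if (mod_count)
            return gbm_bo_create_with_modifiers(gbm, w, h, format,
                                                mods, mod_count);
        return gbm_bo_create(gbm, w, h, format, use_flags);
    }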

>
> > > Alternatively we could map the tiled/compressed bo as-is into the guest,
> > > then have gbm_bo_map/gbm_bo_unmap calls in the guest handle the
> > > tile/detile/compress/uncompress.  Is it possible in the first place to
> > > map the raw bo on all hardware?  Would that allow to skip the roundtrip
> > > to the host for map/unmap?
> >
> > The gbm backend on the guest is virgl.
> >
> > https://cgit.freedesktop.org/mesa/mesa/tree/src/gbm/backends/dri/gbm_dri.c#n1268
> > https://cgit.freedesktop.org/mesa/mesa/tree/include/GL/internal/dri_interface.h#n1588
> > https://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/state_trackers/dri/dri2.c#n1651
> >
> > On the host, it'll be the host DRI backend.  Unless we can somehow get
> > the host logic in the guest, we can't avoid the roundtrip for
> > non-coherent buffers.
>
> Documentation on modifiers isn't that good; Google finds me some
> articles explaining the need for them but not the inner workings.
>
> So, I've waded through the Intel driver's source code.  Seems Intel
> hardware handles this via the GTT (as I understand it, GTTs are GPU page
> tables, which apparently can not only map objects but also do various
> conversions like tiling).
>
> That is not something we can let the guest handle.  So,
> tile/detile/compress/uncompress will be handled by map/unmap on the
> host, anything else isn't going to fly.
>
> > > > We can add even more flags to DRM_VIRTGPU_RESOURCE_INFO (i.e,
> > > > TRANSFER_STRIDE_DIFFERENT) since for most host buffers map_stride ==
> > > > compressed_stride, so we can avoid vm-exits associated with
> > > > TRANSFER_FROM_HOST / TRANSFER_TO_HOST when we only need to mmap().  Or we
> > > > can extend the protocol.
> > >
> > > Hmm, that assumes we have the guest's gbm_bo_map/gbm_bo_unmap handle
> > > tile/detile/compress/uncompress, correct?
> >
> > We will.  The flow is:
> >
> > 1) Guest queries guest EGL, gets modifier.  wayland guest proxy
> > somehow gets modifier from host KMS.
> > 2) gbm_bo_create_with_modifiers -- we'll be using the virgl DRI
> > interface, which will call essentially VIRTGPU_RESOURCE_CREATE2 (which
> > passes down modifiers).
> > 3) gbm_bo_create_with_modifiers on the host.
>
> Hmm, that also doesn't answer the question why the guest needs to know
> the modifiers.  We can let the host query the supported modifiers and
> call gbm_bo_create_with_modifiers() without the guest knowing the list
> of supported modifiers.
>
> Of course we need to keep track on the host side which resource uses
> which modifier, ...
>
> > 4) guest gbm buffer is imported to guest 3D driver, which is backed by
> > a virglrenderer resource (which has been imported in host GL)
> > 5) Texturing/Rendering ensues.
> > 6) If the guest needs to read the contents of the buffer, we can do a
> > gbm_bo_map on the host and put this into the PCI bar.
> > 7) wayland guest proxy sends buffer to display, host proxy actually displays
>
> ... so the host proxy can lookup the modifier used when passing on the
> buffer to the host compositor.
>
> Also, on (6):  I'm not convinced yet that letting the guest access
> the gbm_bo_map() mapping directly via pci bar is actually a win.

The main added use case for the PCI bar (besides coherent buffers) is
actually 1D GL buffers.

We'll have usage hints (PIPE_USAGE_STREAM / PIPE_USAGE_STAGING /
PIPE_USAGE_DYNAMIC / PIPE_USAGE_DEFAULT -- unfortunately guest Mesa
doesn't pass them down yet).  For GL_STATIC_DRAW buffers, mapping
through the PCI bar would be a definite win.
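
For illustration, a plausible GL usage -> gallium hint mapping (not the
exact Mesa state tracker logic, just the general idea):

    #include <GL/gl.h>
    #include "pipe/p_defines.h"

    /* Rough idea only: map GL buffer usage onto gallium usage hints.
     * The real state tracker logic is more involved and also looks at
     * access patterns. */
    static enum pipe_resource_usage
    pipe_usage_from_gl(GLenum gl_usage)
    {
        switch (gl_usage) {
        case GL_STREAM_DRAW:
        case GL_STREAM_COPY:
            return PIPE_USAGE_STREAM;
        case GL_DYNAMIC_DRAW:
        case GL_DYNAMIC_COPY:
            return PIPE_USAGE_DYNAMIC;
        case GL_STATIC_READ:
        case GL_DYNAMIC_READ:
        case GL_STREAM_READ:
            return PIPE_USAGE_STAGING;
        case GL_STATIC_DRAW:
        case GL_STATIC_COPY:
        default:
            return PIPE_USAGE_DEFAULT;
        }
    }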

Host RGBA external textures / render targets -- the secondary possible
use case of the PCI bar -- are actually rarely mapped by guest user-space.
Android, for example, *very* rarely maps any buffers created with
usage bits
AHARDWAREBUFFER_USAGE_GPU_COLOR_OUTPUT |
AHARDWAREBUFFER_USAGE_GPU_SAMPLED_IMAGE (similar case with Chrome
ozone-drm, and I'm pretty sure vanilla Linux).

Android user-space does often map external YUV images when doing
SW-video decoding, but it specifies the
AHARDWAREBUFFER_USAGE_CPU_*_OFTEN bits.  For those cases, the best
option is guest memory or coherent memory.  But for HW-video decode
(libva on Intel decodes to Y-tiled buffers, for example), using
host-optimized memory would be better.

It depends on the situation ...

> Reason is that we have quite some overhead to establish the mapping:
>
>   (1) gbm_bo_map() on the host
>   (2) qemu updating the guest address space, host kvm updating ept tables.
>   (3) guest kernel mapping it into guest userspace.

What's the QEMU function that puts host memory into the guest physical
address space behind a PCI bar?  Is it just KVM_SET_USER_MEMORY_REGION?
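
My (possibly naive) understanding is that QEMU wraps this in its
MemoryRegion API (memory_region_init_ram_ptr() +
memory_region_add_subregion()), with something like this as the raw
mechanism underneath:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Back a range of guest physical address space (e.g. a chunk of a
     * PCI bar) with an existing host userspace mapping, such as one
     * returned by gbm_bo_map(). */
    static int map_host_mem_into_guest(int vm_fd, uint32_t slot,
                                       uint64_t guest_phys_addr,
                                       void *host_ptr, uint64_t size)
    {
        struct kvm_userspace_memory_region region = {
            .slot            = slot,
            .guest_phys_addr = guest_phys_addr,
            .memory_size     = size,
            .userspace_addr  = (uintptr_t)host_ptr,
        };

        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }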



>
>
> The same dance in reverse order when tearing down the mapping.  And this
> is not persistent, we'll have to do that every single time the guest
> wants cpu access to the resource.
>
> Except for coherent buffers of course, where we can establish this
> mapping once, then run with it as long as the resource exists.
>
>
> cheers,
>   Gerd
>

