[virglrenderer-devel] coherent memory access for virgl
Gurchetan Singh
gurchetansingh at chromium.org
Wed Sep 26 01:28:55 UTC 2018
On Tue, Sep 25, 2018 at 2:10 AM Gerd Hoffmann <kraxel at redhat.com> wrote:
>
> Hi,
>
> > > Who will do the actual allocations? I expect we need new virglrenderer
> > > functions for that?
> >
> > The decision to back memory via iovecs or host memory is up to the
> > VMM.
>
> What exactly do you mean by "via iovecs"?  The current way to allocate
> resources? They are guest-allocated and the iovecs passed to
> virglrenderer point into guest memory. So that clearly is *not* in the
> hands of the VMM. Or do you mean something else?
I guess I'm just brainstorming about one-copy with virgl and how we
want to implement it.
The simplest case that can be improved with host memory is:
(1) Guest app maps the buffer (glMapBufferRange --> virgl_buffer_transfer_map).
(i) If the buffer is not marked clean (texture buffers, SSBOs)
this will trigger a TRANSFER_FROM_HOST_3D (copies++)
(2) The guest app copies the buffer data (unavoidable - copies++)
(3) Guest unmaps the buffer, triggering a TRANSFER_TO_HOST (copies++).
For host GL buffers, the copies are done in
{vrend_renderer_transfer_write_iov, vrend_renderer_transfer_send_iov}.
If there are N iovecs backing the guest resource, we will have N
copies (see vrend_write_to_iovec, vrend_read_from_iovec).
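The per-iovec copy amounts to something like the following -- a
sketch, not the actual vrend_write_to_iovec code, but it shows why N
iovecs mean N small memcpys:

  #include <string.h>
  #include <sys/uio.h>

  /* Sketch: copy 'len' bytes at 'offset' of a linear host buffer into
   * a guest resource scattered across 'niov' iovecs.  The real code
   * is vrend_write_to_iovec()/vrend_read_from_iovec(). */
  static void write_to_iovecs(const struct iovec *iov, int niov,
                              size_t offset, const char *src, size_t len)
  {
      for (int i = 0; i < niov && len; i++) {
          if (offset >= iov[i].iov_len) {
              offset -= iov[i].iov_len;
              continue;
          }
          size_t chunk = iov[i].iov_len - offset;
          if (chunk > len)
              chunk = len;
          memcpy((char *)iov[i].iov_base + offset, src, chunk);
          src += chunk;
          len -= chunk;
          offset = 0;
      }
  }
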
udmabuf could be helpful, since it bundles up the iovecs, turning
the N small copies into one big copy. udmabuf could also completely
eliminate some copies for textures. Right now, for most textures,
virglrenderer copies the iovecs into a temporary buffer (see
read_transfer_data) and then calls glTexSubImage2D*. Just mmapping
the udmabuf and calling glTexSubImage2D* is a definite win.
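Concretely, the texture path with udmabuf would look roughly like
this on the host side (a sketch; it assumes the dma-buf covering the
whole resource already exists and the texture is a plain 2D RGBA one):

  #include <sys/mman.h>
  #include <epoxy/gl.h>

  /* Sketch: with a udmabuf covering the whole guest resource we can
   * mmap it once and feed the pixels straight to GL, instead of
   * linearizing N iovecs into a temporary buffer (read_transfer_data)
   * first and then uploading. */
  static void upload_from_udmabuf(int dmabuf_fd, size_t size,
                                  GLuint tex, int width, int height)
  {
      void *ptr = mmap(NULL, size, PROT_READ, MAP_SHARED, dmabuf_fd, 0);
      if (ptr == MAP_FAILED)
          return;

      glBindTexture(GL_TEXTURE_2D, tex);
      glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                      GL_RGBA, GL_UNSIGNED_BYTE, ptr);

      munmap(ptr, size);
  }
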
But making host memory guest-visible will bring the worst-case
buffer copies down from 3 to 1. For textures, if we start counting
when the GPU buffer gets detiled, there are currently 5 copies, 3
with udmabuf, and 1 with host-exposed memory.
>
>
> To make sure we all are on the same page wrt. resource allocation, the
> workflow we have now looks like this:
>
> (1) guest virtio-gpu driver allocates resource. Uses normal (guest) ram.
> Resources can be scattered.
> (2) guest driver creates resources (RESOURCE_CREATE_*).
> (3) qemu (virgl=off) or virglrenderer (virgl=on) creates host resource.
> virglrenderer might use a different format (tiling, ...).
> (4) guest sets up backing storage (RESOURCE_ATTACH_BACKING).
> (5) qemu creates an iovec for the guest resource.
> (6) guest writes data to resource.
> (7) guest requests a transfer (TRANSFER_TO_HOST_*).
> (8) qemu or virglrenderer copy data from guest resource to
> host resource, possibly converting (again tiling, ...).
> (9) guest can use the resource now ...
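Yes, that matches my understanding. As a concrete reference for step
(2), the guest driver fills in a RESOURCE_CREATE_3D command roughly
like this (field values are illustrative and le32 conversion is
omitted, so treat it as a sketch rather than real driver code):

  #include <stdint.h>
  #include <string.h>
  #include <linux/virtio_gpu.h>

  /* Sketch of step (2): guest driver building the RESOURCE_CREATE_3D
   * command for a 2D texture.  Endian conversion, fencing and the
   * actual virtqueue submission are omitted. */
  static void fill_resource_create_3d(struct virtio_gpu_resource_create_3d *cmd,
                                      uint32_t resource_id,
                                      uint32_t width, uint32_t height)
  {
      memset(cmd, 0, sizeof(*cmd));
      cmd->hdr.type = VIRTIO_GPU_CMD_RESOURCE_CREATE_3D;
      cmd->resource_id = resource_id;
      cmd->target = 2;        /* PIPE_TEXTURE_2D */
      cmd->format = 1;        /* e.g. PIPE_FORMAT_B8G8R8A8_UNORM */
      cmd->width = width;
      cmd->height = height;
      cmd->depth = 1;
      cmd->array_size = 1;
  }
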
>
>
> One thing I'm prototyping right now is zerocopy resources, the workflow
> changes to look like this:
>
> (2) guest additionally sets a flag to request a zerocopy buffer.
> (3) not needed (well, the bookkeeping part of it is still needed, but
> it would *not* allocate a host resource).
> (5) qemu additionally creates a host dma-buf for the guest resource
> using the udmabuf driver.
> (7+8) not needed.
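For (5), creating the host dma-buf for a scattered guest resource
with the udmabuf driver would look roughly like this -- a sketch
against the udmabuf uapi, assuming guest RAM is memfd-backed as the
driver requires:

  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/udmabuf.h>

  /* Sketch: build one host dma-buf covering a scattered guest
   * resource.  Each (offset, size) pair in 'chunks' describes a
   * page-aligned piece of the memfd that backs guest RAM. */
  static int create_udmabuf(int memfd,
                            const struct udmabuf_create_item *chunks,
                            unsigned int nchunks)
  {
      struct udmabuf_create_list *list;
      int devfd, buffd;

      list = calloc(1, sizeof(*list) + nchunks * sizeof(list->list[0]));
      if (!list)
          return -1;
      list->count = nchunks;
      for (unsigned int i = 0; i < nchunks; i++) {
          list->list[i] = chunks[i];
          list->list[i].memfd = memfd;
      }

      devfd = open("/dev/udmabuf", O_RDWR);
      if (devfd < 0) {
          free(list);
          return -1;
      }
      buffd = ioctl(devfd, UDMABUF_CREATE_LIST, list);

      close(devfd);
      free(list);
      return buffd;   /* dma-buf fd, or -1 on error */
  }
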
>
> Right now I have (not tested yet) code to handle dumb buffers.
> Interfacing to guest userspace (virtio-gpu driver ioctls) is not
> there yet. Interfacing with virglrenderer isn't there yet either.
>
> I expect that doesn't solve the coherent mapping issue. The host gpu
> could import the dma-buf of the resource, but as it has no control over
> the allocation it might not be able to use it without copying.
>
>
> I'm not sure what the API for coherent resources should look like.
> One option I see is yet another resource flag, so the workflow would
> change like this (with virgl=on only ...):
>
> (2) guest additionally sets a flag to request a coherent resource.
> (3) virglrenderer would create a coherent host resource.
> (4) guest finds some address space in the (new) pci bar and asks
> for the resource being mapped there (new command needed for
> this).
> (5) qemu maps the coherent resource into the pci bar.
> (7+8) not needed.
>
> Probably works for the GL_MAP_COHERENT_BIT use case. Dunno about vulkan.
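On the virglrenderer side, (3) for a GL buffer would presumably boil
down to ARB_buffer_storage plus a persistent map, something like the
sketch below (assumes a desktop GL 4.4 host driver; error handling
omitted):

  #include <epoxy/gl.h>

  /* Sketch: allocate a host GL buffer that can stay persistently and
   * coherently mapped, and return the mapping so the VMM can place it
   * in the (new) pci bar. */
  static void *create_coherent_buffer(GLsizeiptr size, GLuint *buf_out)
  {
      const GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT |
                               GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
      GLuint buf;

      glGenBuffers(1, &buf);
      glBindBuffer(GL_ARRAY_BUFFER, buf);
      glBufferStorage(GL_ARRAY_BUFFER, size, NULL, flags);

      *buf_out = buf;
      return glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);
  }
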
>
> Interfaces to guest userspace and virglrenderer likewise need updates
> to support this.
>
>
> > A related question: are we going to also expose host memory to the
> > guest for the non-{GL_MAP_COHERENT_BIT,
> > VK_MEMORY_PROPERTY_HOST_COHERENT_BIT} cases?
>
> The guest should be able to do that, yes. In case both coherent and
> zerocopy resources are supported by the host it can even pick.
>
> coherent resources will be limited though (pci bar size, also because
> we don't want to allow guests to allocate unlimited host memory for security
> reasons), so using them for everything is probably not a good idea.
The 64-bit BAR should be enough, especially if it's managed
intelligently. Vulkan support may take some time, and I don't think
stacks with host GLES drivers support GL_MAP_COHERENT_BIT, so there
will be cases where that space goes unused.
Here's one possible flow:
i) virtio_gpu_resource_create_coherent -- for strictly coherent needs
(i.e., no unmap needed)
ii) virtio_gpu_resource_create_3d -- may or may not be host backed
(depends on the PCI bar size, platform-specific information -- guest
doesn't need to know)
The guest would still issue the transfer ioctls for the
virtio_gpu_resource_create_3d resource, but the work performed would
be pared down when backed by host memory.
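As a strawman for how guest userspace could request (i), the existing
resource-create ioctl could grow a flag; VIRTGPU_RES_FLAG_COHERENT
below is hypothetical and just stands in for whatever we end up
adding, while the kernel/VMM keeps the host-backed-or-not decision to
itself:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <libdrm/virtgpu_drm.h>

  #define VIRTGPU_RES_FLAG_COHERENT (1u << 0)   /* hypothetical flag */

  /* Sketch: create a virtio-gpu resource, optionally asking for a
   * coherent (host-backed, persistently mapped) one.  Whether the
   * non-coherent path ends up host-backed is not visible here. */
  static int create_resource(int drm_fd, uint32_t width, uint32_t height,
                             int want_coherent, uint32_t *res_handle)
  {
      struct drm_virtgpu_resource_create args = {
          .target = 2,          /* PIPE_TEXTURE_2D */
          .format = 1,          /* e.g. PIPE_FORMAT_B8G8R8A8_UNORM */
          .width = width,
          .height = height,
          .depth = 1,
          .array_size = 1,
          .flags = want_coherent ? VIRTGPU_RES_FLAG_COHERENT : 0,
      };
      int ret = ioctl(drm_fd, DRM_IOCTL_VIRTGPU_RESOURCE_CREATE, &args);

      if (ret == 0)
          *res_handle = args.res_handle;
      return ret;
  }
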
This will require increased VMM <--> virglrenderer interop. Maybe
put it behind a flag that QEMU doesn't set, but cros_vm will. WDYT?
>
> cheers,
> Gerd
>