[virglrenderer-devel] multiprocess model and GL

Chia-I Wu olvaffe at gmail.com
Mon Feb 3 20:54:41 UTC 2020


On Mon, Feb 3, 2020 at 1:53 AM Gerd Hoffmann <kraxel at redhat.com> wrote:
>
> On Fri, Jan 31, 2020 at 12:00:06PM -0800, Chia-I Wu wrote:
> > On Fri, Jan 31, 2020 at 2:41 AM Gerd Hoffmann <kraxel at redhat.com> wrote:
> > >
> > >   Hi,
> > >
> > > memory-v4 branch pushed.
> > >
> > > Went with the single-ioctl approach.  Renamed back to CREATE as we don't
> > > have separate "allocate resource id" and "initialize resource" steps any
> > > more.
> > >
> > > So, virgl/vulkan resources would be created via execbuffer, get an
> > > object id attached to them so they can be referenced, then we'll create
> > > a resource from that.  The single ioctl which will generate multiple
> > > virtio commands.
> > Does it support cmd_size==0 and object_id!=0?  That is useful for
> > cases where execbuffer and resource_create happen at different times.
>
> Not sure yet.  At the end of the day it boils down to the question of
> whether we want to allow allocation via the EXECBUFFER ioctl, then
> create resources later via a separate CREATE_BLOB ioctl.
>
> I'd tend to support one model:  Either two ioctls, or execbuffer
> included in CREATE_BLOB.  Or would it be useful for userspace to have
> both, then pick one at runtime on a case-by-case basis?
An example where the userspace driver may want two ioctls is when an
app allocates memory for tiled textures (VK_IMAGE_TILING_OPTIMAL).
Because the allocations are for tiled textures, the app never maps
them; mapping them makes no sense.  But when the allocations also come
from a mappable heap (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT), the driver
must be prepared to map them.  One way is for the driver to only
EXECBUFFER in vkAllocateMemory and to lazily RESOURCE_CREATE_BLOB in
vkMapMemory.

I think it is still one model.  There is just a shortcut that saves
userspace one ioctl call.  The same set of virtio commands gets sent
to the host whichever path userspace takes.
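
To make the two paths concrete, here is a rough userspace sketch; all
struct, field, and ioctl names below are made up for illustration,
since the memory-v4 uapi is still in flux:

  /* hypothetical sketch, not the actual memory-v4 uapi */
  struct drm_virtgpu_resource_create_blob blob = {
          .flags     = VIRTGPU_RESOURCE_FLAG_ALLOC_EXECBUFFER,
          .object_id = object_id,
  };

  /* one-ioctl path: allocate and create the resource together */
  blob.cmd      = (uintptr_t)alloc_cmds;
  blob.cmd_size = alloc_cmds_size;
  drmIoctl(fd, DRM_IOCTL_VIRTGPU_RESOURCE_CREATE_BLOB, &blob);

  /* two-ioctl path: allocate in vkAllocateMemory ... */
  struct drm_virtgpu_execbuffer exec = {
          .command = (uintptr_t)alloc_cmds,
          .size    = alloc_cmds_size,
  };
  drmIoctl(fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &exec);

  /* ... then lazily create the resource in vkMapMemory,
   * with cmd_size == 0 and object_id != 0 */
  blob.cmd      = 0;
  blob.cmd_size = 0;
  drmIoctl(fd, DRM_IOCTL_VIRTGPU_RESOURCE_CREATE_BLOB, &blob);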

>
> > > Dumb resources will be created with the same ioctl, just with the DUMB
> > > instead of the EXECBUFFER flag set.  The three execbuffer fields will be
> > > unused.
> > I think the three execbuffer fields can be in a union:
> >
> > union {
> >     struct {
> >         /* the three execbuffer fields */
> >     } execbuffer;
> >
> >     __u32 pads[16];
> > };
> >
> > The alloc type decides which of the fields, if any, is used.  This
> > gives us some leeway when a future alloc type needs something else.
>
> Also makes the interface more clear.
>
> > > To be discussed:
> > >
> > > (1) Do we want/need both VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_ALLOW and
> > >     VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_REQUIRE?
> > The host always has direct access to the guest shmem.  I can see
> > three cases where the host accesses the shmem:
> >
> >  - transfers data into and out of the guest shmem
> >  - direct access in CPU domain (CPU access or GPU access w/ userptr)
> >  - direct access in device domain (GPU access w/ udmabuf)
>
> Ok, some background is needed here I think:
>
> Guest memory is backed by memfd memory.  Guest resources are scattered
> there.  This is where udmabuf comes into play: The host can create a
> dmabuf for the scattered pages that way.
>
> The host can mmap() the dmabuf and get a linear mapping of the resource
> for cpu access.  That allows operating directly on the resources.
> virglrenderer can skip copying the iov into a linear buffer.
>
> virglrenderer could also try to import the udmabuf so the gpu can
> access it directly.
>
> For the most part this is a host-side optimization.  The only reason
> the guest has to worry about this is that with a udmabuf-based shared
> mapping the host might see guest changes without explicit TRANSFER
> command.  Which breaks mesa.
>
> So my original plan was that the guest can allow the host to use this
> optimization (SHARED_ALLOW).  Then it's up to the host to figure out
> whether it actually wants to create a udmabuf or not.  For small
> resources the mmap() overhead might not pay off.  Probably not so much
> a problem for vulkan thanks to memory pooling, but for opengl, where
> every object has its own resource, we probably want the option to
> *not* use a dmabuf.
>
> In some cases it might be useful for the guest to force the host to
> use a udmabuf; this is what SHARED_REQUIRE is for.
>
> Question is, do we want/need them both?  We could drop SHARED_ALLOW.
> In that case the guest has to decide on the mmap vs. copy performance
> tradeoff and pick SHADOW or SHARED accordingly.
I think we can drop SHARED_ALLOW.  Imagine a userspace that uses
SHADOW right now and could benefit from SHARED.  If it switches to
SHARED_ALLOW, it still needs to issue the transfers, which the host
will discard when udmabuf is supported.  But if it checks for and
switches to SHARED_REQUIRE, it does not need to issue the transfers at
all.

I also think SHARED_REQUIRE should be renamed to SHARED_CPU, meaning
that direct access is required and must be coherent with the CPU
domain.  The host can choose from

 - direct CPU access via iovec
 - direct CPU access via udmabuf with DMA_BUF_SYNC_START
 - direct GPU access via iovec or udmabuf with cache snooping

depending on where the resource is used.  Because the host access is
coherent with the CPU domain, the guest can always use a cached
mapping.
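
Together with SHADOW and SHARED_DEVICE from my earlier mail, a
strawman set of definitions (names and values purely illustrative):

  /* strawman, not the actual uapi; the storage types are mutually
   * exclusive */
  /* host accesses the shmem only in response to TRANSFER commands */
  #define VIRTGPU_RESOURCE_FLAG_STORAGE_SHADOW         1
  /* host may access the shmem directly, coherent with the CPU
   * domain; the guest kernel always maps the shmem cached */
  #define VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_CPU     2
  /* host may access the shmem directly in the device domain; the
   * guest picks a coherent (wc/wb) or incoherent (cached) mapping */
  #define VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_DEVICE  3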


>
> > VIRTGPU_RESOURCE_FLAG_STORAGE_SHADOW says the host can access the
> > shmem only in response to transfer commands.  It is not very useful
> > and can probably be removed.
>
> Well, mesa breaks if the host can see changes without explicit
> TRANSFER, so I think we will need that.
>
> > VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_CPU says the host can and must
> > access the shmem in CPU domain.  The kernel always maps the shmem
> > cached and the userspace knows it is coherent.
> >
> > VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_DEVICE says the host can and must
> > access the shmem in device domain.  The userspace can ask the kernel
> > to give it a coherent mapping or not.  For a coherent mapping, it can
> > be wc or wb depending on the platform.  For an incoherent mapping,
> > the userspace can use transfers to flush/invalidate the cpu cache.
>
> On the host side both are essentially "create and use udmabuf".  So do
> we need separate CPU/DEVICE flags here?
The host gpu driver might need to enable cache snooping
(AMDGPU_PTE_SNOOPED for AMD or BYT_PTE_SNOOPED_BY_CPU_CACHES for Atom)
when the access must be coherent with the CPU domain.  With
SHARED_CPU, the guest gets a cached and coherent mapping, but GPU
access will be slower on AMD and Atom.  With SHARED_DEVICE, the guest
chooses between a cached and a coherent mapping on said platforms, but
GPU access will not be hurt.

I think SHARED_CPU is very useful because there are cases where GPU
access is not needed (e.g., VREND_RESOURCE_STORAGE_GUEST).  As for
SHARED_DEVICE, we might not even need it initially because there is
already HOSTMEM.

If there were only SHARED_REQUIRE and GPU access were allowed, it
would be equivalent to SHARED_DEVICE.
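
Guest-side, the mapping policy per storage type would then look
roughly like this (map_cached() and map_coherent() are hypothetical
helpers):

  /* sketch of the guest userspace mapping policy */
  if (storage == VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_CPU) {
          /* always cached, always coherent; no transfers needed */
          ptr = map_cached(res);
  } else if (storage == VIRTGPU_RESOURCE_FLAG_STORAGE_SHARED_DEVICE) {
          if (want_coherent)
                  ptr = map_coherent(res); /* wc or wb, per platform */
          else
                  ptr = map_cached(res);   /* flush/invalidate via TRANSFER */
  } else {
          /* SHADOW: host sees data only via explicit TRANSFER */
          ptr = map_cached(res);
  }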

> > > (2) How to integrate gbm/gralloc allocations best?  Have a
> > >     VIRTGPU_RESOURCE_FLAG_ALLOC_GBM, then pass args in the execbuffer?
> > >     Or better have a separate RESOURCE_CREATE_GBM ioctl/command and
> > >     define everything we need in the virtio spec?
> > Instead of RESOURCE_CREATE_GBM, I would replace the three execbuffer
> > fields with a union, and add VIRTGPU_RESOURCE_FLAG_ALLOC_GBM and a new
> > field to the union.
>
> Yes, that would work too.
>
> > If we were to pass args in the execbuffer, what would be wrong with a
> > generic gpu context type that is allocation-only?
>
> Well, if you want to use the gbm-allocated resources with virgl/vulkan
> anyway, this introduces some overhead.  You would need two
> /dev/dri/card0 handles, one in generic-gpu mode for allocation and one
> in virgl/vulkan mode, then allocate stuff with one, then export +
> import into the other ...
When an app calls vkCreateInstance, the vulkan driver opens the DRI
device; when the app calls gbm_create_device, it has already opened
the DRI device itself.  There are going to be two handles either way.

What you suggested would allow a context type to support multiple
command stream formats (e.g., a vk command stream and a gbm command
stream) at the same time.  It may be a useful idea if there is a use
case, but I feel it should be supported in a form that says "context
type A supports wire formats X, Y, and Z".
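
Sketched with made-up names, that could be as simple as tagging each
execbuffer with its wire format and having each context type advertise
the set of formats it accepts:

  /* purely illustrative; no such uapi exists today */
  #define VIRTGPU_WIRE_FORMAT_VIRGL  1
  #define VIRTGPU_WIRE_FORMAT_VULKAN 2
  #define VIRTGPU_WIRE_FORMAT_GBM    3

  struct virtgpu_execbuffer_tagged {
          __u32 wire_format; /* format of the command stream in cmd */
          __u32 cmd_size;
          __u64 cmd;
  };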

> Also I think GBM and DUMB resources are quite similar.
>
> cheers,
>   Gerd
>

