[virglrenderer-devel] vulkan + virgl ioctl vs command submission

Fri Feb 28 19:22:50 UTC 2020

On Fri, Feb 28, 2020 at 11:07 AM Chia-I Wu <olvaffe at gmail.com> wrote:

> On Thu, Feb 27, 2020 at 5:37 PM Dave Airlie <airlied at gmail.com> wrote:
> >
> > On Fri, 28 Feb 2020 at 08:07, Chia-I Wu <olvaffe at gmail.com> wrote:
> > >
> > > On Thu, Feb 27, 2020 at 11:45 AM Dave Airlie <airlied at gmail.com> wrote:
> > > >
> > > > Realised you might not be reading the list, or I asked too hard a
> > > > question :-P
> > > Sorry that I missed this.
> > > >
> > > > On Tue, 25 Feb 2020 at 12:59, Dave Airlie <airlied at gmail.com> wrote:
> > > > >
> > > > > Okay I think I'm following along the multiprocess model, and the
> > > > > object id stuff, and I'm mostly coming around to the ideas presented.
> > > > >
> > > > > One question I have is how do we envisage the userspace vulkan
> > > > > driver using things.
> > > > >
> > > > > I kinda feel I'm missing the difference between APIs that access
> > > > > things on the CPU side and commands for accessing things on the GPU
> > > > > side in the proposal. In the gallium world the "screen" allocates
> > > > > resources (memory + properties) synchronously on the API being
> > > > > accessed; the context is then for operating on GPU side things where
> > > > > we batch up a command stream and it is processed async.
> > > > >
> > > > > From the Vulkan API POV the application API is multi-thread safe,
> > > > > and we should avoid if we can taking too many locks under the covers,
> > > > > esp in common paths. Vulkan applications are also encouraged to
> > > > > allocate memory in large chunks and subdivide them between resources.
> > > > >
> > > > > I'm concerned that we are thinking of batching allocations in the
> > > > > userspace driver (or in the kernel) and how to flush those to the
> > > > > host side etc. If we have two threads in userspace allocate memory
> > > > > from the vulkan API, and one then does a transfer into the memory,
> > > > > how do we envisage that being flushed to the host side? Like if I
> > > > > allocate memory in one thread, then create images from that memory
> > > > > in another, how does that work out?
> > > > >
> > >
> > > The goal of encoding vkAllocateMemory in the execbuffer command stream
> > > is not batching.  It is to reuse the mechanism to send an
> > > API-specific opaque alloc command to the host, and to allow
> > > allocations without resources (e.g., non-shareable allocations from a
> > > non-mappable heap do not need resources).
> > >
> > > In the current (but outdated) code[1], there is a per-VkInstance
> > > execbuffer command stream struct (struct vn_cs).  Encoding to the
> > > vn_cs requires a per-instance lock to be taken.  There is also a
> > > per-VkCommandBuffer vn_cs.  Encoding to that vn_cs requires no
> > > locking.  Multithreading is only beneficial when the app uses it
> > > to build its VkCommandBuffers.
> >
> > Imma gonna stop you there :-P, multithread vulkan apps are the normal
> > use case, not a special case. We do not design any vulkan things for
> > GL application ideas, Vulkan is different, multi-threaded command
> > buffer building is basic vulkan.
> That is what the current code looks like.  It is very naive, and my
> focus was also on a vk.xml parser.  I don't know if anyone has ever
> looked into the locking design (or command submission or sync
> primitives) more seriously.  This can be a good chance to work out a
> design.
>
>
> >
> > Having a per-instance lock is bad if it's being taken across multiple
> > threads in normal use cases.
> >
> > Though it's quite likely due to VM design we have to take a lock at
> > some point on those paths, it would be good to be explicit in the
> > design of the impacts of every lock. Like we will likely need locks in
> > the kernel submission paths anyways.
>
> The current design essentially looks at the first parameter (the
> dispatchable object) of a function, and if it is not externally synced
> and the function needs to be executed by the host, a cs lock is
> grabbed to encode the function.   We can add cs to more dispatchable
> objects.  But I think we are looking for ways to handle (or batch)
> functions locally to minimize locking.
>
> One idea is that, say given this sequence
>
>   {vkCreateImage, vkBindImageMemory, vkCmdCopyImage }
>
>
Android Emulator Vulkan does something similar to this in certain cases,
like translating guest vkCreateImage requests to APIs that extract
requirements along with the image:

https://android.googlesource.com/platform/external/qemu/+/refs/heads/emu-master-dev/android/android-emugl/host/libs/libOpenglRender/vulkan-registry/xml/vk.xml#6351

However, this opens up the possibility of a lot of grungy manual work. The
solution that I'm going for long term is to automatically optimize the
command protocol itself via something similar to profile-guided
optimization (PGO).

> Instead of grabbing the per-instance (or per-device) lock twice to
> encode the first two functions separately, we can encode the first two
> functions lock-free to a per-image storage first, and copy the contents
> into the cs at the last minute.  vkCmdCopyImage is only shown as an
> example.  We need to make sure the host sees the first two functions
> before it sees vkCmdCopyImage.  It does not mean that vkCmdCopyImage
> triggers the copying and flushing.
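
For what it's worth, a rough sketch of how I read the per-image staging
idea. Every name here (cs_sketch, image_sketch, image_stage, image_flush)
is made up for illustration and is not the vn_cs API; it also assumes an
image is only touched by one thread until it is flushed, and it uses fixed
buffer sizes to stay short.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types; fixed sizes keep the sketch short. */
enum { STAGE_WORDS = 64, CS_WORDS = 1024 };

struct cs_sketch {               /* shared, per-instance or per-device */
    pthread_mutex_t lock;
    uint32_t words[CS_WORDS];
    size_t count;
};

struct image_sketch {            /* private to the thread that owns the image */
    uint32_t staged[STAGE_WORDS];
    size_t staged_count;
};

/* Lock-free: only touches the image's private staging storage. */
static int image_stage(struct image_sketch *img, const uint32_t *cmd, size_t n)
{
    assert(img != NULL && cmd != NULL);
    if (n > STAGE_WORDS - img->staged_count)
        return -1;
    memcpy(img->staged + img->staged_count, cmd, n * sizeof(*cmd));
    img->staged_count += n;
    return 0;
}

/* One lock acquisition copies everything staged so far into the shared
 * cs, so the host sees the staged commands before anything encoded later. */
static int image_flush(struct image_sketch *img, struct cs_sketch *cs)
{
    int ret = 0;

    assert(img != NULL && cs != NULL);
    pthread_mutex_lock(&cs->lock);
    if (img->staged_count > CS_WORDS - cs->count) {
        ret = -1;
    } else {
        memcpy(cs->words + cs->count, img->staged,
               img->staged_count * sizeof(img->staged[0]));
        cs->count += img->staged_count;
        img->staged_count = 0;
    }
    pthread_mutex_unlock(&cs->lock);
    return ret;
}
```

With this shape, vkCreateImage and vkBindImageMemory would each call
image_stage(), and image_flush() would run once before anything that depends
on the image reaches the host, so the shared lock is taken once instead of
twice.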
>
> There are also cases where things can be handled inside the guest.
> When a VkDeviceMemory has a guest shmem, vkMapMemory can be guest-only
> for example.
>
>
> >
> > > But vkAllocateMemory can be changed to use a local vn_cs or a local
> > > template to be lock-free.  It will be like
> > >
> > >   mem->object_id = next_object_id();
> > >
> > >   local_cmd_templ[ALLOCATION_SIZE] = info->allocationSize;
> > >   local_cmd_templ[MEMORY_TYPE_INDEX] = info->memoryTypeIndex;
> > >   local_cmd_templ[OBJECT_ID] = mem->object_id;
> > >
> > >   // when a resource is needed;  otherwise, use EXECBUFFER instead
> > >   struct drm_virtgpu_resource_create_blob args = {
> > >     .size = info->allocationSize,
> > >     .flags = VIRTGPU_RESOURCE_FLAG_STORAGE_HOSTMEM,
> > >     .cmd_size = sizeof(local_cmd_templ),
> > >     .cmd = local_cmd_templ,
> > >     .object_id = mem->object_id
> > >   };
> > >   drmIoctl(fd, DRM_IOCTL_VIRTIO_GPU_RESOURCE_CREATE_BLOB, &args);
> > >
> > >   mem->resource_id = args.res_handle;
> > >   mem->bo = args.bo_handle;
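
One note on the snippet above: for it to be lock-free, next_object_id()
has to be lock-free too. A minimal sketch, assuming object ids are plain
64-bit integers unique within the process and that 0 is reserved to mean
"no object" (both are my assumptions, not something the proposal states):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Starts at 1 because this sketch reserves 0 for "no object". */
static _Atomic uint64_t object_id_counter = 1;

/* Safe to call from any thread without a lock. Relaxed ordering is
 * enough because only uniqueness matters here, not ordering against
 * other memory accesses. */
static uint64_t next_object_id(void)
{
    uint64_t id = atomic_fetch_add_explicit(&object_id_counter, 1,
                                            memory_order_relaxed);
    assert(id != 0); /* a 64-bit counter should not wrap in practice */
    return id;
}
```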
> > >
> > > I think Gurchetan's proposal will look similar, except that the
> > > command stream will be replaced by something more flexible such that
> > > object id is optional.
> > >
> > > In the current design (v2), the host will
> > >
> > >  - allocate a VkDeviceMemory from the app's VkInstance
> >
> > VkDeviceMemory is tied to the VkDevice object, not VkInstance, though
> > this makes sense either way.
>
> Yeah, it is tied to VkDevice.  I had one-instance-per-process model in
> mind and wanted to show the export/import part.
>
> >
> > Okay I'm not entirely comfortable with this design yet, I probably
> > need to look at the code that's been done so far to get a better
> > feeling for it.
> Concern over resource allocation or the userspace driver?  I hope it
> is mostly the latter...
>
> >
> > With the instance_vn_cs, who flushes those to the host, how is that
> > decided?
> The guest encodes functions in the order they are called (excluding
> vkCmd*).  Flushes happen in vkGet*, vk*Wait*, vkAllocateMemory,
> vkQueueSubmit, vkEndCommandBuffer, and maybe some more.  I don't think
> they are meaningful though.
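
To make sure I read that list right, here is a sketch of the rule as a
predicate on the entry point name. The function names are hypothetical and
it covers only the entry points listed above (not the "maybe some more");
it is not how the current code decides anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of any existing driver. */
static bool has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Returns true if calling `name` would flush the per-instance cs,
 * following only the list given in the mail. */
static bool flushes_instance_cs(const char *name)
{
    assert(name != NULL);

    /* vkCmd* is encoded to the per-VkCommandBuffer cs and is excluded,
     * even when the name contains "Wait" (e.g. vkCmdWaitEvents). */
    if (has_prefix(name, "vkCmd"))
        return false;
    if (has_prefix(name, "vkGet"))
        return true;
    if (has_prefix(name, "vk") && strstr(name, "Wait") != NULL)
        return true;
    return strcmp(name, "vkAllocateMemory") == 0 ||
           strcmp(name, "vkQueueSubmit") == 0 ||
           strcmp(name, "vkEndCommandBuffer") == 0;
}
```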
>
>
> >
> > Dave.