[virglrenderer-devel] vulkan + virgl ioctl vs command submission

Frank Yang lfy at google.com
Fri Feb 28 21:01:16 UTC 2020


On Fri, Feb 28, 2020 at 12:27 PM Chia-I Wu <olvaffe at gmail.com> wrote:

> On Fri, Feb 28, 2020 at 11:23 AM Frank Yang <lfy at google.com> wrote:
> >
> >
> >
> > On Fri, Feb 28, 2020 at 11:07 AM Chia-I Wu <olvaffe at gmail.com> wrote:
> >>
> >> On Thu, Feb 27, 2020 at 5:37 PM Dave Airlie <airlied at gmail.com> wrote:
> >> >
> >> > On Fri, 28 Feb 2020 at 08:07, Chia-I Wu <olvaffe at gmail.com> wrote:
> >> > >
> >> > > On Thu, Feb 27, 2020 at 11:45 AM Dave Airlie <airlied at gmail.com> wrote:
> >> > > >
> >> > > > Realised you might not be reading the list, or I asked too
> >> > > > hard a question :-P
> >> > > Sorry that I missed this.
> >> > > >
> >> > > > On Tue, 25 Feb 2020 at 12:59, Dave Airlie <airlied at gmail.com> wrote:
> >> > > > >
> >> > > > > Okay I think I'm following along with the multiprocess model
> >> > > > > and the object id stuff, and I'm mostly coming around to the
> >> > > > > ideas presented.
> >> > > > >
> >> > > > > One question I have is how we envisage the userspace vulkan
> >> > > > > driver using things.
> >> > > > >
> >> > > > > I kinda feel I'm missing the difference between APIs that
> >> > > > > access things on the CPU side and commands for accessing
> >> > > > > things on the GPU side in the proposal. In the gallium world
> >> > > > > the "screen" allocates resources (memory + properties)
> >> > > > > synchronously on the API being accessed; the context is then
> >> > > > > for operating on GPU-side things, where we batch up a command
> >> > > > > stream and it is processed async.
> >> > > > >
> >> > > > > From the Vulkan API POV the application API is multi-thread
> >> > > > > safe, and we should avoid, if we can, taking too many locks
> >> > > > > under the covers, esp in common paths. Vulkan applications
> >> > > > > are also encouraged to allocate memory in large chunks and
> >> > > > > subdivide them between resources.
> >> > > > >
> >> > > > > I'm concerned that we are thinking of batching allocations in
> >> > > > > the userspace driver (or in the kernel) and how to flush those
> >> > > > > to the host side etc. If two threads in userspace allocate
> >> > > > > memory from the vulkan API, and one then does a transfer into
> >> > > > > the memory, how do we envisage that being flushed to the host
> >> > > > > side? Like if I allocate memory in one thread, then create
> >> > > > > images from that memory in another, how does that work out?
> >> > > > >
> >> > >
> >> > > The goal of encoding vkAllocateMemory in the execbuffer command
> >> > > stream is not batching.  It is to reuse the mechanism to send
> >> > > API-specific opaque alloc commands to the host, and to allow
> >> > > allocations without resources (e.g., non-shareable allocations
> >> > > from a non-mappable heap do not need resources).
> >> > >
> >> > > In the current (but outdated) code[1], there is a per-VkInstance
> >> > > execbuffer command stream struct (struct vn_cs).  Encoding to that
> >> > > vn_cs requires a per-instance lock to be taken.  There is also a
> >> > > per-VkCommandBuffer vn_cs.  Encoding to that vn_cs requires no
> >> > > locking.  Multi-threading is only beneficial when the app uses the
> >> > > latter to build its VkCommandBuffers.
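
(Interjecting with a minimal sketch of how I picture those two vn_cs
flavors; the struct and field names below are my guesses for
illustration, not the actual venus code.)

  #include <pthread.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct vn_cs {
      uint8_t *buf;    /* encoded commands */
      size_t size;
      size_t used;
  };

  struct vn_instance {
      /* any thread may encode to the per-instance stream, so a lock
         is required */
      pthread_mutex_t cs_mutex;
      struct vn_cs cs;
  };

  struct vn_command_buffer {
      /* VkCommandBuffer is externally synchronized by the spec, so
         encoding here needs no lock */
      struct vn_cs cs;
  };
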
> >> >
> >> > Imma gonna stop you there :-P. Multithreaded Vulkan apps are the
> >> > normal use case, not a special case. We do not design Vulkan things
> >> > around GL application ideas; Vulkan is different, and multi-threaded
> >> > command buffer building is basic Vulkan.
> >> That is what the current code looks like.  It is very naive, and my
> >> focus at the time was the vk.xml parser.  I don't know if anyone has
> >> ever looked into the locking design (or command submission or sync
> >> primitives) more seriously.  This can be a good chance to work out a
> >> design.
> >>
> >>
> >> >
> >> > Having a per-instance lock is bad if it's being taken across multiple
> >> > threads in normal use cases.
> >> >
> >> > Though it's quite likely due to VM design we have to take a lock at
> >> > some point on those paths, it would be good to be explicit in the
> >> > design about the impacts of every lock. We will likely need locks in
> >> > the kernel submission paths anyway.
> >>
> >> The current design essentially looks at the first parameter (the
> >> dispatchable object) of a function, and if it is not externally synced
> >> and the function needs to be executed by the host, a cs lock is
> >> grabbed to encode the function.   We can add cs to more dispatchable
> >> objects.  But I think we are looking for ways to handle (or batch)
> >> functions locally to minimize locking.
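
So the rule could look roughly like this, building on the structs I
sketched earlier; vn_cs_append is a hypothetical helper that grows the
buffer and memcpys the encoded command in.

  void vn_cs_append(struct vn_cs *cs, const void *cmd, size_t cmd_size);

  /* dispatched on VkInstance/VkDevice: not externally synchronized,
     so take the per-instance lock to encode */
  static void vn_encode_locked(struct vn_instance *inst,
                               const void *cmd, size_t cmd_size)
  {
      pthread_mutex_lock(&inst->cs_mutex);
      vn_cs_append(&inst->cs, cmd, cmd_size);
      pthread_mutex_unlock(&inst->cs_mutex);
  }

  /* dispatched on VkCommandBuffer: externally synchronized by the
     app, so encode lock-free */
  static void vn_encode_cmd_buf(struct vn_command_buffer *cb,
                                const void *cmd, size_t cmd_size)
  {
      vn_cs_append(&cb->cs, cmd, cmd_size);
  }
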
> >>
> >> One idea is that, say given this sequence
> >>
> >>   {vkCreateImage, vkBindImageMemory, vkCmdCopyImage }
> >>
> >
> > Android Emulator Vulkan does something similar to this in certain
> > cases, like translating guest vkCreateImage requests to APIs that
> > extract requirements along with the image:
> >
> > https://android.googlesource.com/platform/external/qemu/+/refs/heads/emu-master-dev/android/android-emugl/host/libs/libOpenglRender/vulkan-registry/xml/vk.xml#6351
> >
> > However, this opens up the possibility of a lot of grungy manual
> > work. The solution that I'm going for long term is to automatically
> > optimize the command protocol itself via something similar to PGO.
> Hm, I think it is fine to manually code functions outside of vkCmd*.
> They vary too much from one another to be generated.
>
> The proposed idea makes the driver more like a real driver.  When an
> object is created, the driver builds the HW descriptor (the serialized
> function call) and embeds it in the object in system RAM.  Only when
> the HW needs the descriptor does the driver emit it to the HW.  That's
> the view from the guest driver.  To the host, the guest driver appears
> to send reordered Vulkan calls.
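
If I read that right, the object-embedded descriptor might look like
this (all names hypothetical, building on the earlier sketch):

  struct vn_image {
      uint64_t object_id;
      uint8_t desc[256];   /* serialized vkCreateImage et al. */
      size_t desc_size;
      bool emitted;        /* already copied into the instance cs? */
  };

  /* emit the descriptor to the "HW" (the instance cs) only when the
     host actually needs to see the object */
  static void vn_image_emit_if_needed(struct vn_instance *inst,
                                      struct vn_image *img)
  {
      if (img->emitted)
          return;
      vn_encode_locked(inst, img->desc, img->desc_size);
      img->emitted = true;
  }
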
>
> The question becomes how the driver minimizes emits (locking, copying
> descriptors into the CS, and flushing) while making sure the
> reordering is legit.  I guess that is what Dave wanted to know from
> the questions he asked, to which I do not have an answer.  There are
> also some cases, such as vkMapMemory or vkWaitForFences, that we must
> or want to handle in the guest.
>
>
The information I'm thinking of would cover reorderings, since it is
the information needed to codegen a valid + optimal protocol for all
API calls; it would need to know when parameters and struct fields are
created versus used versus destroyed, and when they can be destroyed
early. It would also specify/abstract what should be handled on which
side.
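
As a very rough sketch of the kind of per-API metadata I mean (invented
for illustration; no such format exists yet):

  enum vn_lifetime { VN_CREATED, VN_USED, VN_DESTROYED };
  enum vn_side { VN_GUEST_ONLY, VN_HOST_ONLY, VN_BOTH };

  struct vn_api_annotation {
      const char *name;           /* e.g. "vkCreateImage" */
      enum vn_side handled_on;    /* which side services the call */
      enum vn_lifetime params[8]; /* one entry per parameter/field */
  };

The codegen would walk these annotations to decide which calls can be
encoded lock-free, batched, reordered, or resolved entirely in the
guest.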

>
> >>
> >> Instead of grabbing the per-instance (or per-device) lock twice to
> >> encode the first two functions separately, we can encode the first
> >> two functions lock-free into per-image storage first, and copy the
> >> contents into the cs at the last minute.  vkCmdCopyImage is only
> >> shown as an example.  We need to make sure the host sees the first
> >> two functions before it sees vkCmdCopyImage.  That does not mean
> >> vkCmdCopyImage itself triggers the copying and flushing.
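
Continuing my sketch from above, recording that sequence might then
look like the following, where encode_vkCmdCopyImage stands in for a
generated encoder:

  /* hypothetical generated encoder; returns the encoded size */
  size_t encode_vkCmdCopyImage(uint8_t *out, size_t max,
                               const struct vn_image *src,
                               const struct vn_image *dst);

  void record_copy(struct vn_instance *inst,
                   struct vn_command_buffer *cb,
                   struct vn_image *src, struct vn_image *dst)
  {
      uint8_t cmd[64];
      size_t cmd_size;

      /* the host must decode vkCreateImage/vkBindImageMemory for both
         images before it decodes vkCmdCopyImage */
      vn_image_emit_if_needed(inst, src);
      vn_image_emit_if_needed(inst, dst);

      /* lock-free encode into the command buffer; nothing is flushed
         to the host here */
      cmd_size = encode_vkCmdCopyImage(cmd, sizeof(cmd), src, dst);
      vn_cs_append(&cb->cs, cmd, cmd_size);
  }
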
> >>
> >> There are also cases where things can be handled entirely inside the
> >> guest.  When a VkDeviceMemory has a guest shmem, vkMapMemory can be
> >> guest-only, for example.
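
A guest-only vkMapMemory could then be as simple as this (hypothetical;
assumes the shmem was mmap()ed at allocation time and uses
<vulkan/vulkan.h> for the result codes):

  #include <vulkan/vulkan.h>

  struct vn_device_memory {
      void *shmem_ptr;   /* guest shmem mapping, or NULL if host-only */
      uint64_t size;
  };

  VkResult vn_MapMemory(struct vn_device_memory *mem, uint64_t offset,
                        void **ppData)
  {
      if (mem->shmem_ptr) {
          /* no round trip to the host at all */
          *ppData = (char *)mem->shmem_ptr + offset;
          return VK_SUCCESS;
      }
      /* otherwise the call must be encoded and flushed (not shown) */
      return VK_ERROR_MEMORY_MAP_FAILED;
  }
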
> >>
> >>
> >> >
> >> > > But vkAllocateMemory can be changed to use a local vn_cs or a local
> >> > > template to be lock-free.  It will be like
> >> > >
> >> > >   mem->object_id = next_object_id();
> >> > >
> >> > >   local_cmd_templ[ALLOCATION_SIZE] = info->allocationSize;
> >> > >   local_cmd_templ[MEMORY_TYPE_INDEX] = info->memoryTypeIndex;
> >> > >   local_cmd_templ[OBJECT_ID] = mem->object_id;
> >> > >
> >> > >   // when a resource is needed;  otherwise, use EXECBUFFER instead
> >> > >   struct drm_virtgpu_resource_create_blob args = {
> >> > >     .size = info->allocationSize,
> >> > >     .flags = VIRTGPU_RESOURCE_FLAG_STORAGE_HOSTMEM,
> >> > >     .cmd_size = sizeof(local_cmd_templ),
> >> > >     .cmd = local_cmd_templ,
> >> > >     .object_id = mem->object_id
> >> > >   };
> >> > >   drmIoctl(fd, DRM_IOCTL_VIRTIO_GPU_RESOURCE_CREATE_BLOB, &args);
> >> > >
> >> > >   mem->resource_id = args.res_handle;
> >> > >   mem->bo = args.bo_handle;
> >> > >
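
For the "use EXECBUFFER instead" case mentioned in the comment above, I
imagine the same serialized command would go through the existing
execbuffer ioctl when no resource (and no guest mapping) is needed; a
sketch against the upstream virtgpu uAPI, error handling omitted:

  #include <stdint.h>
  #include <xf86drm.h>
  #include <virtgpu_drm.h>

  static int send_alloc_cmd(int fd, const void *cmd, uint32_t cmd_size)
  {
      struct drm_virtgpu_execbuffer args = {
          .flags = 0,
          .size = cmd_size,
          .command = (uintptr_t)cmd,
          /* no bo_handles: the allocation is host-only */
      };
      return drmIoctl(fd, DRM_IOCTL_VIRTGPU_EXECBUFFER, &args);
  }
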
> >> > > I think Gurchetan's proposal will look similar, except that the
> >> > > command stream will be replaced by something more flexible such that
> >> > > object id is optional.
> >> > >
> >> > > In the current design (v2), the host will
> >> > >
> >> > >  - allocate a VkDeviceMemory from the app's VkInstance
> >> >
> >> > VkDeviceMemory is tied to the VkDevice object, not VkInstance,
> >> > though this makes sense either way.
> >>
> >> Yeah, it is tied to VkDevice.  I had one-instance-per-process model in
> >> mind and wanted to show the export/import part.
> >>
> >> >
> >> > Okay, I'm not entirely comfortable with this design yet; I probably
> >> > need to look at the code that's been done so far to get a better
> >> > feel for it.
> >> Concern over resource allocation or the userspace driver?  I hope it
> >> is mostly the latter...
> >>
> >> >
> >> > With the instance_vn_cs, who flushes those to the host, and how is
> >> > that decided?
> >> The guest encodes functions in the order they are called (excluding
> >> vkCmd*).  Flushes happen in vkGet*, vk*Wait*, vkAllocateMemory,
> >> vkQueueSubmit, vkEndCommandBuffer, and maybe some more.  I don't
> >> think the exact flush points are meaningful though.
> >>
> >>
> >> >
> >> > Dave.
>