[virglrenderer-devel] A bit of performance analysis

Dave Airlie airlied at gmail.com
Mon Sep 10 19:00:09 UTC 2018


On Mon, 10 Sep 2018 at 20:17, Gert Wollny <gert.wollny at collabora.com> wrote:
>
> Hi Dave,
>
> I hope you don't mind that I add the list again.

(oops gmail did it, totally meant to be on list).
>
> Am Samstag, den 08.09.2018, 08:08 +1000 schrieb Dave Airlie:
>
> > So vtest with fd passing is always going to be horrible, since it
> > does a full swrast rendering to display stuff, which means it reads
> > back the image using readpixels and sends it to the X server using
> > PutImage. I have a branch that uses dma-buf passing for the display
> > texture, it had some other issues but it mostly worked.
> Indeed - I thought it would simply draw to the host display and didn't
> do any perf analysis to check, but yeah, now I see that on r600 the virgl
> test server is spending 13% in readpixels and is also waiting a lot.
>
> So it would seem that on Intel (i.e. with shared graphics memory)
> readpixels performs way better.

Yup, I'm sure readpixels could be improved on r600 in some cases,
but I think making vtest better is a better investment of time. At least
with the dma-buf fd passing things were a lot more even.
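
For reference, that readback path amounts to roughly the following; this
is a minimal sketch under my assumptions, not the actual vtest code, and
present_frame and the pixel format are purely illustrative:

    #include <GL/gl.h>
    #include <X11/Xlib.h>
    #include <X11/Xutil.h>
    #include <stdlib.h>

    static void present_frame(Display *dpy, Window win, GC gc,
                              int width, int height)
    {
        /* Read the rendered frame back into system memory; this is the
         * readpixels cost in the r600 profile, and it is much cheaper on
         * Intel where graphics memory is shared with the CPU. */
        char *pixels = malloc((size_t)width * height * 4);
        glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, pixels);

        /* Hand the pixels to the X server as a ZPixmap image (PutImage). */
        XImage *img = XCreateImage(dpy, DefaultVisual(dpy, DefaultScreen(dpy)),
                                   24, ZPixmap, 0, pixels,
                                   width, height, 32, 0);
        XPutImage(dpy, win, gc, img, 0, 0, 0, 0, width, height);
        XDestroyImage(img); /* also frees the pixel buffer */
    }

With dma-buf fd passing for the display texture, both the readback and the
PutImage copy go away, which is why that branch was a lot more even.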

> > > On the guest side things look a bit different. Here for the Valley
> > > benchmark more than 33% of the time is spent in and below
> > > entry_SYSCALL_64, mostly initiated by Mesa's map_buffer_range
> > > (glMapBufferRange) / unmap_buffer:
> > >
> > >  32.12% entry_SYSCALL_64
> > >     - 31.96% do_syscall_64
> > >       - 23.46%  __x64_sys_ioctl
> > >         - 23.35% ksys_ioctl
> > >           - 22.35% do_vfs_ioctl
> > >             - 21.89% drm_ioctl
> > >               - 20.40% drm_ioctl_kernel
> > >                 + 7.47% virtio_gpu_wait_ioctl
> > >                 + 5.73% virtio_gpu_transfer_to_host_ioctl
> > >                 + 4.58% virtio_gpu_transfer_from_host_ioctl
> > >                   1.63% virtio_gpu_execbuffer_ioctl
> > >       + 5.06% __x64_sys_nanosleep
> > >       + 2.35% __x64_sys_futex
> >
> > Yeah, waiting on the mapping for previous execution to complete;
> > coherent and persistent mappings might help here, depending on the app.
> I was wondering whether for write-only access one could avoid the whole
> mapping here and just send the data (at least as long as we don't have
> coherent or persistent memory that would make this unnecessary).

The waiting on mapping is a GL-level expectation; we should already
avoid waiting in cases where it isn't required
(though you'd want to get some backtraces in userspace from the waits
to confirm that).
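
To make that concrete, here is a minimal guest-side sketch (illustrative
only, not Mesa or virglrenderer code; the loader and buffer are assumptions)
of the glMapBufferRange flags that decide whether the map has to wait for
previous GPU use of the buffer; without such hints the driver may have to
fence, which is what shows up as virtio_gpu_wait_ioctl in the profile above:

    /* Assumes a GL function loader such as libepoxy; "vbo" is a
     * hypothetical streaming vertex buffer owned by the caller. */
    #include <epoxy/gl.h>
    #include <string.h>

    void upload_streaming(GLuint vbo, const void *data, GLsizeiptr size)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        /* WRITE + INVALIDATE_BUFFER lets the driver orphan the old storage
         * instead of waiting for earlier GPU use of it.  Apps that do their
         * own fencing can add GL_MAP_UNSYNCHRONIZED_BIT, and with
         * glBufferStorage a persistent/coherent mapping avoids the per-frame
         * map/unmap round trips entirely. */
        void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                                     GL_MAP_WRITE_BIT |
                                     GL_MAP_INVALIDATE_BUFFER_BIT);
        memcpy(ptr, data, size);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }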

> I forgot to mention part of the guest config: It is X11 running
> blackbox (so no compositing), but I guess what we are interested in is
> the performance that the user sees, i.e. including all the presentation
> overhead.

Oh totally, but it just means that comparing the host vs the guest isn't
as simple as a 1:1 execution of the same app; lots of things could affect
the end result.

One area I know we will do badly in is anything with queries and waiting.
At the moment, due to the lack of any select mechanism for GL sync objects,
we rely on polling sync objects and queries from qemu; this happens
from a timer that could probably be tuned, or we could create a GL extension
that gives us an fd to add to select/poll when things occur so we can block
on it.
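
For illustration, the timer-driven polling amounts to roughly this (a
sketch, not the actual qemu/virglrenderer code; poll_fence is a made-up
name): with no fd to hand to select/poll, the only way to check a GLsync
without blocking is a zero-timeout glClientWaitSync fired from a timer.

    #include <epoxy/gl.h>
    #include <stdbool.h>

    /* Called from a periodic timer callback; returns true once the fence
     * has signalled so the timer can be disarmed and the guest notified. */
    bool poll_fence(GLsync fence)
    {
        GLenum r = glClientWaitSync(fence, 0 /* flags */, 0 /* timeout */);
        return r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED;
    }

An fd-based mechanism would instead let qemu add that fd to its main-loop
poll set and wake up exactly when the fence signals, rather than ticking
a timer.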

Dave.

