[virglrenderer-devel] A bit of performance analysis

Mon Sep 10 10:16:12 UTC 2018

Hi Dave, 

I hope you don't mind that I'm adding the list again.

On Saturday, 08.09.2018 at 08:08 +1000, Dave Airlie wrote:

> So vtest with fd passing is always going to be horrible, since it
> does a full swrast rendering to display stuff, which means it reads
> back the image using readpixels and sends it to the X server using
> PutImage. I have a branch that uses dma-buf passing for the display
> texture, it had some other issues but it mostly worked.
Indeed - I thought it would simply draw to the host display and didn't
do any perf analysis to check, but yeah, now I see that on r600 the virgl
test server is spending 13% in readpixels and is also waiting a lot.

So it would seem that on Intel (i.e. with shared graphics memory)
readpixels performs way better.

[...]

> > 
> > On the guest side things look a bit different. Here, for the valley
> > benchmark, more than 33% of the time is spent in and below
> > entry_SYSCALL_64, mostly initiated by mesa map_buffer_range
> > (glMapBufferRange) / unmap_buffer:
> > 
> >  32.12% entry_SYSCALL_64
> >     - 31.96% do_syscall_64
> >       - 23.46% __x64_sys_ioctl
> >         - 23.35% ksys_ioctl
> >           - 22.35% do_vfs_ioctl
> >             - 21.89% drm_ioctl
> >               - 20.40% drm_ioctl_kernel
> >                 + 7.47% virtio_gpu_wait_ioctl
> >                 + 5.73% virtio_gpu_transfer_to_host_ioctl
> >                 + 4.58% virtio_gpu_transfer_from_host_ioctl
> >                   1.63% virtio_gpu_execbuffer_ioctl
> >       + 5.06% __x64_sys_nanosleep
> >       + 2.35% __x64_sys_futex
> 
> Yeah, waiting on mapping for previous execution to complete;
> coherent and persistent might help here, depending on the app.
I was wondering whether, for write-only access, one could avoid the whole
mapping here and just send the data (at least as long as we don't have
coherent or persistent memory that would make it ).
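
To make the question above a bit more concrete, here is a toy sketch in
Python of the decision I have in mind. It is only a model under stated
assumptions, not what Mesa or the virtio-gpu kernel driver actually do:
the names Access, MapPlan and plan_map are hypothetical, and the fast path
assumes the application overwrites the whole mapped range, so the old
contents never need to be read back.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Access(Flag):
    """Hypothetical access flags, loosely modelled on glMapBufferRange bits."""
    READ = auto()
    WRITE = auto()
    PERSISTENT = auto()
    COHERENT = auto()


@dataclass(frozen=True)
class MapPlan:
    """Steps a map/unmap pair is assumed to trigger in this toy model."""
    wait: bool                # block until the host is done with the buffer
    transfer_from_host: bool  # read the current contents back into the guest
    transfer_to_host: bool    # send the guest data to the host on unmap


def plan_map(access: Access) -> MapPlan:
    """Return the steps assumed necessary for the given access pattern."""
    long_lived = bool(access & (Access.PERSISTENT | Access.COHERENT))
    reads = bool(access & Access.READ)
    writes = bool(access & Access.WRITE)

    if writes and not reads and not long_lived:
        # Assumed fast path: stage the data in guest memory and just send
        # it on unmap, without waiting and without a readback.
        return MapPlan(wait=False, transfer_from_host=False,
                       transfer_to_host=True)

    # Conservative path: synchronise and read back before handing out the
    # mapping; upload on unmap only if the mapping was writable.
    return MapPlan(wait=True, transfer_from_host=True,
                   transfer_to_host=writes)


if __name__ == "__main__":
    for access in (Access.WRITE, Access.READ,
                   Access.WRITE | Access.PERSISTENT):
        print(access, plan_map(access))
```

In this model only the plain write-only case skips the wait and the
transfer from the host; anything that reads, or that is persistent or
coherent, stays on the conservative path.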

[...]

> > Instrumenting on Intel/Ubuntu reveals another hotspot in the guest
> > kernel's iowrite16 (self ca. 25%) that is not as prominent on the
> > AMD/Gentoo system (self ca. 3%) (VM in both cases an Ubuntu bionic
> > with the latest Ubuntu (cosmic) 4.17.0 kernel).
> 
> iowrite16 is the virtio queue kick, I think.
[...]

> > It would be interesting to know what benchmarking tools others are
> > using. From Google I heard about glbench, but I'm unable to
> > actually find it. Maybe this benchmark now uses a new name?
> 
> Yeah finding decent workloads and getting info out of them is a bit
> hard,
> 
> like comparing a game in a full Linux VM vs a game running native on
> the host is hard, as the presentation overhead is always going to be
> there, there might be ways to reduce the number of copies on that
> path, but if you have full host X stack and full guest X stack and
> compositors on both you really have to set things up to make sure you
> get apples vs apples.

I forgot to mention part of the guest config: It is X11 running
blackbox (so no compositing), but I guess what we are interested in is
the performance that the user sees, i.e. including all the presentation
overhead. 

Best, 
Gert