Remote display with 3D acceleration using Wayland/Weston

Pekka Paalanen ppaalanen at gmail.com
Fri Dec 16 09:06:51 UTC 2016


On Thu, 15 Dec 2016 09:55:44 -0600
DRC <dcommander at users.sourceforge.net> wrote:

> On 12/15/16 3:01 AM, Pekka Paalanen wrote:
> > I assure you, this is a limitation of the RDP-backend itself. Nothing
> > outside of Weston creates this restriction.
> > 
> > The current RDP-backend is written to set up and use only the Pixman
> > renderer. The Pixman renderer is a software renderer and will not
> > initialize EGL in the compositor. Therefore no support for
> > hardware-accelerated OpenGL gets advertised to clients, and clients
> > fall back to software GL.
> > 
> > You can fix this purely by modifying the libweston/compositor-rdp.c
> > file, adding support for initializing the GL-renderer. Then you get
> > hardware-accelerated GL support for all Wayland clients without any
> > other modifications anywhere.
> > 
> > The reason that has not been done already is that it was thought that
> > having clients use hardware OpenGL while the compositor does not could
> > never be performant enough to justify the effort. Also, it pulls in a
> > dependency on the EGL and GL libraries, which are huge. Obviously your
> > use case is different and this rationale does not apply.
> 
> Like many things, it depends on the application.  GLXgears may not
> perform better in a hardware-accelerated remote 3D environment vs. using
> software OpenGL, but real-world applications with larger geometries
> certainly will.  In a VirtualGL environment, the overhead is per-frame
> rather than per-primitive, so geometric throughput is essentially as
> fast as it would be in the local case (the OpenGL applications are still
> using direct rendering.)  The main performance limiters are pixel
> readback and transmission.  Modern GPUs have pretty fast readback--
> 800-1000 Mpixels/sec in the case of a mid-range Quadro, for instance, if
> you use synchronous readback.  VirtualGL uses PBO readback, which is a
> bit slower than synchronous readback but which uses practically zero CPU
> cycles and does not block at the driver level (this is what enables many
> users to share the same GPU without conflict.)  VGL also uses a frame
> queueing/spoiling system to send the 3D frames from the rendering thread
> into another thread for transmission and/or display, so it can be
> displaying or transmitting the last frame while the application renders
> the next frame.  TurboVNC (and most other X proxies that people use with
> VGL) is based on libjpeg-turbo, which can compress JPEG images at
> hundreds of Mpixels/sec on modern CPUs.  In total, you can pretty easily
> push 60+ Megapixels/sec with perceptually lossless image quality to
> clients on even a 100 Megabit network, and 20 Megapixels/sec across a 10
> Megabit network (with reduced quality.)  Our biggest success stories are
> large companies who have replaced their 3D workstation infrastructure
> with 8 or 10 beefy servers running VirtualGL+TurboVNC with laptop
> clients running the TurboVNC Viewer.  In most cases, they claim that the
> perceived performance is as good as or better than their old workstations.
> 
> To put some numbers on this, our GLXspheres benchmark uses a geometry
> size that is relatively small (~60,000 polygons) but still a lot more
> realistic than GLXgears (which has a polygon count only in the hundreds,
> if I recall correctly.)  When running on a 1920x1200 remote display
> session (TurboVNC), this benchmark will perform at about 14 Hz with
> llvmpipe but 43 Hz with VirtualGL.  So software OpenGL definitely does
> slow things down, even with a relatively modest geometry size and in an
> environment where there is a lot of per-frame overhead.

Hi,

indeed, those are use cases I (we?) have not thought about. Our
thinking has largely revolved around the idea that reading back a
buffer from graphics memory into system memory is prohibitively slow.
And in many cases it is, but not when the goal is remoting.

Another thought was: if the clients can use hardware GL, why would the
compositor not use hardware paths all the way to scanout? So that case
has been largely ignored.

It is very interesting to hear about the numbers!
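
For readers who have not seen the technique, here is a minimal,
untested sketch of PBO-based asynchronous readback in plain desktop
OpenGL. It is not VirtualGL code, only an illustration of the idea;
GL function loading, error handling and buffer sizing are assumed to
be handled elsewhere.

/* Sketch of asynchronous readback through a pixel buffer object.
 * Illustrative only, not VirtualGL code. Assumes a current desktop
 * GL context and a bound framebuffer of the given size. */
#include <GL/glew.h>	/* or any other GL function loader */
#include <string.h>

static GLuint pbo;

static void readback_init(int width, int height)
{
	glGenBuffers(1, &pbo);
	glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
	glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4,
		     NULL, GL_STREAM_READ);
	glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Kick off the readback; with a PBO bound, glReadPixels returns
 * without waiting for the copy to complete. */
static void readback_start(int width, int height)
{
	glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
	glReadPixels(0, 0, width, height,
		     GL_BGRA, GL_UNSIGNED_BYTE, NULL);
	glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

/* Later (e.g. while the next frame is being rendered), map the PBO
 * and copy the pixels into system memory for compression. */
static void readback_finish(void *dst, int width, int height)
{
	void *src;

	glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
	src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
	if (src) {
		memcpy(dst, src, width * height * 4);
		glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
	}
	glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}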

> > The hardest part in adding the support to the RDP-backend is
> > implementing buffer content access efficiently. RDP requires pixel
> > data in system memory so the CPU can read it, but the GL-renderer has
> > all pixel data in graphics memory, which often cannot be read directly
> > by the CPU. Accessing that pixel data requires a copy (glReadPixels);
> > there is nowadays a helper, weston_surface_copy_content(), however the
> > function is not efficient and is so far meant only for debugging and
> > testing.
> 
> I could probably reuse some of the VirtualGL code for this, since it
> already does a good job of buffer management.
> 
> Thanks so much for all of the helpful info.  I guess I have my work cut
> out for me.  :|

I should probably explain a little more, because what I described above
is a simplification that assumes a single path for all buffer types.

Currently we have essentially three possible buffer types underlying a
wl_buffer protocol object: wl_shm, EGL-based, and wp_linux_dmabuf.
Proprietary graphics stacks may add their own proprietary types, too,
but they should usually opt for the EGL path, because anything else
would require explicit support to be added to every compositor and
toolkit.

- wl_shm buffers are pieces of system memory that you can always mmap
  for fast CPU access. Weston's GL-renderer does that and uses
  glTexImage2D to create a GL texture from the contents. If you then
  need the pixels in the CPU domain again, uploading to GL was a
  mistake in the first place. Instead, one would want to keep the
  buffer around for as long as the compositor needs it. Weston's
  Pixman-renderer does keep it, but the GL-renderer releases the
  buffer as soon as its contents have been copied to GL (unless a
  backend sets a flag telling it to keep the buffer). A rough sketch
  of this upload path and the EGL import path follows after the list.

- EGL-based buffers are totally opaque. Normally you import one into
  EGL to get an EGLImage, which you then bind as a GL texture. From
  there you have to work with the GL texture and read its contents
  back any way you can. Alternatively, it might be possible to import
  the buffer into GBM and export it as a dmabuf, to be used like the
  wp_linux_dmabuf type.

- wp_linux_dmabuf buffers are dmabufs: 1-4 dmabuf file descriptors
  plus metadata. There is no single guaranteed way to access the pixel
  contents of such buffers. Some dmabufs can be mmapped into the CPU
  address space, but if the producer was a GPU, this usually works
  only with integrated graphics. You can also attempt to import
  dmabufs into EGL, to get an EGLImage and use a GL texture. Because
  nothing is guaranteed per se, creating wp_linux_dmabuf buffers
  involves a step where the compositor has to confirm that the buffer
  will actually work before the client is allowed to use it. Weston
  implements only the EGL path (and direct DRM/KMS scanout) for
  dmabufs.
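
To make the first two cases more concrete, here is a rough sketch of
the wl_shm upload path and the EGL import path, roughly what a
GL-renderer does with an attached buffer. This is not actual Weston
code; error handling, stride handling and extension function loading
are left out, and the tightly packed 32-bit RGBA format is only an
assumption for brevity.

#include <wayland-server.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* wl_shm path: the buffer is mmapped system memory, so the contents
 * can simply be uploaded with glTexImage2D. */
static void
import_shm(struct wl_resource *buffer_resource, GLuint tex)
{
	struct wl_shm_buffer *shm = wl_shm_buffer_get(buffer_resource);
	int width = wl_shm_buffer_get_width(shm);
	int height = wl_shm_buffer_get_height(shm);

	wl_shm_buffer_begin_access(shm);
	glBindTexture(GL_TEXTURE_2D, tex);
	/* Assumes a tightly packed 32-bit format; a real renderer
	 * checks wl_shm_buffer_get_format() and the stride. */
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
		     GL_RGBA, GL_UNSIGNED_BYTE,
		     wl_shm_buffer_get_data(shm));
	wl_shm_buffer_end_access(shm);
}

/* EGL path: the buffer is opaque, so it is imported as an EGLImage
 * (EGL_WL_bind_wayland_display) and bound as a texture
 * (GL_OES_EGL_image). The pixels never enter the CPU domain. */
static void
import_egl(EGLDisplay dpy, struct wl_resource *buffer_resource,
	   GLuint tex,
	   PFNEGLCREATEIMAGEKHRPROC create_image,
	   PFNGLEGLIMAGETARGETTEXTURE2DOESPROC image_target_texture_2d)
{
	EGLImageKHR image =
		create_image(dpy, EGL_NO_CONTEXT, EGL_WAYLAND_BUFFER_WL,
			     (EGLClientBuffer)buffer_resource, NULL);

	glBindTexture(GL_TEXTURE_2D, tex);
	image_target_texture_2d(GL_TEXTURE_2D, image);
}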

On top of all buffer types, there is the slew of different pixel
formats. The compositor advertises the supported pixel formats
separately for each buffer type. Buffers used with GPUs often also
have a non-linear layout, which you must decode yourself if you want
to, and are able to, mmap the storage for direct CPU access. The
benefit of going through Weston's GL-renderer is that it converts
everything into a single pixel format with a linear layout, at a cost
of course. Otherwise you need to handle all format issues yourself.
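
As a trivial illustration of what "handle all format issues yourself"
means in the easiest possible case, even a plain linear 32-bit format
has to be addressed through its stride rather than its width; tiled
or multi-planar layouts need much more than this:

#include <stdint.h>

/* Read one pixel from a linear, 4-bytes-per-pixel buffer
 * (e.g. ARGB8888). Tiled or multi-planar layouts cannot be
 * accessed this simply. */
static uint32_t
read_pixel_argb8888(const uint8_t *data, int stride_bytes, int x, int y)
{
	return *(const uint32_t *)(data + y * stride_bytes + x * 4);
}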

Lastly, and I believe this is the saddest part for you: the NVIDIA
proprietary drivers do not work (the way we would like).

For years NVIDIA has been proposing a solution that is completely
different from anything explained above: EGLStreams, and for the same
number of years the community has been unimpressed with the design.
Anyway, NVIDIA did implement their design and even wrote patches for
Weston, which we have not merged. Other compositors (e.g. Mutter) may
choose to support EGLStreams as a temporary solution.

The real effort to solve the buffer allocation and interoperability
issue was kicked off at XDC 2016 and is now being tracked at:
https://github.com/cubanismo/allocator

However, it may take a long time before that effort results in
something one could actually run, and even longer to stabilize it and
get it into distributions. That is why some people are willing to bite
the bullet and implement the alternative EGLStreams path just to
support the NVIDIA proprietary drivers.


Thanks,
pq