[Mesa-dev] [RFC 1/2] gallium: add renderonly driver

Daniel Stone daniel at fooishbar.org
Mon Oct 26 05:42:45 PDT 2015


Hi Lucas,

On 19 October 2015 at 10:25, Lucas Stach <l.stach at pengutronix.de> wrote:
> On Sunday, 18.10.2015, at 21:41 +0200, Christian Gmeiner wrote:
>> 2015-10-16 15:31 GMT+02:00 Thierry Reding <thierry.reding at gmail.com>:
>> > A Gallium driver is rather complicated, but that's due to the fact that
>> > graphics drivers are complex beasts.
>> >
>> > That said, I think for some areas it might be beneficial to have helpers
>> > to reduce the amount of duplication. However I think at this point in
>> > time we haven't had enough real-world exposure for this kind of driver
>> > to know what the requirements are. For that reason I think it is
>> > premature to use a generic midlayer such as this. Yes, I know that the
>> > alternative is roughly 2000 lines of code per driver, but on one hand
>> > that's nothing compared to the amount of code required by a proper GPU
>> > driver and on the other hand this will (ideally) be temporary until we
>> > get a better picture of where things need to go. At which point it may
>> > become more obvious how we can solve the boilerplate problem while at
>> > the same time avoiding the restrictions imposed by the midlayer.
>>
>> For the moment I have no problem going that way, but I am not sure how long
>> we should wait to collect the needed requirements. Do you know of any other
>> render-only GPU coming in the near future (with open-source drivers)?

Well, everything on ARM, and any multi-GPU/hybrid/Prime situation on Intel.

>> I am really no expert in this area, but the most interesting part of a
>> render-only GPU midlayer is "how to display the rendering result?".
>> So we need to support different tiling formats, and whether an intermediate
>> buffer is needed (for IP combinations where the GPU and scanout formats are
>> not compatible).
>>
>> For etnaviv we need to support all these cases (with the GC2000+ coming in
>> the next months we can render into the scanout buffer directly - same as
>> Tegra).

Luckily etnaviv+imx-drm in the current generation seems to be a
glaring special case. Especially with higher resolutions becoming the
norm (1080p being non-negotiable and 4K working its way into even the
smallest mobile systems, e.g. Mali-450 on Amlogic, where the 6xx+
family is around 4 years old), intermediate copies just aren't viable
anymore.

I'd suggest keeping this as a total special case hidden internally,
quite apart from the general attempt to resolve the disjoint
GPU/scanout issue.

>> I know that it is quite hard to design a good API/midlayer/whatever to be
>> future-proof for the next 1-2 years without (bigger) refactoring work. But
>> at some point we need to jump into the cold water, start designing it and
>> make use of it.
>> I am open to any discussion about software architecture and requirements.
>
> The problems I see with this series are not technical/implementation
> issues, but IMHO this whole device fusing is wrong from an architectural
> point of view.

Absolutely agreed.

> This series implements a way to fuse multiple DRM devices into a single
> EGL device, solely on the basis that current EGL users expect a single
> device to be able to render and scan out at the same time. This is a
> flawed premise in itself, and this series actively tries to keep up the
> illusion that it is true, even for SoC and Prime systems. While doing so
> it puts a lot of device-specific knowledge into the implementation of
> the midlayer/helpers.
>
> IMHO we should not try to keep that illusion, but work out a way for
> client applications to deal with the situation of having multiple
> scanout/render devices in an easy way.
>
> I mean, having render/scanout on different devices isn't exactly news; in
> fact we already have to deal with that for the various Prime laptops.
> Currently all the complexity of those setups is hidden in the X server,
> but that one will go away, and we have to find a way to make it easy for
> the various Wayland compositors to deal with that.

So far we agree ...

> My current proposal would be as follows:
>
> 1. Implement etnaviv/nouveau on tegra as EGL devices that are only able
> to render into offscreen surfaces.
>
> 2. Implement an easy way for compositors to discover EGL devices and
> their capabilities. (There are already patches for EGL device and
> related stuff on the ML)

Indeed. The EGL_EXT_device_base series does need a respin for the new
libdrm device-discovery API, but it would still be nice to get review
here. Having this at least gives us the correct primitives (in the
sense of the driver having all required information), in that gbm has
a handle to the KMS/presentation device, and EGL then has a handle to
the EGL/render device.
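
To make that split concrete, here's a rough sketch (mine, not anything
from this series) of how a compositor ends up holding the two handles.
It assumes EGL_EXT_device_enumeration and EGL_EXT_platform_device on
top of EGL_EXT_device_base; the device path, device index and error
handling are purely illustrative:

/* Rough sketch, not from the series: keep the KMS/presentation handle
 * (gbm) separate from the EGL/render handle (EGLDevice).  Device paths,
 * indexing and error handling are purely illustrative. */
#include <fcntl.h>
#include <gbm.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

int main(void)
{
    /* Presentation side: the display-only device (e.g. imx-drm/tegra-drm). */
    int kms_fd = open("/dev/dri/card0", O_RDWR | O_CLOEXEC);
    struct gbm_device *gbm = gbm_create_device(kms_fd);

    /* Render side: enumerate EGL devices (EGL_EXT_device_base +
     * EGL_EXT_device_enumeration) and pick the one that can render,
     * e.g. etnaviv or nouveau. */
    PFNEGLQUERYDEVICESEXTPROC query_devices =
        (PFNEGLQUERYDEVICESEXTPROC)eglGetProcAddress("eglQueryDevicesEXT");
    EGLDeviceEXT devices[8];
    EGLint num_devices = 0;
    query_devices(8, devices, &num_devices);

    /* With EGL_EXT_platform_device, the render device gives us an
     * EGLDisplay that is independent of the scanout device. */
    PFNEGLGETPLATFORMDISPLAYEXTPROC get_platform_display =
        (PFNEGLGETPLATFORMDISPLAYEXTPROC)
            eglGetProcAddress("eglGetPlatformDisplayEXT");
    EGLDisplay dpy = get_platform_display(EGL_PLATFORM_DEVICE_EXT,
                                          devices[0], NULL);
    eglInitialize(dpy, NULL, NULL);

    /* ... allocate through gbm, render through dpy, share via dmabuf ... */
    (void)gbm;
    return 0;
}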

> 3. Implement a generic KMS EGL device that is only able to provide
> scanout. Have EGL_EXT_stream_consumer_egloutput [1] working on top of
> that.
>
> 4. Implement EGL_KHR_stream_producer_eglsurface for the render only EGL
> devices.

At this point I violently disagree. My view on EGLStreams is roughly
the same as that of STREAMS (or OpenMAX). I can see the attraction for
driver writers, but that no actual consumer has stepped up to demand
it is pretty telling in and of itself.

The biggest problem with EGLStreams is how much it hides from you -
enough to be a total deal-breaker. Firstly, the EGLStreams + EGLOutput
model totally precludes the use of atomic modesetting when pulling
separate surfaces into planes, or trying to drive multiple outputs
together. Secondly, streams have a single fixed latency value across
the entire pipeline, which will rarely hold true, especially if you're
doing GPU blits for presentation, where you're subject to the current
load on the GPU.
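
To be concrete about the first point, this is roughly the per-frame
commit a compositor wants to issue, sketched against the libdrm atomic
API (my illustration; the property ID and plane choice are placeholders
and would be queried at startup). Both planes go down in a single
request, which is exactly what the Streams/EGLOutput consumer model
can't express:

/* Sketch only: flip a UI surface and a zero-copy video overlay in one
 * atomic commit.  The FB_ID property ID would come from
 * drmModeObjectGetProperties() in real code. */
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int flip_frame(int kms_fd, uint32_t primary_plane, uint32_t overlay_plane,
               uint32_t prop_fb_id, uint32_t ui_fb, uint32_t video_fb)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    /* Both planes are updated in the same request, so they land on the
     * same vblank - no blit of the video into the UI surface needed. */
    drmModeAtomicAddProperty(req, primary_plane, prop_fb_id, ui_fb);
    drmModeAtomicAddProperty(req, overlay_plane, prop_fb_id, video_fb);

    ret = drmModeAtomicCommit(kms_fd, req,
                              DRM_MODE_PAGE_FLIP_EVENT |
                              DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);
    return ret;
}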

These two issues alone are enough to rule it out for any compositor
which aims to provide a good media pipeline: it forces a blit where
zero-copy overlays could otherwise be used (again, with large media
sizes this may well be enough to prevent you from displaying at the
native framerate), and the loss of timing information means you can't
do accurate A/V sync.

Tying presentation tightly to EGLSurfaces is also an issue for clients
who need fine-grained control over their display/presentation
pipeline. Chrome, when running under ChromeOS/Freon (i.e. as a native
DRM client), handles all its own buffer allocation, so it can downclock
and go for longer pipelines with higher latency when possible (e.g.
media pipelines where you have long buffers and the size of the latency
is much less important than its predictability), but in latency-critical
situations such as scrolling it switches to a much shorter pipeline for
more aggressive response times. Surfaces make this very difficult, and
Surfaces+Streams even more so.

I sympathise with the core problem behind streams, but ultimately the
answer has to be to have _less_ presentation inside EGL, not more.
EGLSurface and SwapBuffers are not, and will never be, a good enough
pipeline for display presentation. I cannot see Weston ever supporting
it, as it would be a series of steps backwards.

> Details of the hardware, like the need to detile the render buffer, can be
> negotiated as part of the EGL stream handshaking.
>
> This will give a generic compositor an easy way to deal with the
> situation of multiple render/scanout devices without the need to know
> about hardware details. I know that this isn't the easy way to go, but
> it solves the problem in a generic way and we don't run into the same
> problem again once someone attaches a discrete GPU to one of those SoCs.
>
> I would like to hear what others are thinking about that.

Sorry, but I just can't support this.

I'd suggest (again with the position that the etnaviv+imx intermediate
blit should be a hidden special case) that a good starting point to
look at would be gbm itself, which attempts to solve the same problem:
negotiating buffer allocation so a surface is compatible for both
render and scanout. Obviously EGLDevice is needed to implement this
properly, perhaps with a driver-provided helper that attempts to
resolve devices when EGLDevice isn't in use (e.g. if you're aiming at
tegra-drm and there's a nouveau device available, suggest that for
render).
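
For reference, this is the entire negotiation contract gbm exposes
today (a minimal sketch of mine): the usage flags say "must be
renderable and scanout-capable", and the implementation is left to pick
a format/tiling that satisfies both sides:

/* Minimal sketch: ask gbm for a surface that must be both renderable and
 * scanout-capable, and let the implementation pick a format/tiling that
 * satisfies both devices. */
#include <gbm.h>

struct gbm_surface *
make_output_surface(struct gbm_device *gbm, uint32_t width, uint32_t height)
{
    return gbm_surface_create(gbm, width, height,
                              GBM_FORMAT_XRGB8888,
                              GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
}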

gbm itself has a few issues, mind:
  - the only point at which you have both device handles is when you
render through an EGLSurface: the non-Surface gbm allocation APIs
don't let you work through EGL (and the EGL dmabuf import does not
support DRM format modifiers, but that's easily enough solved; see the
sketch after this list)
  - the Surface APIs do not allow smart clients to explicitly control
buffer allocation and pipelining (i.e. allocate buffers separately and
then inject them for Surface rendering)
  - the gbm implementation expects that the DRI driver knows what a
scanout allocation means, which is not really possible to know in all
situations
  - the usage flags are not fine-grained enough, lacking e.g.
disambiguation of multiple GPUs, support for external media-decode (or
encode for Miracast) devices, etc
  - gbm_bo_import(GBM_BO_IMPORT_WL_BUFFER) does not belong in gbm at
all, but it was easier than creating new EGL API ...
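
To illustrate the dmabuf-import point from the first item above, this
is roughly what the EGL_EXT_image_dma_buf_import path looks like (my
sketch, with a hypothetical helper name). The attribute list carries
format, fd, offset and pitch, but has no slot for a DRM format
modifier, so tiling information is dropped at the import boundary:

/* Sketch of the dmabuf import path as it stands: format, fd, offset and
 * pitch are passed, but there is no attribute for a DRM format modifier,
 * so tiling information is lost at the import boundary.  The helper name
 * is hypothetical. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <drm_fourcc.h>

EGLImageKHR import_dmabuf(EGLDisplay dpy, int dmabuf_fd,
                          EGLint width, EGLint height, EGLint pitch)
{
    const EGLint attribs[] = {
        EGL_WIDTH,                     width,
        EGL_HEIGHT,                    height,
        EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_XRGB8888,
        EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
        EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
        EGL_DMA_BUF_PLANE0_PITCH_EXT,  pitch,
        EGL_NONE
    };
    PFNEGLCREATEIMAGEKHRPROC create_image =
        (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");

    /* For EGL_LINUX_DMA_BUF_EXT the context and client buffer are unused. */
    return create_image(dpy, EGL_NO_CONTEXT,
                        EGL_LINUX_DMA_BUF_EXT, NULL, attribs);
}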

But even with all that, it's architecturally the correct starting
point, and allows compositors to be smarter. Dropping everyone
to the lowest common denominator cripples consumers such as Weston and
Freon, in favour of hypothetical trivial systems. Driving DRM/KMS
isn't particularly simple anyway, and projects like libweston allow
external clients to take advantage of those smarts whilst not being
too deeply tied to what we want to do.

A good approach would probably be to resolve gbm's internal API issues
first - get EGL_EXT_device_base merged, and give gbm a pluggable
internal interface so that the independent GPU/scanout components can
negotiate acceptable format+modifier+pitch triplets. After that's
done, attacking the more difficult external API issues should give us
a good base for handling external buffer allocation. If someone is
desperate to use (implement) Streams, they could then make a complete
Streams implementation for KMS as an EGLOutput consumer, on top of
gbm.
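
Purely as a strawman for what that pluggable internal interface might
look like (none of these names exist in gbm today; they are entirely
hypothetical):

/* Hypothetical sketch only.  The idea is that the render and scanout
 * backends each describe what they can accept, and gbm intersects the
 * two sets before allocating. */
#include <stddef.h>
#include <stdint.h>

struct allocation_constraint {
    uint32_t format;      /* DRM FourCC */
    uint64_t modifier;    /* DRM format modifier (tiling/compression) */
    uint32_t pitch_align; /* required pitch alignment in bytes */
};

struct gbm_backend_funcs {
    /* Each side (render driver, KMS driver) lists the triplets it can
     * consume for the requested usage. */
    int (*get_constraints)(void *priv, uint32_t usage,
                           struct allocation_constraint *out,
                           size_t *count);

    /* Allocate against a triplet both sides agreed on. */
    void *(*alloc)(void *priv, uint32_t width, uint32_t height,
                   const struct allocation_constraint *chosen);
};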

Lastly, as a somewhat nitpicky point, importing a buffer and then
separately modifying its tiling flags is not a great idea
architecturally, and has been shown (I know it's come up in ChromeOS
profiling) to cause performance issues. Rather than doing a naïve
import and treating tiling as some magic driver-specific thing, using
DRM format modifiers during the import allows you to do tiling import
correctly the first time, including a full validation of the
applicable restrictions.
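
On the KMS side, "modifiers travel with the buffer" looks roughly like
the sketch below, assuming the drmModeAddFB2WithModifiers wrapper in
libdrm; the format and modifier values are just examples. The point is
that the kernel validates the format+modifier combination at AddFB
time, instead of the driver guessing tiling from a side channel:

/* Sketch: the modifier travels with the buffer at AddFB time, instead of
 * tiling flags being patched after a naive import.  The modifier and
 * format here are only examples; the real values come from whoever
 * allocated the buffer. */
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

int add_imported_fb(int kms_fd, uint32_t width, uint32_t height,
                    uint32_t gem_handle, uint32_t pitch, uint64_t modifier,
                    uint32_t *fb_id)
{
    uint32_t handles[4]   = { gem_handle };
    uint32_t pitches[4]   = { pitch };
    uint32_t offsets[4]   = { 0 };
    uint64_t modifiers[4] = { modifier };

    /* The kernel validates the format+modifier combination up front,
     * rather than the driver inferring tiling from a side channel. */
    return drmModeAddFB2WithModifiers(kms_fd, width, height,
                                      DRM_FORMAT_XRGB8888,
                                      handles, pitches, offsets, modifiers,
                                      fb_id, DRM_MODE_FB_MODIFIERS);
}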

Thanks a lot for your work and all the thoughtful comments here. This
has been a pain point for a while now, though atomic modesetting and
other work within Wayland/Weston itself have been more pressing, so I
haven't been able to expend much time on it. I'm really glad to
see someone having picked this up though, and if done properly,
everyone with an ARM or a multi-GPU system will thank you. :)

Cheers,
Daniel

