How to handle disconnection of eDP panels due to dynamic display mux switches

Wed Apr 1 08:14:34 UTC 2020

On Tue, 31 Mar 2020 20:59:39 -0500
Daniel Dadap <ddadap at nvidia.com> wrote:

> On 3/30/20 10:11 AM, Jani Nikula wrote:
> > On Fri, 27 Mar 2020, Daniel Dadap <ddadap at nvidia.com> wrote:  
> >> A number of hybrid GPU notebook computer designs with dual (integrated
> >> plus discrete) GPUs are equipped with multiplexers (muxes) that allow
> >> display panels to be driven by either the integrated GPU or the discrete
> >> GPU. Typically, this is a selection that can be made at boot time as a
> >> menu option in the system firmware's setup screen, and the mux selection
> >> stays fixed for as long as the system is running and persists across
> >> reboots until it is explicitly changed. However, some muxed hybrid GPU
> >> systems have dynamically switchable muxes which can be switched while
> >> the system is running.
> >>
> >> NVIDIA is exploring the possibility of taking advantage of dynamically
> >> switchable muxes to enhance the experience of using a hybrid GPU system.
> >> For example, on a system configured for PRIME render offloading, it may
> >> be possible to keep the discrete GPU powered down and use the integrated
> >> GPU for rendering and displaying the desktop when no applications are
> >> using the discrete GPU, and dynamically switch the panel to be driven
> >> directly by the discrete GPU when render-offloading a fullscreen
> >> application.
> >>
> >> We have been conducting some experiments on systems with dynamic muxes,
> >> and have found some limitations that would need to be addressed in order
> >> to support use cases like the one suggested above:
> >>
> >> * In at least the i915 DRM-KMS driver, and likely in other DRM-KMS
> >> drivers as well, eDP panels are assumed to be always connected. This
> >> assumption is broken when the panel is muxed away, which can cause
> >> problems. A typical symptom is i915 repeatedly attempting to retrain the
> >> link, severely impacting system performance and printing messages like
> >> the following every five seconds or so:
> >>
> >> [drm:intel_dp_start_link_train [i915]] *ERROR* failed to enable link
> >> training
> >> [drm] Reducing the compressed framebuffer size. This may lead to less
> >> power savings than a non-reduced-size. Try to increase stolen memory
> >> size if available in BIOS.
> >>
> >> This symptom might occur if something causes the DRM-KMS driver to probe
> >> the display while it's muxed away, for example a modeset or DPMS state
> >> change.
> >>
> >> * When switching the mux back to a GPU that was previously driving a
> >> mode, it is necessary to at the very least retrain DP links to restore
> >> the previously displayed image. In a proof of concept I have been
> >> experimenting with, I am able to accomplish this from userspace by
> >> triggering DPMS off and then back on again; however, it would be good to
> >> have an in-kernel API to request that an output owned by a DRM-KMS
> >> driver be refreshed to resume driving a mode on a disconnected and
> >> reconnected display. This API would need to be accessible from outside
> >> of the DRM-KMS driver handling the output. One reason it would be good
> >> to do this within the kernel, rather than rely on e.g. DPMS operations
> >> in the xf86-video-modesetting driver, is that it would be useful for
> >> restoring the console if X crashes or is forcefully killed while the mux
> >> is switched to a GPU other than the one which drives the console.
> >>
> >> Basically, we'd like to be able to do the following:
> >>
> >> 1) Communicate to a DRM-KMS driver that an output is disconnected and
> >> can't be used. Ideally, DRI clients such as X should still see the
> >> output as being connected, so user applications don't need to keep track
> >> of the change.  
> > I think everything will be much easier if you provide a way for
> > userspace to control the muxing using the KMS API, and not lie to the
> > userspace about what's going on.
> >
> > You're not actually saying what component you think should control the
> > muxing.
> >
> > Why should the drivers keep telling the userspace the output is
> > connected when it's not? Obviously the userspace should also switch to
> > using a different output on a different GPU, right? Or are you planning
> > some proprietary behind the scenes hack for discrete?  
> 
> 
> The desire to lie to userspace is driven mainly by trying to avoid 
> interactions from desktop environments / window managers reacting to the 
> display going away. Many desktops will do things like try to migrate 
> windows in response to a change in the current display configuration, 
> and updating all of them to avoid doing so when a display appears to 
> disappear from one GPU and reappear on another GPU seems harder than 
> allowing userspace to believe that nothing has changed. I wouldn't mind 
> if e.g. X drivers were in on the lie, and the lie boundary shifts to 
> RandR, but it would be nice to avoid having to deal with the fallout of 
> desktop environments handling displays apparently vanishing and 
> re-appearing.

Hi,

I love the general idea of using the mux to optimize hardware usage,
but I really do not like the initial software design proposal.

I'm afraid that lying is going to lead to a disaster eventually, whereas
being honest only requires fixing a more obvious shortcoming in a
simpler way today, or rather, implementing a new feature that takes
advantage of the new capabilities. Lying would lock the whole graphics
stack to the single design you thought of today, stopping any novel
ways of using the feature from appearing later.

Bypassing the desktop or the display server is practically always the
wrong design, whether it is this, color management, or whatever.

> The particular use case we're envisioning here is as follows:
> 
> * GPU A drives an X protocol screen which hosts a desktop session.
> Applications are rendered on GPU A by default. The mux is switched to 
> GPU A by default.
> * GPU B drives a GPU screen that can be used as a PRIME render offload 
> source. Applications rendered on GPU B can run in windows presented by 
> GPU A via PRIME render offloading.
> * If an application rendered on GPU B and presented on GPU A becomes 
> fullscreen, the mux can switch to GPU B and GPU B can present the 
> application directly for as long as the application remains in the 
> foreground and fullscreen.
> * The mux switches back to GPU A and the application presents via GPU A 
> and render offloading if it transitions to a window or another window 
> occludes it.

I do not see how you could ever pull that off without patching all
display servers to specifically support that use case (think of Wayland
architecture here). When the mux is switched, the userspace also needs
to switch posting DRM FBs from DRM KMS device A to DRM KMS device B.

What you describe is totally fine for a Wayland display server to do
automatically, and in the Wayland architecture there really is no other
component that could even attempt to do it. In fact, a Wayland
display server is the only component in the architecture that actually
needs to do anything about it to make it work. Your use case is a
perfect fit in the Wayland architecture, if the mux is controlled
exclusively by the display server and no-one lies.
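
To make the quoted use case concrete, here is a minimal sketch in Python
of the decision a display server could make when it alone controls the
mux. The names (Window, choose_scanout_gpu, the GPU labels "A" and "B")
are hypothetical and only illustrate the policy described above; this is
not an existing API.

```python
from dataclasses import dataclass


@dataclass
class Window:
    render_gpu: str   # GPU that renders the client's frames, e.g. "A" or "B"
    fullscreen: bool  # True if the window covers the whole panel
    foreground: bool  # True if no other window occludes it


def choose_scanout_gpu(window: Window, desktop_gpu: str = "A") -> str:
    """Return the GPU that should drive the panel for this window state.

    Assumption: the panel goes to the offload GPU only while the client
    rendered there is both fullscreen and in the foreground; in every
    other case the desktop GPU keeps (or takes back) the panel.
    """
    if (window.render_gpu != desktop_gpu
            and window.fullscreen
            and window.foreground):
        return window.render_gpu
    return desktop_gpu
```

Whenever the result changes, the display server would toggle the mux and
start posting its framebuffers to the other KMS device.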

My suggestion is to just trigger the exact same hotplug actions that
physically disconnecting a display cable from one card and plugging it
into another card already triggers. It's not like that is ever going to
happen behind a display server's back: it is the display server itself
toggling the mux, so it knows how to handle it right. What the display
server *does* need to know beforehand is exactly which connectors the
mux affects.
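
As a rough illustration of the hotplug idea, the following Python sketch
models a mux that knows which connector it feeds on each GPU and reports
a switch as an ordinary disconnect/connect pair. The Mux class and the
connector names "eDP-1" and "eDP-2" are assumptions made up for this
example; no such kernel or userspace interface exists today.

```python
class Mux:
    """Toy model of a display mux shared by two GPUs."""

    def __init__(self, connectors, selected):
        # connectors maps a GPU name to the connector the mux feeds on it.
        self.connectors = dict(connectors)
        if selected not in self.connectors:
            raise ValueError(f"mux does not reach GPU {selected!r}")
        self.selected = selected

    def switch_to(self, gpu):
        """Switch the panel to `gpu`.

        Returns the hotplug events a display server would observe, as
        (connector, status) tuples, in the order they would be delivered.
        """
        if gpu not in self.connectors:
            raise ValueError(f"mux does not reach GPU {gpu!r}")
        if gpu == self.selected:
            return []  # nothing changes, so no hotplug events
        events = [
            (self.connectors[self.selected], "disconnected"),
            (self.connectors[gpu], "connected"),
        ]
        self.selected = gpu
        return events


mux = Mux({"A": "eDP-1", "B": "eDP-2"}, selected="A")
events = mux.switch_to("B")
```

Because the display server initiated the switch and knows which
connectors belong to the mux, it can treat the pair of events as a move
of the same panel rather than as an unrelated unplug and plug.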

So yeah, I am implying that having any side-band access to the mux that
bypasses the running display server is a bad idea, just like we do not
allow non-DRM-master programs to set modes via KMS while a display
server holds DRM master and thinks it is controlling the displays.

If it is useful for Xorg to lie to the X11 RandR clients, then ok, I
don't care about that. That's up to Xorg. RandR itself is already a
kind of a side-band for setting modes and whatnot behind the X11
desktop environment's back. Wayland architecture does not have that
problem, and I don't want that problem to appear either.

This optimization in general, not the mux toggling part, would already
be extremely useful with eGPUs[1]. Assume you have a game rendering on
an eGPU and a display connected to the eGPU showing the game. A naive
display server that has a client rendering on an eGPU will transfer
the client frames to the iGPU for composition and then back again to
the eGPU for display. If the client frame can be shown directly on the
eGPU display, there is no need to transfer the frame back and forth
over the bus. So I bet display servers will be gaining that
optimization logic sooner or later (if they care about the eGPU use
case), and it does not seem that making the same logic apply to mux
switching would be too much work on top.
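
To put a number on the eGPU example, here is a small Python sketch that
counts how often a frame has to cross between GPUs on its way to the
display. The function name and the "eGPU"/"iGPU" labels are hypothetical
and only model the two paths described above.

```python
def bus_crossings(render_gpu, composite_gpu, display_gpu):
    """Count inter-GPU transfers for one frame along render -> composite -> display."""
    path = [render_gpu, composite_gpu, display_gpu]
    # Each adjacent pair of stages on different GPUs costs one transfer.
    return sum(1 for src, dst in zip(path, path[1:]) if src != dst)


# Naive path: render on the eGPU, composite on the iGPU, display on the eGPU.
naive = bus_crossings("eGPU", "iGPU", "eGPU")    # 2 transfers
# Direct path: the client frame is shown on the eGPU display as-is.
direct = bus_crossings("eGPU", "eGPU", "eGPU")   # 0 transfers
```

The same comparison applies to a muxed panel once the mux has been
switched to the GPU that renders the fullscreen client.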

Thanks,
pq

[1] https://gitlab.gnome.org/GNOME/mutter/-/issues/348

> I think DRI3 render offload works a bit differently, but hopefully the 
> high-level concept is somewhat applicable to that framework as well.
> 
> As for what should be controlling the muxing, I suppose that depends on 
> what you mean by controlling:
> 
> If you mean controlling the mux device itself, that should be a platform 
> driver that offers an API to execute the mux switch itself. The existing 
> vga-switcheroo framework would be a natural fit, but it would need some 
> substantial changes in order to support this sort of use case. I've 
> described some of the challenges we've observed so far in my response to 
> Daniel Vetter.
> 
> If you mean what should drive the policy of when automatic mux switches 
> occur, it would have to be something that is aware of what at least one 
> of the GPUs is displaying. It could be one of the GPU drivers, or a 
> client of the GPU drivers, e.g. X11 or a Wayland compositor.
> 
> For the proof of concept experiments we are currently conducting, both 
> of these roles are currently performed by components of the NVIDIA 
> proprietary GPU stack, but the functionality could be moved to another 
> component (e.g. vga-switcheroo, X11, server-side GLVND, ???) if the 
> necessary functionality becomes supported in the future.
> 
> 
> > BR,
> > Jani.
> >  
> >> 2) Request that a mode that was previously driven on a disconnected
> >> output be driven again upon reconnection.
> >>
> >> If APIs to do the above are already available, I wasn't able to find
> >> information about them. These could be handled as separate APIs, e.g.,
> >> one to set connected/disconnected state and another to restore an
> >> output, or as a single API, e.g., signal a disconnect or reconnect,
> >> leaving it up to the driver receiving the signal to set the appropriate
> >> internal state and restore the reconnected output. Another possibility
> >> would be an API to disable and enable individual outputs from outside of
> >> the DRM-KMS driver that owns them. I'm curious to hear the thoughts of
> >> the DRM subsystem maintainers and contributors on what the best approach
> >> to this would be.
> >>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel  
> > --
> > Jani Nikula, Intel Open Source Graphics Center  
