drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

Johan Hovold johan at kernel.org
Fri Mar 8 12:43:07 UTC 2024


Hi Abhinav, Rob, Dmitry and Kuogee,

On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
> Lenovo ThinkPad X13s after connecting and disconnecting an external
> display.
> 
> I have triggered this on a simple disconnect while in a VT console, but
> also when stopping Xorg after having repeatedly connected and
> disconnected an external display and tried to enable it using xrandr.
> 
> In the former case, the last (custom debug) messages printed over an SSH
> session were once:
> 
>     [  948.416358] usb 5-1: USB disconnect, device number 3
>     [  948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
>     [  948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
>     [  948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
>     [  948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
>     
> and then the hypervisor resets the machine.

Has there been any progress on tracking down the reset on disconnect
issue? I was expecting you to share your findings here so that we can
determine if the rest of the runtime PM series needs to be reverted or
not before 6.8 is released (possibly on Sunday).

I really did not want to spend more time on this driver than I already
have this cycle, but the lack of (visible) progress has again forced me
to do so.

It's quite likely that the resets are indeed a regression caused by the
runtime PM series as the bus clocks were not disabled on disconnect
before that one was merged in 6.8-rc1.

In a VT console, the device is now runtime suspended immediately on
disconnect, while in X, it currently remains active until X is killed,
which is consistent with what I reported above.

We now also have Bjorn's call trace from a different Qualcomm platform
triggering an unclocked access on disconnect, something which would
trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

[ 2030.379417] SError Interrupt on CPU0, code 0x00000000be000000 -- SError
[ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2030.379501] pc : el1_interrupt+0x28/0xc0
[ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
[ 2030.379520] sp : ffff80008129b700
[ 2030.379523] x29: ffff80008129b700 x28: ffff7a0a031aa200 x27: ffff7a0af768c3c0
[ 2030.379530] x26: 0000000000000000 x25: ffff7a0a024bb568 x24: ffff7a0a031aa200
[ 2030.379537] x23: 0000000060400005 x22: ffffd2a4a10c871c x21: ffff80008129b8a0
[ 2030.379544] x20: ffffd2a4a00002f0 x19: ffff80008129b750 x18: 0000000000000000
[ 2030.379549] x17: 0000000000000000 x16: ffffd2a4a10c21d4 x15: 0000000000000000
[ 2030.379555] x14: 0000000000000000 x13: 000000000001acf6 x12: 0000000000000000
[ 2030.379560] x11: 0000000000000001 x10: ffff7a0af623f680 x9 : 0000000100000001
[ 2030.379565] x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 000000000000003f
[ 2030.379570] x5 : ffff7a0a003c6a70 x4 : 000000000000001f x3 : ffffd2a48d6722dc
[ 2030.379576] x2 : 0000000000000002 x1 : ffffd2a4a00002f0 x0 : ffff80008129b750
[ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379642] Call trace:
[ 2030.379644]  dump_backtrace+0xec/0x108
[ 2030.379654]  show_stack+0x18/0x24
[ 2030.379659]  dump_stack_lvl+0x40/0x84
[ 2030.379664]  dump_stack+0x18/0x24
[ 2030.379668]  panic+0x130/0x34c
[ 2030.379673]  nmi_panic+0x44/0x90
[ 2030.379679]  arm64_serror_panic+0x68/0x74
[ 2030.379683]  do_serror+0xc4/0xcc
[ 2030.379686]  el1h_64_error_handler+0x34/0x4c
[ 2030.379692]  el1h_64_error+0x64/0x68
[ 2030.379696]  el1_interrupt+0x28/0xc0
[ 2030.379700]  el1h_64_irq_handler+0x18/0x24
[ 2030.379706]  el1h_64_irq+0x64/0x68
[ 2030.379710]  _raw_spin_unlock_irq+0x20/0x48
[ 2030.379718]  wait_for_common+0xb4/0x16c
[ 2030.379722]  wait_for_completion_timeout+0x14/0x20
[ 2030.379727]  dp_ctrl_push_idle+0x34/0x8c [msm]
[ 2030.379844]  dp_bridge_atomic_disable+0x18/0x24 [msm]
[ 2030.379959]  drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
[ 2030.380150]  drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
[ 2030.380200]  msm_atomic_commit_tail+0x1b4/0x474 [msm]
[ 2030.380316]  commit_tail+0xa4/0x158 [drm_kms_helper]
[ 2030.380369]  drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
[ 2030.380418]  drm_atomic_commit+0xa8/0xd4 [drm]
[ 2030.380529]  drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
[ 2030.380641]  drm_client_modeset_commit_locked+0x50/0x168 [drm]
[ 2030.380753]  drm_client_modeset_commit+0x2c/0x54 [drm]
[ 2030.380865]  __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
[ 2030.380915]  drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
[ 2030.380965]  msm_fbdev_client_hotplug+0x28/0xc8 [msm]
[ 2030.381081]  drm_client_dev_hotplug+0x94/0x118 [drm]
[ 2030.381192]  output_poll_execute+0x214/0x26c [drm_kms_helper]
[ 2030.381241]  process_scheduled_works+0x19c/0x2cc
[ 2030.381249]  worker_thread+0x290/0x3cc
[ 2030.381255]  kthread+0xfc/0x184
[ 2030.381260]  ret_from_fork+0x10/0x20

The above could happen if the convoluted hotplug state machine breaks
down so that the device is runtime suspended before
dp_bridge_atomic_disable() is called.

For some reason, possibly due to unrelated changes in timing, possibly
after the hotplug revert, I am no longer able to reproduce the reset
with 6.8-rc7 on the X13s.

I am however able to get the hotplug state machine to leak runtime PM
reference counts, which prevents the device from ever being suspended
(e.g. by disconnecting slowly so that we get multiple connect and
disconnect events). This can manifest as a hotplug event which is
processed after disconnecting the display:

	[drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

I would not be surprised at all if there's a sequence of events that
leads to an unbalanced put instead (and the stack trace and observations
above make this appear likely). 

Were you able to determine what events lead to the premature disabling
of the bus clocks on the RB3?

Have you got any reason to believe that the revert of the hotplug
notification patch may in any way prevent that? Or is it just papering
over the issue?
 
> Hotplug in Xorg seems to work worse than it did with 6.7, which also had
> some issues. Connecting a display once seems to work fine, but trying to
> re-enable a reconnected display using xrandr sometimes does not work at
> all, while with 6.7 it usually worked on the second xrandr execution.
> 
> xrandr reports the reconnected display as disconnected:

> Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
> display as connected, but the display is still blank (unlike with 6.7).

As I mentioned elsewhere, the revert of e467e0bde881 ("drm/msm/dp: use
drm_bridge_hpd_notify() to report HPD status changes") in rc7 does seem
to help with the hotplug detect issues that I could reproduce (in VT
console, X and wayland).

The question is whether we should revert the whole runtime PM series so
that the bus clock is left on, which should prevent any resets on
disconnect.

Without any analysis from you or reason to believe the issue to have
been resolved, I'm inclined to just go ahead and revert it. It clearly
had not been tested enough before being merged and I'm quite frustrated
with how this has been handled.

Johan
