[PATCH] drm/msm/dp: move link_ready out of HPD event thread

Wed Mar 13 17:24:08 UTC 2024

On 3/13/2024 1:18 AM, Johan Hovold wrote:
> On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote:
>> On 3/12/2024 9:59 AM, Johan Hovold wrote:
> 
>>>> Heh. This is getting ridiculous. I just tried running with this patch
>>>> and it again breaks hotplug detect in a VT console and in X (where I
>>>> could enable a reconnected external display by running xrandr twice
>>>> before).
>>>>
>>>> So, please, do not apply this one.
>>>
>>> To make things worse, I indeed also hit the reset when disconnecting
>>> after such a failed hotplug.
> 
>> Ack, I will hold off till I analyze your issues more which you have
>> listed in separate replies. Especially about the spurious connect, I
>> believe you are trying to mention that, by adding logs, you are able to
>> delay the processing of a connect event to *make* it like a spurious
>> one? In case, I got this part wrong, can you pls explain the spurious
>> connect scenario again?
> 
> No, I only mentioned the debug printks in passing as instrumentation
> like that may affect race conditions (but I'm also hitting the resets
> also with no printks in place).
> 
> The spurious connect event comes directly from the pmic firmware, and
> even if we may optimise things by implementing some kind of debounce,
> the hotplug implementation needs to be robust enough to not kill the
> machine if such an event gets through.
> 
> Basically what I see is that during physical disconnect there can be
> multiple hpd notify events (e.g. connect, disconnect, connect):
> 
> [  146.910195] usb 5-1: USB disconnect, device number 4
> [  146.931026] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> [  146.934785] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> [  146.938114] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> [  146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> [  146.955193] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
> 
> And it is the spurious connect event while the link is being tore down
> that triggers the hotplug processing that leads to the reset.
> 
> Similarly, I've seen spurious disconnect events while the plug in being
> inserted.
> 

This is quite weird and also explains why most of the issues were seen 
only with sc8280xp. pmic spurious events are busting the hpd logic.

Agreed, that DP driver should be robust enough to handle this but this 
will also bust usermode to send down unnecessary frames. Someone should 
address why these spurious events are coming.

>> A short response on why this change was made is that commit can be
>> issued by userspace or the fbdev client. So userspace involvement only
>> makes commit happen from a different path. It would be incorrect to
>> assume the issues from the earlier bug and the current one are different
>> only because there was userspace involvement in that one and not this.
>>
>> Because in the end, it manifests itself in the same way that
>> atomic_enable() did not go through after an atomic_disable() and the
>> next atomic_disable() crashes.
> 
> Right, but your proposed fix would not actually fix anything and judging
> from the sparse commit message and diff itself it is clearly only meant
> to mitigate the case where user space is involved, which is *not* the
> case here.
> 

No, I think there is some disconnect in the way you are reading that 
patch perhaps due to some missing details OR I am missing your point.

Like I said, drm_atomic_commit() can be issued by userspace or the fbdev 
client in the driver. Thats the only userspace involvement.

Now, why the patch was made or was expected to work.

There can be a race condition between the time the DP driver gets the 
hpd disconnect event and when the hpd thread processes that event 
allowing the commit to sneak in. This is something which has always been 
there even without pm_runtime series and remains even today.

In this race condition, the setting of "link_ready" to false can be a 
bit delayed if we go through the HPD event processing increasing the 
race condition window.

If link_ready is false, atomic_check() fails, thereby failing any 
commits and hence not allowing the atomic_disable() / atomic_enable() 
cycle and hence avoiding this reset.

The patch is moving the setting of link_ready to false earlier by not 
putting it through the HPD event thread and hence trying to reduce the 
window of the issue.

> Johan