[PATCH] drm/probe-helper: fix output polling not resuming after HPD IRQ storm
nicusor.huhulea at siemens.com
nicusor.huhulea at siemens.com
Wed Aug 6 14:27:13 UTC 2025
>>-----Original Message-----
>>From: Dmitry Baryshkov <dmitry.baryshkov at oss.qualcomm.com>
>>Sent: Wednesday, August 6, 2025 3:34 PM
>>To: Huhulea, Nicusor Liviu (FT FDS CES LX PBU 1)
>><nicusor.huhulea at siemens.com>
>>Cc: imre.deak at intel.com; stable at vger.kernel.org; dri-
>>devel at lists.freedesktop.org; intel-gfx at lists.freedesktop.org; cip-dev at lists.cip-
>>project.org; shradhagupta at linux.microsoft.com; jouni.hogander at intel.com;
>>neil.armstrong at linaro.org; jani.nikula at linux.intel.com;
>>maarten.lankhorst at linux.intel.com; mripard at kernel.org; tzimmermann at suse.de;
>>airlied at gmail.com; daniel at ffwll.ch; joonas.lahtinen at linux.intel.com;
>>rodrigo.vivi at intel.com; laurentiu.palcu at oss.nxp.com; Hombourger, Cedric (FT
>>FDS CES LX) <cedric.hombourger at siemens.com>; Bobade, Shrikant Krishnarao
>>(FT FDS CES LX PBU 1) <shrikant.bobade at siemens.com>
>>Subject: Re: [PATCH] drm/probe-helper: fix output polling not resuming after HPD
>>IRQ storm
>>
>>On Wed, Aug 06, 2025 at 12:02:02PM +0000, nicusor.huhulea at siemens.com
>>wrote:
>>>
>>>
>>> >>-----Original Message-----
>>> >>From: Imre Deak <imre.deak at intel.com>
>>> >>Sent: Tuesday, August 5, 2025 6:20 PM
>>> >>To: Dmitry Baryshkov <dmitry.baryshkov at oss.qualcomm.com>
>>> >>Cc: Huhulea, Nicusor Liviu (FT FDS CES LX PBU 1)
>>> >><nicusor.huhulea at siemens.com>; stable at vger.kernel.org; dri-
>>> >>devel at lists.freedesktop.org; intel-gfx at lists.freedesktop.org;
>>> >>cip-dev at lists.cip- project.org; shradhagupta at linux.microsoft.com;
>>> >>jouni.hogander at intel.com; neil.armstrong at linaro.org;
>>> >>jani.nikula at linux.intel.com; maarten.lankhorst at linux.intel.com;
>>> >>mripard at kernel.org; tzimmermann at suse.de; airlied at gmail.com;
>>> >>daniel at ffwll.ch; joonas.lahtinen at linux.intel.com;
>>> >>rodrigo.vivi at intel.com; laurentiu.palcu at oss.nxp.com; Hombourger,
>>> >>Cedric (FT FDS CES LX) <cedric.hombourger at siemens.com>; Bobade,
>>> >>Shrikant Krishnarao (FT FDS CES LX PBU 1)
>>> >><shrikant.bobade at siemens.com>
>>> >>Subject: Re: [PATCH] drm/probe-helper: fix output polling not
>>> >>resuming after HPD IRQ storm
>>> >>
>>> >>On Tue, Aug 05, 2025 at 01:46:51PM +0300, Dmitry Baryshkov wrote:
>>> >>> On Mon, Aug 04, 2025 at 11:13:59PM +0300, Nicusor Huhulea wrote:
>>> >>> > A regression in output polling was introduced by commit
>>> >>> > 4ad8d57d902fbc7c82507cfc1b031f3a07c3de6e
>>> >>> > ("drm: Check output polling initialized before disabling") in
>>> >>> > the 6.1.y stable
>>> >>tree.
>>> >>> > As a result, when the i915 driver detects an HPD IRQ storm and
>>> >>> > attempts to switch from IRQ-based hotplug detection to polling,
>>> >>> > output polling
>>> >>fails to resume.
>>> >>> >
>>> >>> > The root cause is the use of dev->mode_config.poll_running. Once
>>> >>> > poll_running is set (during the first connector detection) the
>>> >>> > calls to drm_kms_helper_poll_enable(), such as
>>> >>> > intel_hpd_irq_storm_switch_to_polling() fails to schedule
>>> >>> > output_poll_work as
>>> >>expected.
>>> >>> > Therefore, after an IRQ storm disables HPD IRQs, polling does
>>> >>> > not start,
>>> >>breaking hotplug detection.
>>> >>>
>>> >>> Why doesn't disable path use drm_kms_helper_poll_disable() ?
>>> >>
>>> >>In general i915 doesn't disable polling as a whole after an IRQ
>>> >>storm and a 2 minute delay (or runtime resuming), since on some
>>> >>other connectors the polling may be still required.
>>> >>
>>> >>Also, in the 6.1.y stable tree drm_kms_helper_poll_disable() is:
>>> >>
>>> >> if (drm_WARN_ON(dev, !dev->mode_config.poll_enabled))
>>> >> return;
>>> >>
>>> >>
>>> >> cancel_delayed_work_sync(&dev->mode_config.output_poll_work);
>>> >>
>>> >>so calling that wouldn't really fix the problem, which is clearly
>>> >>just an incorrect backport of the upstream commit 5abffb66d12bcac8 ("drm:
>>> >>Check output polling initialized before disabling") to the v6.1.y
>>> >>stable tree by commit 4ad8d57d902f ("drm: Check output polling
>>> >>initialized before disabling") in v6.1.y.
>>> >>
>>> >>The upstream commit did not add the check for
>>> >>dev->mode_config.poll_running in drm_kms_helper_poll_enable(), the
>>> >>condition was only part of the diff context in the commit. Hence
>>> >>adding the condition in the 4ad8d57d902f backport commit was incorrect.
>>> >>
>>> >>> > The fix is to remove the dev->mode_config.poll_running in the
>>> >>> > check condition, ensuring polling is always scheduled as requested.
>>> I'm not a full-time kernel developer, but I spent some time trying to really
>>understand both the rationale and the effects of commit "Check output polling
>>initialized before disabling."
>>> Here's how I see the issue:
>>> That commit was introduced mainly as a defensive measure, to protect
>>> drivers such as hyperv-drm that do not initialize connector polling.
>>> In those drivers, calling drm_kms_helper_poll_disable() without proper
>>> polling setup could trigger warnings or even stack traces, since there
>>> are no outputs to poll and the polling helpers don't apply. From what
>>> I understand, the poll_running variable was added to prevent cases
>>> where polling was accidentally disabled for drivers that were never
>>> set up for it. However, this fix ended up creating a new class of
>>> breakage, which I have observed and describe here.
>>
>>
>>It was added to implement the very simple logic: If something isn't initialized,
>>enabling or disabling it is an error. If something isn't enabled, disabling it is an
>>error. If something isn't disabled, enabling it is an error.
>>
>>>
>>> To me, the core logic should be simple: if polling is needed, then we should
>>allow it to proceed (regardless of whether it's been scheduled previously).
>>>
>>> Looking at the drm_kms_helper_poll_disable() if (drm_WARN_ON(dev,
>>> !dev->mode_config.poll_enabled))
>>> return;
>>>
>>> If the driver never enabled polling (that is, drm_kms_helper_poll_enable() was
>>never called), then the disable operation is effectively a no-op-totally safe for
>>drivers like hyperv-drm.
>>>
>>> And in the enable function drm_kms_helper_poll_enable(..):
>>> if (drm_WARN_ON_ONCE(dev, !dev->mode_config.poll_enabled) ||
>>> - !drm_kms_helper_poll || dev->mode_config.poll_running)
>>> + !drm_kms_helper_poll)
>>> return;
>>
>>Why?
Then we have two perspectives/views and I acknowledge them both:
A: My view is that "if pooling is needed, then just poll" and I consider to be the pragmatic way and robust in the face of the hardware. If the state needs to be active, ensure it's active regardless of previous state.
>>
>>> The main thing being guarded here is whether polling has actually been initialized:
>>> 1.For polling drivers like i915, removing poll_running from the enable
>>> path is both safe and necessary: it fixes the regression with HPD
>>> polling after IRQ storms
>>
>>I believe in paired calls. If you want to use drm_kms_helper_poll_enable(), it should
>>be previously drm_kms_helper_poll_disable()d. If you have manually disabled the
>>IRQ, it should also be manually enabled.
>>
>>Pairing the calls makes life much much easier.
B: Totally agree with you that strict pairing of enable/disable calls has many advantages.
This brings us to the key question: What should change - the API, or i915?
And based on what Imre described regarding i915's design, and as much as I reviewed and understood myself, i915 by design purposefully triggers polling dynamically in response to disabling
IRQs, reflecting the real hardware behavior. I believe that requiring strict paring of enable/disable calls would complicate i915's hotplug logic and as this bug demonstrates could lead to real regressions in production systems.
My option would be the pragmatic behavior. For LTS and stable kernels I favor fixing the immediately observed regression, even if that means relaxing the strictness of the DRM helper. And later LTS should have a fuller upstream refactor (such as with drm_kms_helper_poll_reschedule(), other improvements, etc) can be safely backported.
Nicu
>>
>>> 2.For non-polling drivers like hyperv-drm, the existing checks on
>>> poll_enabled in both enable and disable functions are sufficient.
>>> Removing poll_running doesn't affect these drivers-they don't go
>>> through the polling code paths, so no polling gets scheduled or
>>> canceled by mistake
>>>
>>> Therefore based on my understanding and testing removing poll_running guard
>>not only fixes a real bug but also does not introduce new regressions for drivers
>>that don't use output polling.
>>
>>--
>>With best wishes
>>Dmitry
More information about the Intel-gfx
mailing list