[Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Fri Jul 1 18:09:47 UTC 2022
On Fri, Jul 01, 2022 at 09:37:20AM +0100, Tvrtko Ursulin wrote:
>
>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>In the driver teardown, we are unregistering the gt prior
>>>to unregistering the PMU. This means there is a small window
>>>of time in which the application can request metrics from the
>>>PMU, some of which are calling into the uapi engines list,
>>>while the engines are not available. In this case we can
>>>see null pointer dereferences.
>>>
>>>Fix this ordering in both the driver load and unload sequences.
>>>
>>>Additionally add a check for engine presence to prevent this
>>>NPD in the event this ordering is accidentally reversed. Print
>>>a debug message indicating when they aren't available.
>>>
>>>v1: Actually address the driver load/unload ordering issue
>>>
>>>Signed-off-by: Stuart Summers <stuart.summers at intel.com>
>>>---
>>
>>I thought this is likely happening because intel_gpu_top is running
>>in the background when i915 is unloaded. I tried a quick repro, I
>>don't see the unload succeed ("fatal module in use", not sure if
>>this was a partial unload), but when I try to kill intel_gpu_top, I
>>get an NPD. This is in the event disable path - i915_pmu_event_stop
>>-> i915_pmu_disable.
>
>So i915 failed to unload (as expected - with perf events open we
>elevate the module ref count via i915_pmu_event_init -> drm_dev_get),
>then you quit intel_gpu_top and get NPD?
I was just trying to point out an instance of the failure that I saw
when running this execution sequence. This is without any of Stuart's
changes.
> On the engine lookup?
Don't know if it is specifically engine lookup, but it's in the path of
i915_pmu_event_stop which executes on killing intel_gpu_top.
> With the re-ordered init/fini sequence as from this patch?
No, without any changes from this thread.
>
>With elevated module count there shouldn't be any unloading happening
>so I am intrigued.
>
>>It's likely that you are seeing a different path (unload) leading to
>>the same issue.
>>
>>I think in i915_pmu_disable/disable should be aware of
>>event->hw.state and or pmu->closed states before accessing the
>>event. Maybe like,
>>
>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event)) {
>>
>>@Tvrtko, wondering if this case is tested by
>>igt at perf_pmu@module-unload.
>
>A bit yes. From what Stuart wrote it seems the test would need to be
>extended to cover the case where PMU is getting opened while module
>unload is in progress.
>
>But the NPD you saw is for the moment confusing so I don't know what
>is happening.
I guess it's a separate issue. I can check if Stuart's patches resolve
it and if not, will create a bug.
Regards,
Umesh
>
>>I am not clear if we should use event->hw.state or pmu->closed here
>>and if/how they are related. IMO, for this issue, the engine check
>>is good enough too, so we don't really need the pmu state checks.
>>Thoughts?
>
>Engine check at the moment feels like papering.
>
>Indeed as you say I think the pmu->closed might be the solution.
>Perhaps the race is as mentioned above. PMU open happening in parallel
>to unload..
>
>If the sequence of events userspace triggers is:
>
> i915_pmu_event_init
> i915_pmu_event_start
> i915_pmu_enable
> i915_pmu_event_read
>
>I guess pmu->closed can get set halfway in i915_pmu_event_init. What
>would be the effect of that.. We'd try to get a module reference while
>in the process of unloading. Which is probably very bad.. So possibly
>a final check on pmu->close is needed there. Ho hum.. can it be made
>safe is the question.
>
>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps the
>evens open all the time. So I think more info needed, for me at least.
>
>Regards,
>
>Tvrtko
>
>>
>>Regards,
>>Umesh
>>
>>>drivers/gpu/drm/i915/i915_driver.c | 11 ++---
>>>drivers/gpu/drm/i915/i915_pmu.c | 72 +++++++++++++++++-------------
>>>2 files changed, 48 insertions(+), 35 deletions(-)
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_driver.c
>>>b/drivers/gpu/drm/i915/i915_driver.c
>>>index deb8a8b76965..ee4dcb85d206 100644
>>>--- a/drivers/gpu/drm/i915/i915_driver.c
>>>+++ b/drivers/gpu/drm/i915/i915_driver.c
>>>@@ -717,7 +717,6 @@ static void i915_driver_register(struct
>>>drm_i915_private *dev_priv)
>>> struct drm_device *dev = &dev_priv->drm;
>>>
>>> i915_gem_driver_register(dev_priv);
>>>- i915_pmu_register(dev_priv);
>>>
>>> intel_vgpu_register(dev_priv);
>>>
>>>@@ -731,11 +730,12 @@ static void i915_driver_register(struct
>>>drm_i915_private *dev_priv)
>>> i915_debugfs_register(dev_priv);
>>> i915_setup_sysfs(dev_priv);
>>>
>>>+ intel_gt_driver_register(to_gt(dev_priv));
>>>+
>>> /* Depends on sysfs having been initialized */
>>>+ i915_pmu_register(dev_priv);
>>> i915_perf_register(dev_priv);
>>>
>>>- intel_gt_driver_register(to_gt(dev_priv));
>>>-
>>> intel_display_driver_register(dev_priv);
>>>
>>> intel_power_domains_enable(dev_priv);
>>>@@ -762,11 +762,12 @@ static void i915_driver_unregister(struct
>>>drm_i915_private *dev_priv)
>>>
>>> intel_display_driver_unregister(dev_priv);
>>>
>>>- intel_gt_driver_unregister(to_gt(dev_priv));
>>>-
>>> i915_perf_unregister(dev_priv);
>>>+ /* GT should be available until PMU is gone */
>>> i915_pmu_unregister(dev_priv);
>>>
>>>+ intel_gt_driver_unregister(to_gt(dev_priv));
>>>+
>>> i915_teardown_sysfs(dev_priv);
>>> drm_dev_unplug(&dev_priv->drm);
>>>
>>>diff --git a/drivers/gpu/drm/i915/i915_pmu.c
>>>b/drivers/gpu/drm/i915/i915_pmu.c
>>>index 958b37123bf1..796a1d8e36f2 100644
>>>--- a/drivers/gpu/drm/i915/i915_pmu.c
>>>+++ b/drivers/gpu/drm/i915/i915_pmu.c
>>>@@ -670,21 +670,28 @@ static void i915_pmu_enable(struct
>>>perf_event *event)
>>> if (is_engine_event(event)) {
>>> u8 sample = engine_event_sample(event);
>>> struct intel_engine_cs *engine;
>>>-
>>>- engine = intel_engine_lookup_user(i915,
>>>- engine_event_class(event),
>>>- engine_event_instance(event));
>>>-
>>>- BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>- I915_ENGINE_SAMPLE_COUNT);
>>>- BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>- I915_ENGINE_SAMPLE_COUNT);
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>- GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>-
>>>- engine->pmu.enable |= BIT(sample);
>>>- engine->pmu.enable_count[sample]++;
>>>+ u8 class = engine_event_class(event);
>>>+ u8 instance = engine_event_instance(event);
>>>+
>>>+ engine = intel_engine_lookup_user(i915, class, instance);
>>>+ if (engine) {
>>>+ BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.enable_count) !=
>>>+ I915_ENGINE_SAMPLE_COUNT);
>>>+ BUILD_BUG_ON(ARRAY_SIZE(engine->pmu.sample) !=
>>>+ I915_ENGINE_SAMPLE_COUNT);
>>>+ GEM_BUG_ON(sample >=
>>>+ ARRAY_SIZE(engine->pmu.enable_count));
>>>+ GEM_BUG_ON(sample >=
>>>+ ARRAY_SIZE(engine->pmu.sample));
>>>+ GEM_BUG_ON(engine->pmu.enable_count[sample] == ~0);
>>>+
>>>+ engine->pmu.enable |= BIT(sample);
>>>+ engine->pmu.enable_count[sample]++;
>>>+ } else {
>>>+ drm_dbg(&i915->drm,
>>>+ "Invalid engine event: { class:%d, inst:%d }\n",
>>>+ class, instance);
>>>+ }
>>> }
>>>
>>> spin_unlock_irqrestore(&pmu->lock, flags);
>>>@@ -714,21 +721,26 @@ static void i915_pmu_disable(struct
>>>perf_event *event)
>>> if (is_engine_event(event)) {
>>> u8 sample = engine_event_sample(event);
>>> struct intel_engine_cs *engine;
>>>-
>>>- engine = intel_engine_lookup_user(i915,
>>>- engine_event_class(event),
>>>- engine_event_instance(event));
>>>-
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>- GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>- GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>-
>>>- /*
>>>- * Decrement the reference count and clear the enabled
>>>- * bitmask when the last listener on an event goes away.
>>>- */
>>>- if (--engine->pmu.enable_count[sample] == 0)
>>>- engine->pmu.enable &= ~BIT(sample);
>>>+ u8 class = engine_event_class(event);
>>>+ u8 instance = engine_event_instance(event);
>>>+
>>>+ engine = intel_engine_lookup_user(i915, class, instance);
>>>+ if (engine) {
>>>+ GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.enable_count));
>>>+ GEM_BUG_ON(sample >= ARRAY_SIZE(engine->pmu.sample));
>>>+ GEM_BUG_ON(engine->pmu.enable_count[sample] == 0);
>>>+
>>>+ /*
>>>+ * Decrement the reference count and clear the enabled
>>>+ * bitmask when the last listener on an event goes away.
>>>+ */
>>>+ if (--engine->pmu.enable_count[sample] == 0)
>>>+ engine->pmu.enable &= ~BIT(sample);
>>>+ } else {
>>>+ drm_dbg(&i915->drm,
>>>+ "Invalid engine event: { class:%d, inst:%d }\n",
>>>+ class, instance);
>>>+ }
>>> }
>>>
>>> GEM_BUG_ON(bit >= ARRAY_SIZE(pmu->enable_count));
>>>--
>>>2.25.1
>>>
More information about the Intel-gfx
mailing list