[Intel-gfx] [PATCH] drm/i915: Fix NPD in PMU during driver teardown

Umesh Nerlige Ramappa umesh.nerlige.ramappa at intel.com
Tue Jul 12 21:03:58 UTC 2022


On Mon, Jul 04, 2022 at 09:31:55AM +0100, Tvrtko Ursulin wrote:
>
>On 01/07/2022 15:54, Summers, Stuart wrote:
>>On Fri, 2022-07-01 at 09:37 +0100, Tvrtko Ursulin wrote:
>>>On 01/07/2022 01:11, Umesh Nerlige Ramappa wrote:
>>>>On Thu, Jun 30, 2022 at 09:00:28PM +0000, Stuart Summers wrote:
>>>>>In the driver teardown, we are unregistering the gt prior
>>>>>to unregistering the PMU. This means there is a small window
>>>>>of time in which the application can request metrics from the
>>>>>PMU, some of which are calling into the uapi engines list,
>>>>>while the engines are not available. In this case we can
>>>>>see null pointer dereferences.
>>>>>
>>>>>Fix this ordering in both the driver load and unload sequences.
>>>>>
>>>>>Additionally add a check for engine presence to prevent this
>>>>>NPD in the event this ordering is accidentally reversed. Print
>>>>>a debug message indicating when they aren't available.
>>>>>
>>>>>v1: Actually address the driver load/unload ordering issue
>>>>>
>>>>>Signed-off-by: Stuart Summers <stuart.summers at intel.com>
>>>>>---
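
The diff itself is trimmed in the quote above. As a rough illustration
only, the engine presence check the commit message describes would sit
in the engine branch of i915_pmu_enable()/i915_pmu_disable(), along
these lines, reusing the existing i915_pmu.c helpers (the actual hunk
may differ):

    engine = intel_engine_lookup_user(i915,
                                      engine_event_class(event),
                                      engine_event_instance(event));
    if (!engine) {
            /* Engines already unregistered: teardown raced this callback. */
            drm_dbg(&i915->drm, "engine not available for PMU event\n");
    } else {
            /* existing per-engine enable/disable bookkeeping */
    }
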
>>>>
>>>>I thought this is likely happening because intel_gpu_top is running
>>>>in the background when i915 is unloaded. I tried a quick repro; I
>>>>don't see the unload succeed ("fatal module in use", not sure if
>>>>this was a partial unload), but when I try to kill intel_gpu_top, I
>>>>get an NPD. This is in the event disable path - i915_pmu_event_stop
>>>>-> i915_pmu_disable.
>>>
>>>So i915 failed to unload (as expected - with perf events open we
>>>elevate the module ref count via i915_pmu_event_init -> drm_dev_get),
>>>then you quit intel_gpu_top and get NPD? On the engine lookup? With
>>>the re-ordered init/fini sequence as from this patch?
>>>
>>>With elevated module count there shouldn't be any unloading
>>>happening so I am intrigued.
>>>
>>>>It's likely that you are seeing a different path (unload) leading
>>>>to the same issue.
>>>>
>>>>I think i915_pmu_enable/disable should be aware of event->hw.state
>>>>and/or pmu->closed states before accessing the event. Maybe like,
>>>>
>>>>if (event->hw.state != PERF_HES_STOPPED && is_engine_event(event)) {
>>>>
>>>>@Tvrtko, wondering if this case is tested by
>>>>igt@perf_pmu@module-unload.
>>>
>>>A bit yes. From what Stuart wrote it seems the test would need to be
>>>extended to cover the case where PMU is getting opened while module
>>>unload is in progress.
>>>
>>>But the NPD you saw is for the moment confusing so I don't know
>>>what is happening.
>>>
>>>>I am not clear if we should use event->hw.state or pmu->closed here
>>>>and if/how they are related. IMO, for this issue, the engine check
>>>>is good enough too, so we don't really need the pmu state checks.
>>>>Thoughts?
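
For reference, in context the proposed check would look roughly like
this (a sketch only; event->hw.state is the perf core state flag,
pmu->closed is the flag i915_pmu_unregister() sets, and the helpers are
the existing ones in i915_pmu.c):

    /* i915_pmu_disable(), engine branch: gate the lookup on event/pmu state */
    if (event->hw.state != PERF_HES_STOPPED &&   /* event actually running */
        !READ_ONCE(pmu->closed) &&               /* PMU not being torn down */
        is_engine_event(event)) {
            engine = intel_engine_lookup_user(i915,
                                              engine_event_class(event),
                                              engine_event_instance(event));
            /* ... existing per-engine disable bookkeeping ... */
    }
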
>>>
>>>Engine check at the moment feels like papering.
>>>
>>>Indeed as you say I think the pmu->closed might be the solution.
>>>Perhaps the race is as mentioned above. PMU open happening in
>>>parallel to unload..
>>>
>>>If the sequence of events userspace triggers is:
>>>
>>>    i915_pmu_event_init
>>>    i915_pmu_event_start
>>>    i915_pmu_enable
>>>    i915_pmu_event_read
>>>
>>>I guess pmu->closed can get set halfway in i915_pmu_event_init. What
>>>would be the effect of that.. We'd try to get a module reference
>>>while in the process of unloading. Which is probably very bad.. So
>>>possibly a final check on pmu->closed is needed there. Ho hum.. can
>>>it be made safe is the question.
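
For illustration, the kind of final check being discussed might look
roughly like this in i915_pmu_event_init() (a sketch only, assuming the
existing early pmu->closed bail-out and the drm_dev_get()/event->destroy
pattern there; whether re-checking after taking the reference actually
closes the window, rather than just narrowing it, is exactly the open
question):

    static int i915_pmu_event_init(struct perf_event *event)
    {
            struct drm_i915_private *i915 =
                    container_of(event->pmu, struct drm_i915_private, pmu.base);
            struct i915_pmu *pmu = &i915->pmu;

            if (pmu->closed)
                    return -ENODEV;

            /* ... existing config/engine validation ... */

            if (!event->parent) {
                    drm_dev_get(&i915->drm);
                    event->destroy = i915_pmu_event_destroy;

                    /*
                     * Final check: did i915_pmu_unregister() run meanwhile?
                     * Back out before perf sees a successfully opened event.
                     */
                    if (READ_ONCE(pmu->closed)) {
                            event->destroy = NULL;
                            drm_dev_put(&i915->drm);
                            return -ENODEV;
                    }
            }

            return 0;
    }
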
>>>
>>>It doesn't explain the NPD on Ctrl-C though.. intel_gpu_top keeps
>>>the events open all the time. So I think more info needed, for me
>>>at least.
>>
>>So one thing here is this doesn't have to do with module unload, but
>>module unbind specifically (while perf is open). I don't know if the
>>NPD from Umesh is the same as what we're seeing here. I'd really like
>>to separate these unless you know for sure that's related. Also it
>>would be interesting to know if this patch fixes your issue as well.
>>
>>I still think the re-ordering in i915_driver.c should be enough and we
>>shouldn't need to check pmu->closed. The unregister should be enough to
>>ensure the perf tools are notified that new events aren't allowed, and
>>at that time the engine structures are still intact. And even if for
>>some reason the perf code still calls into our function pointers, we
>>have these engine checks as a failsafe.
>>
>>I'm by the way uploading one more version here with a drm_WARN_ONCE
>>instead of the debug print.
>
>Problem is I am not a fan of papering so let's get to the bottom of the
>issue first. (In the meantime a simple patch to re-order driver fini is
>okay since that seems obvious enough, I think.)
>
>We need to see call traces from any oopses and try to extend perf_pmu 
>to catch them. And we need to understand the problem, if it is a real 
>problem, which I laid out last week about the race between module unload 
>and elevating the module use count from our perf event init.
>
>Without understanding the details of possible failure mode flows we 
>don't know how much the papering with engine checks solves and how 
>much it leaves broken.
>
>If you guys are too busy to tackle that I'll put it onto myself, but 
>help would certainly be appreciated.

Looks like Stuart/Chris are pointing towards the unbind as an issue.

I ran this sequence and only the modprobe showed an error (FATAL: ...
still in use). What happens with the unbind? Should the PMU also handle
the unbind somehow?

- run intel_gpu_top
- unbind
- modprobe -r i915
- kill intel_gpu_top.

Thanks,
Umesh

>
>Regards,
>
>Tvrtko

