[PATCH 2/2] drm/i915/pmu: Connect engine busyness stats from GuC to pmu
Umesh Nerlige Ramappa
umesh.nerlige.ramappa at intel.com
Sat Oct 22 00:21:00 UTC 2022
On Fri, Oct 21, 2022 at 09:42:53AM +0100, Tvrtko Ursulin wrote:
>
>On 27/10/2021 01:48, Umesh Nerlige Ramappa wrote:
>
>[snip]
>
>>+static void guc_timestamp_ping(struct work_struct *wrk)
>>+{
>>+ struct intel_guc *guc = container_of(wrk, typeof(*guc),
>>+ timestamp.work.work);
>>+ struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
>>+ struct intel_gt *gt = guc_to_gt(guc);
>>+ intel_wakeref_t wakeref;
>>+ unsigned long flags;
>>+ int srcu, ret;
>>+
>>+ /*
>>+ * Synchronize with gt reset to make sure the worker does not
>>+ * corrupt the engine/guc stats.
>>+ */
>>+ ret = intel_gt_reset_trylock(gt, &srcu);
>>+ if (ret)
>>+ return;
>>+
>>+ spin_lock_irqsave(&guc->timestamp.lock, flags);
>>+
>>+ with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
>>+ __update_guc_busyness_stats(guc);
>
>Spotted one splat today: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12268/bat-adlp-4/igt@i915_pm_rpm@basic-pci-d3-state.html
>
>Could be that reset lock needs to be inside the rpm get. Haven't really thought about it much, could you please check?
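Just to check that I am reading the suggestion correctly - something
along these lines, with the wakeref taken before the reset trylock?
Untested sketch, error paths trimmed and the rest of the worker
unchanged:

static void guc_timestamp_ping(struct work_struct *wrk)
{
	struct intel_guc *guc = container_of(wrk, typeof(*guc),
					     timestamp.work.work);
	struct intel_gt *gt = guc_to_gt(guc);
	intel_wakeref_t wakeref;
	unsigned long flags;
	int srcu, ret;

	/* Take the rpm wakeref first ... */
	wakeref = intel_runtime_pm_get(&gt->i915->runtime_pm);

	/* ... so that the reset lock now nests inside the rpm get */
	ret = intel_gt_reset_trylock(gt, &srcu);
	if (ret)
		goto out_rpm;

	spin_lock_irqsave(&guc->timestamp.lock, flags);
	__update_guc_busyness_stats(guc);
	spin_unlock_irqrestore(&guc->timestamp.lock, flags);

	intel_gt_reset_unlock(gt, srcu);
out_rpm:
	intel_runtime_pm_put(&gt->i915->runtime_pm, wakeref);
}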
>
><4> [300.214744]
><4> [300.214753] ======================================================
><4> [300.214755] WARNING: possible circular locking dependency detected
><4> [300.214758] 6.1.0-rc1-CI_DRM_12268-g86e8558e3283+ #1 Not tainted
><4> [300.214761] ------------------------------------------------------
><4> [300.214762] kworker/10:1H/265 is trying to acquire lock:
><4> [300.214765] ffffffff8275e560 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x27/0x170
><4> [300.214780]
>but task is already holding lock:
><4> [300.214782] ffffc900013e7e78 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: process_one_work+0x1eb/0x5b0
><4> [300.214793]
>which lock already depends on the new lock.
><4> [300.214794]
>the existing dependency chain (in reverse order) is:
><4> [300.214796]
>-> #2 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
><4> [300.214801] lock_acquire+0xd3/0x310
><4> [300.214806] __flush_work+0x77/0x4e0
><4> [300.214811] __cancel_work_timer+0x14e/0x1f0
><4> [300.214815] intel_guc_submission_reset_prepare+0x7a/0x420 [i915]
><4> [300.215119] intel_uc_reset_prepare+0x44/0x50 [i915]
><4> [300.215360] reset_prepare+0x21/0x80 [i915]
><4> [300.215561] intel_gt_reset+0x143/0x340 [i915]
><4> [300.215757] intel_gt_reset_global+0xeb/0x160 [i915]
><4> [300.215946] intel_gt_handle_error+0x2c2/0x410 [i915]
><4> [300.216137] intel_gt_debugfs_reset_store+0x59/0xc0 [i915]
><4> [300.216333] i915_wedged_set+0xc/0x20 [i915]
><4> [300.216513] simple_attr_write+0xda/0x100
><4> [300.216520] full_proxy_write+0x4e/0x80
><4> [300.216525] vfs_write+0xe3/0x4e0
><4> [300.216531] ksys_write+0x57/0xd0
><4> [300.216535] do_syscall_64+0x37/0x90
><4> [300.216542] entry_SYSCALL_64_after_hwframe+0x63/0xcd
><4> [300.216549]
>-> #1 (&gt->reset.mutex){+.+.}-{3:3}:
><4> [300.216556] lock_acquire+0xd3/0x310
><4> [300.216559] i915_gem_shrinker_taints_mutex+0x2d/0x50 [i915]
i915_gem_shrinker_taints_mutex seems to have something to do with
fs_reclaim, and so does stack #0. Any idea what this early-init code is
doing? Could this code also end up in a gt_wedged case? That might
explain stack #2, which is a reset.
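My rough guess at what the taints helper does (from memory, not checked
against the tree, so take with a grain of salt) is that it only primes
lockdep with an artificial fs_reclaim -> mutex dependency at init time,
roughly:

void i915_gem_shrinker_taints_mutex(struct drm_i915_private *i915,
				    struct mutex *mutex)
{
	if (!IS_ENABLED(CONFIG_LOCKDEP))
		return;

	/* Record fs_reclaim -> mutex in lockdep without really taking it */
	fs_reclaim_acquire(GFP_KERNEL);
	mutex_acquire(&mutex->dep_map, 0, 0, _RET_IP_);
	mutex_release(&mutex->dep_map, _RET_IP_);
	fs_reclaim_release(GFP_KERNEL);
}

If that is accurate, stack #1 would just be the init-time annotation for
gt->reset.mutex rather than an actual reset at probe, but I might be
misreading it.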
><4> [300.216799] intel_gt_init_reset+0x61/0x80 [i915]
><4> [300.217018] intel_gt_common_init_early+0x10c/0x190 [i915]
><4> [300.217227] intel_root_gt_init_early+0x44/0x60 [i915]
><4> [300.217434] i915_driver_probe+0x9ab/0xf30 [i915]
><4> [300.217615] i915_pci_probe+0xa5/0x240 [i915]
><4> [300.217796] pci_device_probe+0x95/0x110
><4> [300.217803] really_probe+0xd6/0x350
><4> [300.217811] __driver_probe_device+0x73/0x170
><4> [300.217816] driver_probe_device+0x1a/0x90
><4> [300.217821] __driver_attach+0xbc/0x190
><4> [300.217826] bus_for_each_dev+0x72/0xc0
><4> [300.217831] bus_add_driver+0x1bb/0x210
><4> [300.217835] driver_register+0x66/0xc0
><4> [300.217841] 0xffffffffa093001f
><4> [300.217844] do_one_initcall+0x53/0x2f0
><4> [300.217849] do_init_module+0x45/0x1c0
><4> [300.217855] load_module+0x1d5e/0x1e90
><4> [300.217859] __do_sys_finit_module+0xaf/0x120
><4> [300.217864] do_syscall_64+0x37/0x90
><4> [300.217869] entry_SYSCALL_64_after_hwframe+0x63/0xcd
><4> [300.217875]
>-> #0 (fs_reclaim){+.+.}-{0:0}:
><4> [300.217880] validate_chain+0xb3d/0x2000
><4> [300.217884] __lock_acquire+0x5a4/0xb70
><4> [300.217888] lock_acquire+0xd3/0x310
><4> [300.217891] fs_reclaim_acquire+0xa1/0xd0
fs_reclaim being acquired here ^
><4> [300.217896] __kmem_cache_alloc_node+0x27/0x170
><4> [300.217899] __kmalloc+0x43/0x1a0
><4> [300.217903] acpi_ns_internalize_name+0x44/0x9f
><4> [300.217909] acpi_ns_get_node_unlocked+0x6b/0xd7
><4> [300.217914] acpi_ns_get_node+0x3b/0x54
><4> [300.217918] acpi_get_handle+0x89/0xb7
><4> [300.217922] acpi_has_method+0x1c/0x40
><4> [300.217928] acpi_pci_set_power_state+0x42/0xf0
><4> [300.217935] pci_power_up+0x20/0x1a0
><4> [300.217940] pci_pm_default_resume_early+0x9/0x30
><4> [300.217945] pci_pm_runtime_resume+0x29/0x90
><4> [300.217948] __rpm_callback+0x3d/0x110
><4> [300.217954] rpm_callback+0x58/0x60
><4> [300.217959] rpm_resume+0x548/0x760
><4> [300.217963] __pm_runtime_resume+0x42/0x80
><4> [300.217968] __intel_runtime_pm_get+0x19/0x80 [i915]
><4> [300.218170] guc_timestamp_ping+0x63/0xc0 [i915]
><4> [300.218467] process_one_work+0x272/0x5b0
><4> [300.218472] worker_thread+0x37/0x370
><4> [300.218477] kthread+0xed/0x120
><4> [300.218481] ret_from_fork+0x1f/0x30
><4> [300.218485]
If the suspend has completed, I am not sure why guc_timestamp_ping is
getting called and ends up doing a pci_power_up in this stack. The park
should have synchronously canceled the ping worker. Strange.
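For reference, what I expect the park path to do is a synchronous
cancel of the worker before the GT wakeref is dropped, roughly along
these lines (paraphrasing from memory, the exact names/details in the
tree may differ):

void intel_guc_busyness_park(struct intel_gt *gt)
{
	struct intel_guc *guc = &gt->uc.guc;

	if (!guc_submission_initialized(guc))
		return;

	/* No ping (and hence no rpm get) should be possible after park */
	cancel_delayed_work_sync(&guc->timestamp.work);
	__update_guc_busyness_stats(guc);
}

If the cancel is not synchronous, a ping already running on another CPU
could still race with the rpm suspend, which might explain the resume
in stack #0.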
>other info that might help us debug this:
><4> [300.218487] Chain exists of:
> fs_reclaim --> &gt->reset.mutex --> (work_completion)(&(&guc->timestamp.work)->work)
><4> [300.218495] Possible unsafe locking scenario:
><4> [300.218497] CPU0 CPU1
><4> [300.218499] ---- ----
><4> [300.218501] lock((work_completion)(&(&guc->timestamp.work)->work));
><4> [300.218505] lock(&gt->reset.mutex);
><4> [300.218509] lock((work_completion)(&(&guc->timestamp.work)->work));
><4> [300.218512] lock(fs_reclaim);
><4> [300.218515]
> *** DEADLOCK ***
Still looking into it; I could use some help with the above questions.
Thanks,
Umesh
>
>Regards,
>
>Tvrtko