<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Am 17.12.24 um 01:26 schrieb Matthew Brost:<br>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<pre class="moz-quote-pre" wrap="">On Fri, Nov 22, 2024 at 02:36:59PM +0000, Tvrtko Ursulin wrote:
</pre>
<blockquote type="cite">[SNIP]<span style="white-space: pre-wrap">
</span>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap=""></pre>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Do we have system wide workqueues for that? It seems a bit
overkill that amdgpu has to allocate one on his own.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I wondered the same but did not find any. Only ones I am aware
of are system_wq&co created in workqueue_init_early().
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Gentle ping on this. I don't have any better ideas that creating a
new wq.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
It took me a moment to realize, but I now think this warning message is
a false positive.
What happens is that the code calls cancel_delayed_work_sync().
If the work item never run because of lack of memory then it can just be
canceled.
If the work item is running then we will block for it to finish.
</pre>
</blockquote>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Apologies for the late reply. Alex responded to another thread and CC'd
me, which reminded me to reply here.
The execution of the non-reclaim worker could have led to a few scenarios:
- It might have triggered reclaim through its own memory allocation.</pre>
</blockquote>
<br>
That is unrelated and has nothing todo with WQ_MEM_RECLAIM.<br>
<br>
What we should do is to make sure that the lockdep annotation covers
all workers who play a role in fence signaling.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<pre class="moz-quote-pre" wrap="">- It could have been running and then context-switched out, with reclaim
being triggered elsewhere in the mean time, pausing the execution of
the non-reclaim worker.</pre>
</blockquote>
<br>
As far as I know non-reclaim workers are not paused because a
reclaim worker is running, that would be really new to me.<br>
<br>
What happens is that here (from workqueue.c):<br>
<br>
<pre style="box-sizing: inherit; font-family: "Ubuntu Mono", monospace; font-size: 0.9em; line-height: 1.2; padding: 1em 0px 4em; color: rgb(0, 0, 0); margin: 0px; min-height: 100%; tab-size: 8; white-space: pre; word-spacing: 0px; word-break: normal; overflow-wrap: normal; hyphens: none; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><span id="codeline-3414" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * Workqueue rescuer thread function. There's one rescuer for each</span>
</span><span id="codeline-3415" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * workqueue which has WQ_MEM_RECLAIM set.</span>
</span><span id="codeline-3416" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> *</span>
</span><span id="codeline-3417" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * Regular work processing on a pool may block trying to create a new</span>
</span><span id="codeline-3418" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * worker which uses GFP_KERNEL allocation which has slight chance of</span>
</span><span id="codeline-3419" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * developing into deadlock if some works currently on the same queue</span>
</span><span id="codeline-3420" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * need to be processed to satisfy the GFP_KERNEL allocation. This is</span>
</span><span id="codeline-3421" style="box-sizing: inherit; vertical-align: top; display: block; padding-left: 1em;"><span class="cm" style="box-sizing: inherit; vertical-align: top; color: slategray; font-style: italic;"> * the problem rescuer solves.</span></span></pre>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<pre class="moz-quote-pre" wrap="">In either case, during reclaim, if you wait on a DMA fence that depends
on the DRM scheduler worker,and that worker attempts to flush the above
non-reclaim worker, it will result in a deadlock.</pre>
</blockquote>
<br>
Well that is only partially correct.<br>
<br>
It's true that the worker we wait for can't wait for DMA-fence or do
memory allocations who wait for DMA-fences. But WQ_MEM_RECLAIM is
not related to any DMA fence annotation.<br>
<br>
What happens instead is that the kernel always keeps a kernel thread
pre-allocated so that it can guarantee that the worker can start
without allocating memory.<br>
<br>
As soon as the worker runs there shouldn't be any difference in the
handling as far as I know.<br>
<br>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<pre class="moz-quote-pre" wrap="">The annotation appears correct to me, and I believe Tvrtko's patch is
indeed accurate. For what it's worth, we encountered several similar
bugs in Xe that emerged once we added the correct work queue
annotations.</pre>
</blockquote>
<br>
I think you mean something different. This is the lockdep annotation
for the workers and not WQ_MEM_RECLAIM.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">There is no need to use WQ_MEM_RECLAIM for the workqueue or do I miss
something?
If I'm not completely mistaken you stumbled over a bug in the warning
code instead :)
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Hmm your thinking sounds convincing.
Adding Tejun if he has time to help brainstorm this.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Tejun could likely provide insight into whether my above assessment is
correct.</pre>
</blockquote>
<blockquote type="cite" cite="mid:Z2DFKfqzhlyHOjpd@lstrano-desk.jf.intel.com">
<pre class="moz-quote-pre" wrap="">
Matt
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Question is - does check_flush_dependency() need to skip the !WQ_MEM_RECLAIM
flushing WQ_MEM_RECLAIM warning *if* the work is already running *and* it
was called from cancel_delayed_work_sync()?
Regards,
Tvrtko
</pre>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Apart from that looks good to me.
Regards,
Christian.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
Signed-off-by: Tvrtko Ursulin <a class="moz-txt-link-rfc2396E" href="mailto:tvrtko.ursulin@igalia.com"><tvrtko.ursulin@igalia.com></a>
References: 746ae46c1113 ("drm/sched: Mark scheduler
work queues with WQ_MEM_RECLAIM")
Fixes: a6149f039369 ("drm/sched: Convert drm scheduler
to use a work queue rather than kthread")
Cc: <a class="moz-txt-link-abbreviated" href="mailto:stable@vger.kernel.org">stable@vger.kernel.org</a>
Cc: Matthew Brost <a class="moz-txt-link-rfc2396E" href="mailto:matthew.brost@intel.com"><matthew.brost@intel.com></a>
Cc: Danilo Krummrich <a class="moz-txt-link-rfc2396E" href="mailto:dakr@kernel.org"><dakr@kernel.org></a>
Cc: Philipp Stanner <a class="moz-txt-link-rfc2396E" href="mailto:pstanner@redhat.com"><pstanner@redhat.com></a>
Cc: Alex Deucher <a class="moz-txt-link-rfc2396E" href="mailto:alexander.deucher@amd.com"><alexander.deucher@amd.com></a>
Cc: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com"><christian.koenig@amd.com></a>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 25
+++++++++++++++++++++++++
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 5 +++--
3 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 7645e498faa4..a6aad687537e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -268,6 +268,8 @@ extern int amdgpu_agp;
extern int amdgpu_wbrf;
+extern struct workqueue_struct *amdgpu_reclaim_wq;
+
#define AMDGPU_VM_MAX_NUM_CTX 4096
#define AMDGPU_SG_THRESHOLD (256*1024*1024)
#define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..f5b7172e8042 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -255,6 +255,8 @@ struct amdgpu_watchdog_timer
amdgpu_watchdog_timer = {
.period = 0x0, /* default to 0x0 (timeout disable) */
};
+struct workqueue_struct *amdgpu_reclaim_wq;
+
/**
* DOC: vramlimit (int)
* Restrict the total amount of VRAM in MiB for
testing. The default is 0 (Use full VRAM).
@@ -2971,6 +2973,21 @@ static struct pci_driver
amdgpu_kms_pci_driver = {
.dev_groups = amdgpu_sysfs_groups,
};
+static int amdgpu_wq_init(void)
+{
+ amdgpu_reclaim_wq =
+ alloc_workqueue("amdgpu-reclaim", WQ_MEM_RECLAIM, 0);
+ if (!amdgpu_reclaim_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void amdgpu_wq_fini(void)
+{
+ destroy_workqueue(amdgpu_reclaim_wq);
+}
+
static int __init amdgpu_init(void)
{
int r;
@@ -2978,6 +2995,10 @@ static int __init amdgpu_init(void)
if (drm_firmware_drivers_only())
return -EINVAL;
+ r = amdgpu_wq_init();
+ if (r)
+ goto error_wq;
+
r = amdgpu_sync_init();
if (r)
goto error_sync;
@@ -3006,6 +3027,9 @@ static int __init amdgpu_init(void)
amdgpu_sync_fini();
error_sync:
+ amdgpu_wq_fini();
+
+error_wq:
return r;
}
@@ -3017,6 +3041,7 @@ static void __exit amdgpu_exit(void)
amdgpu_acpi_release();
amdgpu_sync_fini();
amdgpu_fence_slab_fini();
+ amdgpu_wq_fini();
mmu_notifier_synchronize();
amdgpu_xcp_drv_release();
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 2f3f09dfb1fd..f8fd71d9382f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -790,8 +790,9 @@ void amdgpu_gfx_off_ctrl(struct
amdgpu_device *adev, bool enable)
AMD_IP_BLOCK_TYPE_GFX, true))
adev->gfx.gfx_off_state = true;
} else {
- schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
- delay);
+ queue_delayed_work(amdgpu_reclaim_wq,
+ &adev->gfx.gfx_off_delay_work,
+ delay);
}
}
} else {
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
</blockquote>
</blockquote>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
</blockquote>
</blockquote>
</blockquote>
<br>
</body>
</html>