<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<br>
<br>
<div class="moz-cite-prefix">On 9/8/2023 10:16 PM, Umesh Nerlige
Ramappa wrote:<br>
</div>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">The worker is canceled in the __gt_park path, but we still see it
running sometimes during suspend. This is likely because some code is
getting a gt wakeref in the __gt_park path.</pre>
</blockquote>
This possible root-cause doesn't seem plausible to me, because a gt
wakeref would cause an unpark, so taking it within the park would
probably cause a deadlock. Is it not more likely that the worker
re-queued itself?<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">
Only update stats if gt is awake. If not, intel_guc_busyness_park would
have already updated the stats. Note that we do not requeue the worker
if gt is not awake since intel_guc_busyness_unpark would do that at some
point.
If the gt was parked longer than time taken for GT timestamp to roll
over, we ignore those rollovers since we don't care about tracking the
exact GT time. We only care about roll overs when the gt is active and
running workloads.
Closes: <a class="moz-txt-link-freetext" href="https://gitlab.freedesktop.org/drm/intel/-/issues/7077">https://gitlab.freedesktop.org/drm/intel/-/issues/7077</a></pre>
</blockquote>
This needs a fixes tag. Also, I'm not 100% sure but I believe we
prefer "Link" to "Closes".<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">Signed-off-by: Umesh Nerlige Ramappa <a class="moz-txt-link-rfc2396E" href="mailto:umesh.nerlige.ramappa@intel.com"><umesh.nerlige.ramappa@intel.com></a>
---
.../gpu/drm/i915/gt/uc/intel_guc_submission.c | 27 ++++++++++++++++---
1 file changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e250bedf90fb..df31d6047ce9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1457,10 +1457,27 @@ static void guc_timestamp_ping(struct work_struct *wrk)
struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
struct intel_gt *gt = guc_to_gt(guc);
struct intel_context *ce;
- intel_wakeref_t wakeref;
unsigned long index;
int srcu, ret;
+ /*
+ * The worker is canceled in the __gt_park path, but we still see it
+ * running sometimes during suspend. This is likely because some code
+ * is getting a gt wakeref in the __gt_park path.</pre>
</blockquote>
<br>
Same comment from before about this explanation. I would just remove
this part from the comment.<br>
<br>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">
+ *
+ * Only update stats if gt is awake. If not, intel_guc_busyness_park
+ * would have already updated the stats. Note that we do not requeue the
+ * worker in this case since intel_guc_busyness_unpark would do that at
+ * some point.
+ *
+ * If the gt was parked longer than time taken for GT timestamp to roll
+ * over, we ignore those rollovers since we don't care about tracking
+ * the exact GT time. We only care about roll overs when the gt is
+ * active and running workloads.
+ */
+ if (!intel_gt_pm_get_if_awake(gt))
+ return;
+</pre>
</blockquote>
<br>
Do we need to drop the _sync from the busyness stats worker parking
if we take the gt_pm wakeref in here, instead of an rpm one? because
if the gt_pm_put below causes a park and the park waits on this
worker to complete then we'll deadlock.<br>
<br>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">
/*
* Synchronize with gt reset to make sure the worker does not
* corrupt the engine/guc stats. NB: can't actually block waiting
@@ -1468,17 +1485,19 @@ static void guc_timestamp_ping(struct work_struct *wrk)
* this worker thread if started. So waiting would deadlock.
*/
ret = intel_gt_reset_trylock(gt, &srcu);
- if (ret)
+ if (ret) {
+ intel_gt_pm_put(gt);
return;
+ }
- with_intel_runtime_pm(>->i915->runtime_pm, wakeref)
- __update_guc_busyness_stats(guc);
+ __update_guc_busyness_stats(guc);
/* adjust context stats for overflow */
xa_for_each(&guc->context_lookup, index, ce)
guc_context_update_stats(ce);
intel_gt_reset_unlock(gt, srcu);
+ intel_gt_pm_put(gt);</pre>
</blockquote>
<br>
I think this needs to go after the queuing, because it could cause a
park and if it does we don't want to re-queue the worker immediately
after, while if we queue it before then the park will cancel it.<br>
Non-blocking style comment: with gt_pm_put the last thing in
function, you can also transform that early return in a "goto put;"
and have a single place for the gt_put.<br>
<br>
Daniele<br>
<br>
<blockquote type="cite" cite="mid:20230909051645.1653965-1-umesh.nerlige.ramappa@intel.com">
<pre class="moz-quote-pre" wrap="">
guc_enable_busyness_worker(guc);
}
</pre>
</blockquote>
<br>
</body>
</html>