[PATCH 6/7] drm/xe/client: Print runtime to fdinfo

Umesh Nerlige Ramappa umesh.nerlige.ramappa at intel.com
Thu Apr 18 23:12:43 UTC 2024


On Tue, Apr 16, 2024 at 10:11:30PM -0500, Lucas De Marchi wrote:
>On Tue, Apr 16, 2024 at 04:20:54PM GMT, Umesh Nerlige Ramappa wrote:
>>On Mon, Apr 15, 2024 at 08:04:53PM -0700, Lucas De Marchi wrote:
>>>Print the accumulated runtime, per client, when printing fdinfo.
>>>Each time a query is done it first does 2 things:
>>>
>>>1) Loop through all the exec queues for the current client and
>>> accumulate the runtime, per engine class. CTX_TIMESTAMP, read from
>>> the context image, is used for that.
>>>
>>>2) Read a "GPU timestamp" that can be used to tell "how much GPU
>>> time has passed" and that has the same unit/ref-clock as the one
>>> recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>
>>>This second part is done once per engine class, since it's a register
>>>that is replicated on all engines. It is, however, the same stamp, at
>>>least on the current GPUs this was tested on. It could be simplified,
>>>but to play it safe and cover the case where the clock differs in
>>>future between primary/media GTs, or across engine classes, just read
>>>it per class.
>>>
>>>This is exported to userspace as 2 numbers in fdinfo:
>>>
>>>	drm-engine-<class>: <GPU_TIMESTAMP> <RUNTIME> ticks
>>>
>>>Userspace is expected to collect at least 2 samples, which allows it
>>>to compute the engine busyness as:
>>>
>>>		    RUNTIME1 - RUNTIME0
>>>	busyness = ---------------------
>>>			  T1 - T0
>>>
>>>When calculating the overall system busyness, userspace can loop through
>>>all the clients and add up all the numbers.  Since the GPU timestamps
>>>will differ slightly from client to client, some fluctuation in accuracy
>>>is expected, but that may be improved in future with a better
>>>hardware/GuC interface while maintaining the UAPI.
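
For reference, the userspace side of this would look roughly like the
sketch below. The fdinfo path, the "rcs" class name and the 2 second
sampling period are just placeholders, and error handling is kept to a
minimum:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Find one "drm-engine-<class>: <gpu_ts> <runtime> ticks" line */
static int sample(const char *path, const char *class_name,
		  uint64_t *gpu_ts, uint64_t *runtime)
{
	char key[64], line[256];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;

	snprintf(key, sizeof(key), "drm-engine-%s:", class_name);

	while (fgets(line, sizeof(line), f)) {
		char name[64];
		uint64_t ts, rt;

		if (sscanf(line, "%63s %" SCNu64 " %" SCNu64,
			   name, &ts, &rt) == 3 && !strcmp(name, key)) {
			*gpu_ts = ts;
			*runtime = rt;
			ret = 0;
			break;
		}
	}

	fclose(f);
	return ret;
}

int main(void)
{
	/* hypothetical DRM client fd: /proc/<pid>/fdinfo/<fd> */
	const char *path = "/proc/1234/fdinfo/5";
	uint64_t t0, r0, t1, r1;

	if (sample(path, "rcs", &t0, &r0))
		return 1;
	sleep(2);
	if (sample(path, "rcs", &t1, &r1))
		return 1;

	/* both deltas are in the same ticks, so the ratio is unitless */
	printf("busyness: %.1f%%\n", 100.0 * (r1 - r0) / (t1 - t0));

	return 0;
}
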
>>>
>>>Another thing to point out is that userspace is expected to read any
>>>2 samples every few seconds.  Given the update frequency of the
>>>counters involved and that CTX_TIMESTAMP is 32b, it is expected to
>>>wrap every 25 ~ 30 seconds.
>>
>>I think the worker I had implemented in my original patches (as well
>>as the i915 implementation) may have been confusing.
>>
>>A 32 bit counter ticking at 19.2 MHz should wrap around in 223.696
>>seconds. The worker, on the other hand, was overcompensating for it
>>and running at 1/8th of this wrap-around time.
>
>now I'm not sure if I did the math wrong or if I replicated something
>you told me :)... What's the reason to overcompensate like that?

I think I left it as is when I got the change from the original author.  
Anything less than the overflow time should do.
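Concretely, 2^32 ticks / 19.2 MHz = 4294967296 / 19200000 ≈ 223.7 s,
and 223.7 s / 8 ≈ 28 s, so any period comfortably below ~223 s is
enough to catch every wrap.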

>
>My main issue with the worker is not with having it run while a
>gputop-like app is running. That should be totally fine. My main issue
>is running it regardless of whether any process is querying, which
>would be the case ~99.9% of the time.

If the use cases Tvrtko is mentioning exist (low-frequency queries),
then I am afraid we don't have another way. We could try to tightly
control when the worker runs. Assuming the worker runs every 200s, for
example (rough sketch after the list):

1) initialize the worker on first queue create (or first job scheduled)
2) reset the worker on every query from the process. This should
prevent it from running when there are frequent queries.
3) stop/start the worker on suspend/resume (I don't see park/unpark in XE)
4) stop the worker on last queue destroy (OR last job completion)
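
Very roughly, and only as a sketch of the control flow: xef->runtime_work
and sample_client_runtimes() are made-up names here, and the period is
just something comfortably below the wrap:

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* anything well below the ~223s wrap of a 32b counter at 19.2 MHz */
#define RUNTIME_SAMPLE_PERIOD	(200 * HZ)

/*
 * Assumes a hypothetical struct delayed_work runtime_work member added
 * to struct xe_file, set up with
 * INIT_DELAYED_WORK(&xef->runtime_work, runtime_sample_fn) at open time.
 */
static void runtime_sample_fn(struct work_struct *work)
{
	struct xe_file *xef = container_of(to_delayed_work(work),
					   struct xe_file, runtime_work);

	sample_client_runtimes(xef);	/* accumulate CTX_TIMESTAMP deltas */
	schedule_delayed_work(&xef->runtime_work, RUNTIME_SAMPLE_PERIOD);
}

/* 1) arm on first queue create (or first job scheduled) */
static void runtime_sample_start(struct xe_file *xef)
{
	schedule_delayed_work(&xef->runtime_work, RUNTIME_SAMPLE_PERIOD);
}

/* 2) a query just sampled everything, so push the next run out */
static void runtime_sample_on_query(struct xe_file *xef)
{
	mod_delayed_work(system_wq, &xef->runtime_work,
			 RUNTIME_SAMPLE_PERIOD);
}

/* 3) and 4): suspend or last queue destroyed, stop the worker */
static void runtime_sample_stop(struct xe_file *xef)
{
	cancel_delayed_work_sync(&xef->runtime_work);
}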

I agree this can get hairy very quickly and we have had all sorts of issues
in i915 with the worker. If we can get the UMD tools to compensate for
this, that would be my first choice too.

Umesh
>
>>
>>>This could be mitigated by adding a workqueue to
>>>accumulate the counters every so often, but it's additional complexity
>>>for something that is already done by userspace every few seconds in
>>>tools like gputop (from igt), htop, nvtop, etc.
>>>
>>>Test-with: https://lore.kernel.org/igt-dev/20240405060056.59379-1-lucas.demarchi@intel.com/
>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>---
>>>drivers/gpu/drm/xe/xe_drm_client.c | 81 +++++++++++++++++++++++++++++-
>>>1 file changed, 80 insertions(+), 1 deletion(-)
>>>
>>>diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
>>>index 08f0b7c95901..79eb453bfb14 100644
>>>--- a/drivers/gpu/drm/xe/xe_drm_client.c
>>>+++ b/drivers/gpu/drm/xe/xe_drm_client.c
>>>@@ -2,6 +2,7 @@
>>>/*
>>>* Copyright © 2023 Intel Corporation
>>>*/
>>>+#include "xe_drm_client.h"
>>>
>>>#include <drm/drm_print.h>
>>>#include <drm/xe_drm.h>
>>>@@ -12,7 +13,10 @@
>>>#include "xe_bo.h"
>>>#include "xe_bo_types.h"
>>>#include "xe_device_types.h"
>>>-#include "xe_drm_client.h"
>>>+#include "xe_exec_queue.h"
>>>+#include "xe_gt.h"
>>>+#include "xe_hw_engine.h"
>>>+#include "xe_pm.h"
>>>#include "xe_trace.h"
>>>
>>>/**
>>>@@ -179,6 +183,80 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>>>	}
>>>}
>>>
>>>+static const u64 class_to_mask[] = {
>>>+        [XE_ENGINE_CLASS_RENDER] = XE_HW_ENGINE_RCS_MASK,
>>>+        [XE_ENGINE_CLASS_VIDEO_DECODE] = XE_HW_ENGINE_VCS_MASK,
>>>+        [XE_ENGINE_CLASS_VIDEO_ENHANCE] = XE_HW_ENGINE_VECS_MASK,
>>>+        [XE_ENGINE_CLASS_COPY] = XE_HW_ENGINE_BCS_MASK,
>>>+        [XE_ENGINE_CLASS_OTHER] = XE_HW_ENGINE_GSCCS_MASK,
>>>+        [XE_ENGINE_CLASS_COMPUTE] = XE_HW_ENGINE_CCS_MASK,
>>>+};
>>>+
>>>+static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>>+{
>>>+	struct xe_file *xef = file->driver_priv;
>>>+	struct xe_device *xe = xef->xe;
>>>+	struct xe_gt *gt;
>>>+	struct xe_hw_engine *hwe;
>>>+	struct xe_exec_queue *q;
>>>+	unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
>>>+	u64 gpu_timestamp, engine_mask = 0;
>>>+	bool gpu_stamp = false;
>>>+
>>>+	xe_pm_runtime_get(xe);
>>>+
>>>+	mutex_lock(&xef->exec_queue.lock);
>>>+	xa_for_each(&xef->exec_queue.xa, i, q)
>>>+		xe_exec_queue_update_runtime(q);
>>>+	mutex_unlock(&xef->exec_queue.lock);
>>>+
>>>+	for_each_gt(gt, xe, id_gt)
>>>+		engine_mask |= gt->info.engine_mask;
>>>+
>>>+	BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>>+	for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>>+		capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>>+
>>>+	/*
>>>+	 * Iterate over all engines, printing the accumulated
>>>+	 * runtime for this xef per engine class
>>>+	 */
>>>+	for_each_gt(gt, xe, id_gt) {
>>>+		xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>>+		for_each_hw_engine(hwe, gt, id_hwe) {
>>>+			const char *class_name;
>>>+
>>>+			if (!capacity[hwe->class])
>>>+				continue;
>>>+
>>>+			/*
>>>+			 * Use any (first) engine to have a timestamp to be used every
>>>+			 * time
>>>+			 */
>>>+			if (!gpu_stamp) {
>>
>>gpu_stamp is not required if the intention is to read the timestamp 
>>once per class (as per commit message) because you are already 
>>setting capacity to 0 below.
>
>yeah... I think I changed my mind and forgot to update the commit
>message.
>
>thanks
>Lucas De Marchi
>
>>
>>Umesh
>>
>>>+				gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>>+				gpu_stamp = true;
>>>+			}
>>>+
>>>+			class_name = xe_hw_engine_class_to_str(hwe->class);
>>>+
>>>+			drm_printf(p, "drm-engine-%s:\t%llu %llu ticks\n",
>>>+				   class_name, gpu_timestamp,
>>>+				   xef->runtime[hwe->class]);
>>>+
>>>+			if (capacity[hwe->class] > 1)
>>>+				drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>>+					   class_name, capacity[hwe->class]);
>>>+
>>>+			/* engine class already handled, skip next iterations */
>>>+			capacity[hwe->class] = 0;
>>>+		}
>>>+		xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>>+	}
>>>+
>>>+	xe_pm_runtime_put(xe);
>>>+}
>>>+
>>>/**
>>>* xe_drm_client_fdinfo() - Callback for fdinfo interface
>>>* @p: The drm_printer ptr
>>>@@ -192,5 +270,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>>>void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
>>>{
>>>	show_meminfo(p, file);
>>>+	show_runtime(p, file);
>>>}
>>>#endif
>>>-- 
>>>2.43.0
>>>

