[PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo

Lucas De Marchi lucas.demarchi at intel.com
Wed May 8 20:53:43 UTC 2024


On Wed, May 08, 2024 at 09:23:17AM GMT, Tvrtko Ursulin wrote:
>
>On 07/05/2024 22:35, Lucas De Marchi wrote:
>>On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>>>
>>>On 24/04/2024 00:56, Lucas De Marchi wrote:
>>>>Print the accumulated runtime for client when printing fdinfo.
>>>>Each time a query is done it first does 2 things:
>>>>
>>>>1) loop through all the exec queues for the current client and
>>>>   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>>>   that, being read from the context image.
>>>>
>>>>2) Read a "GPU timestamp" that can be used for considering "how much GPU
>>>>   time has passed" and that has the same unit/refclock as the one
>>>>   recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>>
>>>>Since for all current platforms RING_TIMESTAMP follows the same
>>>>refclock, just read it once, using any first engine.
>>>>
>>>>This is exported to userspace as 2 numbers in fdinfo:
>>>>
>>>>    drm-cycles-<class>: <RUNTIME>
>>>>    drm-total-cycles-<class>: <TIMESTAMP>
>>>>
>>>>Userspace is expected to collect at least 2 samples, which allows it to
>>>>know the client engine busyness as per:
>>>>
>>>>                RUNTIME1 - RUNTIME0
>>>>    busyness = ---------------------
>>>>                      T1 - T0
>>>>
>>>>Another thing to point out is that it's expected that userspace will
>>>>read any 2 samples every few seconds.  Given the update frequency of the
>>>>counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
>>>>each exec_queue can wrap around (assuming 100% utilization) after ~200s.
>>>>The wraparound is not perceived by userspace since it's just accumulated
>>>>for all the exec_queues in a 64-bit counter, but the measurement will
>>>>not be accurate if the samples are too far apart.
>>>>
>>>>This could be mitigated by adding a workqueue to accumulate the counters
>>>>every so often, but it's additional complexity for something that is
>>>>done already by userspace every few seconds in tools like gputop (from
>>>>igt), htop, nvtop, etc., with none of them really defaulting to 1 sample
>>>>per minute or more.
>>>>
>>>>Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>---
>>>> Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>>> Documentation/gpu/xe/index.rst              |   1 +
>>>> Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>>> drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>>>> 4 files changed, 162 insertions(+), 3 deletions(-)
>>>> create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>>>
>>>>diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
>>>>index 6dc299343b48..421766289b78 100644
>>>>--- a/Documentation/gpu/drm-usage-stats.rst
>>>>+++ b/Documentation/gpu/drm-usage-stats.rst
>>>>@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
>>>> was previously read, userspace is expected to stay with that larger previous
>>>> value until a monotonic update is seen.
>>>>+- drm-total-cycles-<keystr>: <uint>
>>>>+
>>>>+Engine identifier string must be the same as the one specified in the
>>>>+drm-cycles-<keystr> tag and shall contain the total number of cycles for the
>>>>+given engine.
>>>>+
>>>>+This is a timestamp in GPU unspecified unit that matches the update rate
>>>>+of drm-cycles-<keystr>. For drivers that implement this interface, the engine
>>>>+utilization can be calculated entirely on the GPU clock domain, without
>>>>+considering the CPU sleep time between 2 samples.
>>>
>>>Two opens.
>>>
>>>1)
>>>Do we need to explicitly document that drm-total-cycles and
>>>drm-maxfreq are mutually exclusive?
>>
>>so userspace has a fallback mechanism to calculate utilization depending
>>on what keys are available?
>
>No, documenting all three at once does not make sense, or at least is
>not expected. Or do you envisage someone might legitimately emit all
>three? I don't see what the semantics would be. When we have
>cycles+maxfreq the latter is in Hz. And when we have cycles+total then
>it is unitless. All three?

I don't follow what you mean here. *cycles* is actually a unit.

The engine spent 10 cycles running this context (drm-cycles). In the
same period there were 100 cycles available (drm-total-cycles). Current
frequency is X MHz. Max frequency is Y MHz. To me all of them make
sense if one wants to mix them together. For xe it doesn't make sense,
because the counter backing drm-cycles and drm-total-cycles is unrelated
to the engine frequency.
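
To make the arithmetic concrete, this is roughly what I'd expect a
gputop-like tool to do with 2 samples (illustrative only; the struct and
function names below are made up, and dividing by capacity is how I'd
expect a class with more than one engine to be normalized):

/*
 * Illustrative userspace sketch: per-class busyness from two fdinfo samples
 * of drm-cycles-<class> and drm-total-cycles-<class>.
 */
#include <stdint.h>
#include <stdio.h>

struct fdinfo_sample {
	uint64_t cycles;	/* drm-cycles-<class> */
	uint64_t total_cycles;	/* drm-total-cycles-<class> */
};

/* returns utilization in [0.0, 1.0] */
static double class_busyness(const struct fdinfo_sample *s0,
			     const struct fdinfo_sample *s1,
			     unsigned int capacity)
{
	uint64_t runtime = s1->cycles - s0->cycles;
	uint64_t elapsed = s1->total_cycles - s0->total_cycles;

	if (!elapsed || !capacity)
		return 0.0;

	return (double)runtime / elapsed / capacity;
}

int main(void)
{
	/* numbers from the example above: 10 busy out of 100 elapsed, 1 engine */
	struct fdinfo_sample s0 = { .cycles = 0, .total_cycles = 0 };
	struct fdinfo_sample s1 = { .cycles = 10, .total_cycles = 100 };

	printf("busyness: %.0f%%\n", 100.0 * class_busyness(&s0, &s1, 1));
	return 0;
}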

I can add something in the doc saying we do not expect to see all of them
together until we see a use case. Each driver may implement a subset.

>
>>>2)
>>>Should drm-total-cycles for now be documented as driver specific?
>>
>>you mean to call it xe-total-cycles?
>
>Yes but it is not an ask, just an open.

Ok, my opinion is that we shouldn't, just like we have drm-cycles today
implemented by some drivers, but not all. I'd consider drm-curfreq, which
is not documented in the drm layer, as something to be fixed or migrated
to a driver-only interface (probably not possible anymore as it'd break
the uapi). The problem I see with turning it into xe-total-cycles is that
the moment another driver decides to implement it, they will either have
to use the xe- prefix or xe will need to start publishing both keys.
As said above, I can document that it's not expected to use both total
and maxfreq, as is currently the case.
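
To make that concrete, something roughly along these lines for
drm-usage-stats.rst is what I have in mind (exact wording to be decided):

    It is not expected for both drm-maxfreq-<keystr> and
    drm-total-cycles-<keystr> to be exported by the same driver:
    utilization is derived either from drm-cycles + drm-maxfreq or from
    drm-cycles + drm-total-cycles.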

>
>>>I have added some more people in the cc who were involved with
>>>driver fdinfo implementations in case they have an opinion.
>>>
>>>I would say potentially yes, and promote it to common if more than 
>>>one driver would use it.
>>>
>>>For instance I see panfrost has the driver-specific drm-curfreq
>>>(although it isn't documented fully in panfrost.rst). And I have
>>>to say it is somewhat questionable to expose the current frequency
>>>per fdinfo per engine, but that's not my call.
>>
>>aren't all the keys in Documentation/gpu/drm-usage-stats.rst optional,
>>ones that a driver may or may not implement? When you say driver-specific
>>I'd think more of the ones not using drm- as prefix, e.g. amd-*.
>>
>>I think drm-cycles + drm-total-cycles is just an alternative
>>implementation for engine utilization. Like drm-cycles + drm-maxfreq
>>already is an alternative to drm-engine and is not implemented by e.g.
>>amdgpu/i915.
>>
>>I will submit a new version of the entire patch series to get the ball
>>rolling, but let's keep this open for now.
>>
>><...>
>>
>>>>+static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>>>+{
>>>>+    struct xe_file *xef = file->driver_priv;
>>>>+    struct xe_device *xe = xef->xe;
>>>>+    struct xe_gt *gt;
>>>>+    struct xe_hw_engine *hwe;
>>>>+    struct xe_exec_queue *q;
>>>>+    unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
>>>>+    u64 gpu_timestamp, engine_mask = 0;
>>>>+    bool gpu_stamp = false;
>>>>+
>>>>+    xe_pm_runtime_get(xe);
>>>>+
>>>>+    /* Accumulate all the exec queues from this client */
>>>>+    mutex_lock(&xef->exec_queue.lock);
>>>>+    xa_for_each(&xef->exec_queue.xa, i, q)
>>>>+        xe_exec_queue_update_runtime(q);
>>>>+    mutex_unlock(&xef->exec_queue.lock);
>>>>+
>>>>+
>>>>+    /* Calculate capacity of each engine class */
>>>>+    BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>>>+    for_each_gt(gt, xe, id_gt)
>>>>+        engine_mask |= gt->info.engine_mask;
>>>>+    for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>>>+        capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>>
>>>FWIW the above two loops are static so could store capacity in 
>>>struct xe_device.
>>
>>yes, but just creating a cache in xe of something derived from gt is not
>>something to consider lightly, particularly considering the small
>>xe->info.gt_count we have. For something that runs only when someone
>>cats the fdinfo, this doesn't seem terrible.
>>
>>>
>>>>+
>>>>+    /*
>>>>+     * Iterate over all engines, printing the accumulated
>>>>+     * runtime for this client, per engine class
>>>>+     */
>>>>+    for_each_gt(gt, xe, id_gt) {
>>>>+        xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>>>+        for_each_hw_engine(hwe, gt, id_hwe) {
>>>>+            const char *class_name;
>>>>+
>>>>+            if (!capacity[hwe->class])
>>>>+                continue;
>>>>+
>>>>+            /*
>>>>+             * Use any (first) engine to have a timestamp to be used
>>>>+             * every time
>>>>+             */
>>>>+            if (!gpu_stamp) {
>>>>+                gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>>>+                gpu_stamp = true;
>>>>+            }
>>>>+
>>>>+            class_name = xe_hw_engine_class_to_str(hwe->class);
>>>>+
>>>>+            drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>>>+                   class_name, xef->runtime[hwe->class]);
>>>>+            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>>>+                   class_name, gpu_timestamp);
>>>>+
>>>>+            if (capacity[hwe->class] > 1)
>>>>+                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>>>+                       class_name, capacity[hwe->class]);
>>>>+
>>>>+            /* engine class already handled, skip next iterations */
>>>>+            capacity[hwe->class] = 0;
>>>>+        }
>>>>+        xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>>>+    }
>>>
>>>More FWIW and AFAICT, could just walk the "list" of classes 
>>>instead of
>>
>>xe_force_wake_get() is per gt, so the alternative would be... loop
>>through the gts to get all forcewakes, loop through all engine 
>>classes, loop
>>again through all gts to put the forcewake. And we also need to consider
>>that an engine class may not be available in all GTs... example:
>>vcs/vecs in MTL and later, so we need to track it globally across GTs
>>anyway.
>
>Forcewake is only needed once for the gpu_timestamp, no? At least I 
>don't see any other potential hardware access in the loop. Hence I 
>thought if you could have a known engine to get the timestamp outside 
>the loop, you could then run a flat loop (over classes) avoiding the 
>per gt fw dance. Your choice ofc.

makes sense... I will try this and run some tests.
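
Something along these lines is what I'll try (untested sketch, just to
show the shape; having the caller pick an arbitrary engine for the
timestamp and pass it in is an assumption, and it relies on hwe->gt to
reach the right forcewake domain):

/*
 * Untested sketch: keep the exec-queue accumulation and the capacity[]
 * calculation as in the patch, but read RING_TIMESTAMP once from a single
 * engine, holding forcewake only for that read, then print with a flat
 * loop over engine classes instead of walking every gt/engine.
 */
static void show_runtime_print(struct drm_printer *p, struct xe_file *xef,
			       struct xe_hw_engine *hwe,
			       const unsigned long *capacity)
{
	u64 gpu_timestamp;
	int class;

	/* forcewake only around the single timestamp read */
	xe_force_wake_get(gt_to_fw(hwe->gt), XE_FW_GT);
	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
	xe_force_wake_put(gt_to_fw(hwe->gt), XE_FW_GT);

	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
		const char *name = xe_hw_engine_class_to_str(class);

		if (!capacity[class])
			continue;

		drm_printf(p, "drm-cycles-%s:\t%llu\n",
			   name, xef->runtime[class]);
		drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
			   name, gpu_timestamp);
		if (capacity[class] > 1)
			drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
				   name, capacity[class]);
	}
}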

thanks
Lucas De Marchi

