[Intel-gfx] [PATCH] i915/query: Correlate engine and cpu timestamps with better accuracy
Lionel Landwerlin
lionel.g.landwerlin at intel.com
Thu Mar 4 09:45:47 UTC 2021
On 04/03/2021 10:58, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2021-03-04 08:28:59)
>> On 04/03/2021 02:09, Chris Wilson wrote:
>>> Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00)
>>>> Perf measurements rely on CPU and engine timestamps to correlate
>>>> events of interest across these time domains. Current mechanisms get
>>>> these timestamps separately and the calculated delta between these
>>>> timestamps lack enough accuracy.
>>>>
>>>> To improve the accuracy of these time measurements to within a few us,
>>>> add a query that returns the engine and cpu timestamps captured as
>>>> close to each other as possible.
>>>>
>>>> v2: (Tvrtko)
>>>> - document clock reference used
>>>> - return cpu timestamp always
>>>> - capture cpu time just before lower dword of cs timestamp
>>>>
>>>> v3: (Chris)
>>>> - use uncore-rpm
>>>> - use __query_cs_timestamp helper
>>>>
>>>> v4: (Lionel)
>>>> - Kernel perf subsytem allows users to specify the clock id to be used
>>>> in perf_event_open. This clock id is used by the perf subsystem to
>>>> return the appropriate cpu timestamp in perf events. Similarly, let
>>>> the user pass the clockid to this query so that cpu timestamp
>>>> corresponds to the clock id requested.
>>>>
>>>> v5: (Tvrtko)
>>>> - Use normal ktime accessors instead of fast versions
>>>> - Add more uApi documentation
>>>>
>>>> v6: (Lionel)
>>>> - Move switch out of spinlock
>>>>
>>>> v7: (Chris)
>>>> - cs_timestamp is a misnomer, use cs_cycles instead
>>>> - return the cs cycle frequency as well in the query
>>>>
>>>> v8:
>>>> - Add platform and engine specific checks
>>>>
>>>> v9: (Lionel)
>>>> - Return 2 cpu timestamps in the query - captured before and after the
>>>> register read
>>>>
>>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++
>>>> include/uapi/drm/i915_drm.h | 47 ++++++++++
>>>> 2 files changed, 191 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
>>>> index fed337ad7b68..acca22ee6014 100644
>>>> --- a/drivers/gpu/drm/i915/i915_query.c
>>>> +++ b/drivers/gpu/drm/i915/i915_query.c
>>>> @@ -6,6 +6,8 @@
>>>>
>>>> #include <linux/nospec.h>
>>>>
>>>> +#include "gt/intel_engine_pm.h"
>>>> +#include "gt/intel_engine_user.h"
>>>> #include "i915_drv.h"
>>>> #include "i915_perf.h"
>>>> #include "i915_query.h"
>>>> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
>>>> return total_length;
>>>> }
>>>>
>>>> +typedef u64 (*__ktime_func_t)(void);
>>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
>>>> +{
>>>> + /*
>>>> + * Use logic same as the perf subsystem to allow user to select the
>>>> + * reference clock id to be used for timestamps.
>>>> + */
>>>> + switch (clk_id) {
>>>> + case CLOCK_MONOTONIC:
>>>> + return &ktime_get_ns;
>>>> + case CLOCK_MONOTONIC_RAW:
>>>> + return &ktime_get_raw_ns;
>>>> + case CLOCK_REALTIME:
>>>> + return &ktime_get_real_ns;
>>>> + case CLOCK_BOOTTIME:
>>>> + return &ktime_get_boottime_ns;
>>>> + case CLOCK_TAI:
>>>> + return &ktime_get_clocktai_ns;
>>>> + default:
>>>> + return NULL;
>>>> + }
>>>> +}
>>>> +
>>>> +static inline int
>>>> +__read_timestamps(struct intel_uncore *uncore,
>>>> + i915_reg_t lower_reg,
>>>> + i915_reg_t upper_reg,
>>>> + u64 *cs_ts,
>>>> + u64 *cpu_ts,
>>>> + __ktime_func_t cpu_clock)
>>>> +{
>>>> + u32 upper, lower, old_upper, loop = 0;
>>>> +
>>>> + upper = intel_uncore_read_fw(uncore, upper_reg);
>>>> + do {
>>>> + cpu_ts[0] = cpu_clock();
>>>> + lower = intel_uncore_read_fw(uncore, lower_reg);
>>>> + cpu_ts[1] = cpu_clock();
>>>> + old_upper = upper;
>>>> + upper = intel_uncore_read_fw(uncore, upper_reg);
>>> Both register reads comprise the timestamp returned to userspace, so
>>> presumably you want cpu_ts[] to wrap both.
>>>
>>> do {
>>> old_upper = upper;
>>>
>>> cpu_ts[0] = cpu_clock();
>>> lower = intel_uncore_read_fw(uncore, lower_reg);
>>> upper = intel_uncore_read_fw(uncore, upper_reg);
>>> cpu_ts[1] = cpu_clock();
>>> } while (upper != old_upper && loop++ < 2);
>> Actually if we want the best accuracy we can just deal with the lower dword.
> Accuracy of what? The lower dword read perhaps, or the accuracy of the
> sample point for the combined reads for the timestamp, which is closer
> to an external observer (cpu_clock() implies reference to an external
> observer).
>
> The two clock samples are not even necessarily closely related due to the
> nmi adjustments. If you wanted an unadjusted elapsed time for the read
> you can use local_clock() then return the chosen cpu_clock() before plus
> the elapsed delta from around the read as the estimated error.
>
> cpu_ts[1] = local_clock();
> cpu_ts[0] = cpu_clock();
> lower = intel_uncore_read_fw(uncore, lower_reg);
> cpu_ts[1] = local_clock() - cpu_ts[1];
> -Chris
Thanks,
I meant the accuracy of having 2 samples GPU/CPU as close as possible.
Avoiding to account another register read in there is nice.
My testing was also mostly done with CLOCK_MONOTONIC_RAW which doesn't
seem to be adjusted like CLOCK_MONOTONIC so maybe that why I didn't see
the issue.
-Lionel
More information about the Intel-gfx
mailing list