[Intel-gfx] [PATCH] i915/query: Correlate engine and cpu timestamps with better accuracy

Lionel Landwerlin lionel.g.landwerlin at intel.com
Thu Mar 4 09:45:47 UTC 2021


On 04/03/2021 10:58, Chris Wilson wrote:
> Quoting Lionel Landwerlin (2021-03-04 08:28:59)
>> On 04/03/2021 02:09, Chris Wilson wrote:
>>> Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00)
>>>> Perf measurements rely on CPU and engine timestamps to correlate
>>>> events of interest across these time domains. Current mechanisms get
>>>> these timestamps separately and the calculated delta between these
>>>> timestamps lack enough accuracy.
>>>>
>>>> To improve the accuracy of these time measurements to within a few us,
>>>> add a query that returns the engine and cpu timestamps captured as
>>>> close to each other as possible.
>>>>
>>>> v2: (Tvrtko)
>>>> - document clock reference used
>>>> - return cpu timestamp always
>>>> - capture cpu time just before lower dword of cs timestamp
>>>>
>>>> v3: (Chris)
>>>> - use uncore-rpm
>>>> - use __query_cs_timestamp helper
>>>>
>>>> v4: (Lionel)
>>>> - Kernel perf subsytem allows users to specify the clock id to be used
>>>>     in perf_event_open. This clock id is used by the perf subsystem to
>>>>     return the appropriate cpu timestamp in perf events. Similarly, let
>>>>     the user pass the clockid to this query so that cpu timestamp
>>>>     corresponds to the clock id requested.
>>>>
>>>> v5: (Tvrtko)
>>>> - Use normal ktime accessors instead of fast versions
>>>> - Add more uApi documentation
>>>>
>>>> v6: (Lionel)
>>>> - Move switch out of spinlock
>>>>
>>>> v7: (Chris)
>>>> - cs_timestamp is a misnomer, use cs_cycles instead
>>>> - return the cs cycle frequency as well in the query
>>>>
>>>> v8:
>>>> - Add platform and engine specific checks
>>>>
>>>> v9: (Lionel)
>>>> - Return 2 cpu timestamps in the query - captured before and after the
>>>>     register read
>>>>
>>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>> ---
>>>>    drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++
>>>>    include/uapi/drm/i915_drm.h       |  47 ++++++++++
>>>>    2 files changed, 191 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
>>>> index fed337ad7b68..acca22ee6014 100644
>>>> --- a/drivers/gpu/drm/i915/i915_query.c
>>>> +++ b/drivers/gpu/drm/i915/i915_query.c
>>>> @@ -6,6 +6,8 @@
>>>>    
>>>>    #include <linux/nospec.h>
>>>>    
>>>> +#include "gt/intel_engine_pm.h"
>>>> +#include "gt/intel_engine_user.h"
>>>>    #include "i915_drv.h"
>>>>    #include "i915_perf.h"
>>>>    #include "i915_query.h"
>>>> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
>>>>           return total_length;
>>>>    }
>>>>    
>>>> +typedef u64 (*__ktime_func_t)(void);
>>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
>>>> +{
>>>> +       /*
>>>> +        * Use logic same as the perf subsystem to allow user to select the
>>>> +        * reference clock id to be used for timestamps.
>>>> +        */
>>>> +       switch (clk_id) {
>>>> +       case CLOCK_MONOTONIC:
>>>> +               return &ktime_get_ns;
>>>> +       case CLOCK_MONOTONIC_RAW:
>>>> +               return &ktime_get_raw_ns;
>>>> +       case CLOCK_REALTIME:
>>>> +               return &ktime_get_real_ns;
>>>> +       case CLOCK_BOOTTIME:
>>>> +               return &ktime_get_boottime_ns;
>>>> +       case CLOCK_TAI:
>>>> +               return &ktime_get_clocktai_ns;
>>>> +       default:
>>>> +               return NULL;
>>>> +       }
>>>> +}
>>>> +
>>>> +static inline int
>>>> +__read_timestamps(struct intel_uncore *uncore,
>>>> +                 i915_reg_t lower_reg,
>>>> +                 i915_reg_t upper_reg,
>>>> +                 u64 *cs_ts,
>>>> +                 u64 *cpu_ts,
>>>> +                 __ktime_func_t cpu_clock)
>>>> +{
>>>> +       u32 upper, lower, old_upper, loop = 0;
>>>> +
>>>> +       upper = intel_uncore_read_fw(uncore, upper_reg);
>>>> +       do {
>>>> +               cpu_ts[0] = cpu_clock();
>>>> +               lower = intel_uncore_read_fw(uncore, lower_reg);
>>>> +               cpu_ts[1] = cpu_clock();
>>>> +               old_upper = upper;
>>>> +               upper = intel_uncore_read_fw(uncore, upper_reg);
>>> Both register reads comprise the timestamp returned to userspace, so
>>> presumably you want cpu_ts[] to wrap both.
>>>
>>>          do {
>>>                  old_upper = upper;
>>>
>>>                  cpu_ts[0] = cpu_clock();
>>>                  lower = intel_uncore_read_fw(uncore, lower_reg);
>>>                  upper = intel_uncore_read_fw(uncore, upper_reg);
>>>                  cpu_ts[1] = cpu_clock();
>>>          } while (upper != old_upper && loop++ < 2);
>> Actually if we want the best accuracy we can just deal with the lower dword.
> Accuracy of what? The lower dword read perhaps, or the accuracy of the
> sample point for the combined reads for the timestamp, which is closer
> to an external observer (cpu_clock() implies reference to an external
> observer).
>
> The two clock samples are not even necessarily closely related due to the
> nmi adjustments. If you wanted an unadjusted elapsed time for the read
> you can use local_clock() then return the chosen cpu_clock() before plus
> the elapsed delta from around the read as the estimated error.
>
> cpu_ts[1] = local_clock();
> cpu_ts[0] = cpu_clock();
> lower = intel_uncore_read_fw(uncore, lower_reg);
> cpu_ts[1] = local_clock() - cpu_ts[1];
> -Chris

Thanks,


I meant the accuracy of having 2 samples GPU/CPU as close as possible.

Avoiding to account another register read in there is nice.


My testing was also mostly done with CLOCK_MONOTONIC_RAW which doesn't 
seem to be adjusted like CLOCK_MONOTONIC so maybe that why I didn't see 
the issue.


-Lionel



More information about the Intel-gfx mailing list