[Intel-gfx] [PATCH] i915/query: Correlate engine and cpu timestamps with better accuracy

Lionel Landwerlin lionel.g.landwerlin at intel.com
Wed Mar 3 16:28:22 UTC 2021


On 03/03/2021 18:27, Umesh Nerlige Ramappa wrote:
> On Wed, Mar 03, 2021 at 11:21:39AM +0200, Lionel Landwerlin wrote:
>> On 03/03/2021 02:12, Umesh Nerlige Ramappa wrote:
>>> On Tue, Mar 02, 2021 at 10:35:19PM +0200, Lionel Landwerlin wrote:
>>>> Thanks a bunch for sharing this!
>>>>
>>>> On 02/03/2021 20:29, Umesh Nerlige Ramappa wrote:
>>>>> Perf measurements rely on CPU and engine timestamps to correlate
>>>>> events of interest across these time domains. Current mechanisms get
>>>>> these timestamps separately and the calculated delta between them
>>>>> lacks sufficient accuracy.
>>>>>
>>>>> To improve the accuracy of these time measurements to within a few
>>>>> us, add a query that returns the engine and cpu timestamps captured
>>>>> as close to each other as possible.
>>>>>
>>>>> v2: (Tvrtko)
>>>>> - document clock reference used
>>>>> - return cpu timestamp always
>>>>> - capture cpu time just before lower dword of cs timestamp
>>>>>
>>>>> v3: (Chris)
>>>>> - use uncore-rpm
>>>>> - use __query_cs_timestamp helper
>>>>>
>>>>> v4: (Lionel)
>>>>> - Kernel perf subsystem allows users to specify the clock id to be
>>>>>   used in perf_event_open. This clock id is used by the perf
>>>>>   subsystem to return the appropriate cpu timestamp in perf events.
>>>>>   Similarly, let the user pass the clockid to this query so that the
>>>>>   cpu timestamp corresponds to the clock id requested.
>>>>>
>>>>> v5: (Tvrtko)
>>>>> - Use normal ktime accessors instead of fast versions
>>>>> - Add more uApi documentation
>>>>>
>>>>> v6: (Lionel)
>>>>> - Move switch out of spinlock
>>>>>
>>>>> v7: (Chris)
>>>>> - cs_timestamp is a misnomer, use cs_cycles instead
>>>>> - return the cs cycle frequency as well in the query
>>>>>
>>>>> v8:
>>>>> - Add platform and engine specific checks
>>>>>
>>>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/i915/i915_query.c | 140 ++++++++++++++++++++++++++++++
>>>>>  include/uapi/drm/i915_drm.h       |  43 +++++++++
>>>>>  2 files changed, 183 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c
>>>>> index fed337ad7b68..763f0f918065 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_query.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_query.c
>>>>> @@ -6,6 +6,8 @@
>>>>>  #include <linux/nospec.h>
>>>>> +#include "gt/intel_engine_pm.h"
>>>>> +#include "gt/intel_engine_user.h"
>>>>>  #include "i915_drv.h"
>>>>>  #include "i915_perf.h"
>>>>>  #include "i915_query.h"
>>>>> @@ -90,6 +92,143 @@ static int query_topology_info(struct drm_i915_private *dev_priv,
>>>>>      return total_length;
>>>>>  }
>>>>> +typedef u64 (*__ktime_func_t)(void);
>>>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
>>>>> +{
>>>>> +    /*
>>>>> +     * Use the same logic as the perf subsystem to allow the user to
>>>>> +     * select the reference clock id to be used for timestamps.
>>>>> +     */
>>>>> +    switch (clk_id) {
>>>>> +    case CLOCK_MONOTONIC:
>>>>> +        return &ktime_get_ns;
>>>>> +    case CLOCK_MONOTONIC_RAW:
>>>>> +        return &ktime_get_raw_ns;
>>>>> +    case CLOCK_REALTIME:
>>>>> +        return &ktime_get_real_ns;
>>>>> +    case CLOCK_BOOTTIME:
>>>>> +        return &ktime_get_boottime_ns;
>>>>> +    case CLOCK_TAI:
>>>>> +        return &ktime_get_clocktai_ns;
>>>>> +    default:
>>>>> +        return NULL;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static inline int
>>>>> +__read_timestamps(struct intel_uncore *uncore,
>>>>> +          i915_reg_t lower_reg,
>>>>> +          i915_reg_t upper_reg,
>>>>> +          u64 *cs_ts,
>>>>> +          u64 *cpu_ts,
>>>>> +          __ktime_func_t cpu_clock)
>>>>> +{
>>>>> +    u32 upper, lower, old_upper, loop = 0;
>>>>> +
>>>>> +    upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>> +    do {
>>>>> +        *cpu_ts = cpu_clock();
>>>>> +        lower = intel_uncore_read_fw(uncore, lower_reg);
>>>>> +        old_upper = upper;
>>>>> +        upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>> +    } while (upper != old_upper && loop++ < 2);
>>>>
>>>>
>>>> With the 2 cpu timestamp change I mentioned below, this would be:
>>>>
>>>> upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>
>>>> do {
>>>>     old_upper = upper;
>>>>     *cpu_ts0 = cpu_clock();
>>>>     lower = intel_uncore_read_fw(uncore, lower_reg);
>>>>     *cpu_ts1 = cpu_clock();
>>>>     upper = intel_uncore_read_fw(uncore, upper_reg);
>>>> } while (upper != old_upper && loop++ < 2);
>>>>
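To make the retry logic concrete, here's a user-space sketch of the same two-dword read with bracketing CPU timestamps. The fake_* accessors below are stand-ins for the MMIO reads and the ktime accessor, not real i915 code; the fake counter ticks on every lower-dword read so a wraparound can be provoked on demand:

```c
#include <stdint.h>

/* Simulated hardware state: a 64-bit counter exposed as two 32-bit
 * dwords, and a CPU clock that advances on every read. */
static uint64_t fake_counter;
static uint64_t fake_clock_ns;

static uint32_t read_lower(void)
{
	uint32_t lo = (uint32_t)fake_counter;

	fake_counter++;		/* counter ticks during the read */
	return lo;
}

static uint32_t read_upper(void)
{
	return (uint32_t)(fake_counter >> 32);
}

static uint64_t cpu_clock(void)
{
	return fake_clock_ns += 100;
}

/* Two-dword counter read, retried (at most twice) when the upper dword
 * changes underneath us, with CPU timestamps bracketing the lower read. */
static void read_timestamps(uint64_t *cs_ts, uint64_t *cpu_ts0,
			    uint64_t *cpu_ts1)
{
	uint32_t upper, lower, old_upper;
	int loop = 0;

	upper = read_upper();
	do {
		old_upper = upper;
		*cpu_ts0 = cpu_clock();
		lower = read_lower();
		*cpu_ts1 = cpu_clock();
		upper = read_upper();
	} while (upper != old_upper && loop++ < 2);

	*cs_ts = (uint64_t)upper << 32 | lower;
}
```

Starting the counter at 0xFFFFFFFF forces a wrap between the lower and upper reads, and the retry still yields a consistent 64-bit value.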
>>>>
>>>>> +
>>>>> +    *cs_ts = (u64)upper << 32 | lower;
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +__query_cs_cycles(struct intel_engine_cs *engine,
>>>>> +          u64 *cs_ts, u64 *cpu_ts,
>>>>> +          __ktime_func_t cpu_clock)
>>>>> +{
>>>>> +    struct intel_uncore *uncore = engine->uncore;
>>>>> +    enum forcewake_domains fw_domains;
>>>>> +    u32 base = engine->mmio_base;
>>>>> +    intel_wakeref_t wakeref;
>>>>> +    int ret;
>>>>> +
>>>>> +    fw_domains = intel_uncore_forcewake_for_reg(uncore,
>>>>> +                            RING_TIMESTAMP(base),
>>>>> +                            FW_REG_READ);
>>>>> +
>>>>> +    with_intel_runtime_pm(uncore->rpm, wakeref) {
>>>>> +        spin_lock_irq(&uncore->lock);
>>>>> +        intel_uncore_forcewake_get__locked(uncore, fw_domains);
>>>>> +
>>>>> +        ret = __read_timestamps(uncore,
>>>>> +                    RING_TIMESTAMP(base),
>>>>> +                    RING_TIMESTAMP_UDW(base),
>>>>> +                    cs_ts,
>>>>> +                    cpu_ts,
>>>>> +                    cpu_clock);
>>>>> +
>>>>> +        intel_uncore_forcewake_put__locked(uncore, fw_domains);
>>>>> +        spin_unlock_irq(&uncore->lock);
>>>>> +    }
>>>>> +
>>>>> +    return ret;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +query_cs_cycles(struct drm_i915_private *i915,
>>>>> +        struct drm_i915_query_item *query_item)
>>>>> +{
>>>>> +    struct drm_i915_query_cs_cycles __user *query_ptr;
>>>>> +    struct drm_i915_query_cs_cycles query;
>>>>> +    struct intel_engine_cs *engine;
>>>>> +    __ktime_func_t cpu_clock;
>>>>> +    int ret;
>>>>> +
>>>>> +    if (INTEL_GEN(i915) < 6)
>>>>> +        return -ENODEV;
>>>>> +
>>>>> +    query_ptr = u64_to_user_ptr(query_item->data_ptr);
>>>>> +    ret = copy_query_item(&query, sizeof(query), sizeof(query), query_item);
>>>>> +    if (ret != 0)
>>>>> +        return ret;
>>>>> +
>>>>> +    if (query.flags)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (query.rsvd)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    cpu_clock = __clock_id_to_func(query.clockid);
>>>>> +    if (!cpu_clock)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    engine = intel_engine_lookup_user(i915,
>>>>> +                      query.engine.engine_class,
>>>>> +                      query.engine.engine_instance);
>>>>> +    if (!engine)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (IS_GEN(i915, 6) &&
>>>>> +        query.engine.engine_class != I915_ENGINE_CLASS_RENDER)
>>>>> +        return -ENODEV;
>>>>> +
>>>>> +    query.cs_frequency = engine->gt->clock_frequency;
>>>>> +    ret = __query_cs_cycles(engine,
>>>>> +                &query.cs_cycles,
>>>>> +                &query.cpu_timestamp,
>>>>> +                cpu_clock);
>>>>> +    if (ret)
>>>>> +        return ret;
>>>>> +
>>>>> +    if (put_user(query.cs_frequency, &query_ptr->cs_frequency))
>>>>> +        return -EFAULT;
>>>>> +
>>>>> +    if (put_user(query.cpu_timestamp, &query_ptr->cpu_timestamp))
>>>>> +        return -EFAULT;
>>>>> +
>>>>> +    if (put_user(query.cs_cycles, &query_ptr->cs_cycles))
>>>>> +        return -EFAULT;
>>>>> +
>>>>> +    return sizeof(query);
>>>>> +}
>>>>> +
>>>>>  static int
>>>>>  query_engine_info(struct drm_i915_private *i915,
>>>>>            struct drm_i915_query_item *query_item)
>>>>> @@ -424,6 +563,7 @@ static int (* const i915_query_funcs[])(struct drm_i915_private *dev_priv,
>>>>>      query_topology_info,
>>>>>      query_engine_info,
>>>>>      query_perf_config,
>>>>> +    query_cs_cycles,
>>>>>  };
>>>>>  int i915_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
>>>>> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
>>>>> index 1987e2ea79a3..379ae6e7aeb0 100644
>>>>> --- a/include/uapi/drm/i915_drm.h
>>>>> +++ b/include/uapi/drm/i915_drm.h
>>>>> @@ -2176,6 +2176,10 @@ struct drm_i915_query_item {
>>>>>  #define DRM_I915_QUERY_TOPOLOGY_INFO    1
>>>>>  #define DRM_I915_QUERY_ENGINE_INFO    2
>>>>>  #define DRM_I915_QUERY_PERF_CONFIG      3
>>>>> +    /**
>>>>> +     * Query Command Streamer timestamp register.
>>>>> +     */
>>>>> +#define DRM_I915_QUERY_CS_CYCLES    4
>>>>>  /* Must be kept compact -- no holes and well documented */
>>>>>      /*
>>>>> @@ -2309,6 +2313,45 @@ struct drm_i915_engine_info {
>>>>>      __u64 rsvd1[4];
>>>>>  };
>>>>> +/**
>>>>> + * struct drm_i915_query_cs_cycles
>>>>> + *
>>>>> + * The query returns the command streamer cycles and the frequency
>>>>> + * that can be used to calculate the command streamer timestamp. In
>>>>> + * addition the query returns the cpu timestamp that indicates when
>>>>> + * the command streamer cycle count was captured.
>>>>> + */
>>>>> +struct drm_i915_query_cs_cycles {
>>>>> +    /** Engine for which command streamer cycles is queried. */
>>>>> +    struct i915_engine_class_instance engine;
>>>>> +
>>>>> +    /** Must be zero. */
>>>>> +    __u32 flags;
>>>>> +
>>>>> +    /**
>>>>> +     * Command streamer cycles as read from the command streamer
>>>>> +     * register at 0x358 offset.
>>>>> +     */
>>>>> +    __u64 cs_cycles;
>>>>> +
>>>>> +    /** Frequency of the cs cycles in Hz. */
>>>>> +    __u64 cs_frequency;
>>>>> +
>>>>> +    /** CPU timestamp in nanoseconds. */
>>>>> +    __u64 cpu_timestamp;
>>>>
>>>>
>>>> Would it be possible to have : u64 cpu_timestamps[2];
>>>>
>>>> with cpu_timestamps[0] taken before & cpu_timestamps[1] taken after 
>>>> the cs_cycles, so we can have an idea of how long the read takes.
>>>
>>> Possible, but I thought multiple queries would indirectly provide
>>> such information. If query1 returns cpu1 and cs1 times and query2
>>> returns cpu2 and cs2 times, then assuming neither overflowed,
>>>
>>> |(cpu2 - cpu1) - (cs2 - cs1)|
>>>
>>> (with the cs cycle delta converted to ns) should be the worst case
>>> time taken to read the register (essentially delta_delta in the IGT
>>> test). Thoughts?
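For what it's worth, that delta_delta arithmetic written out as code (numbers in the comments are illustrative, not measurements; 12 MHz picked as a typical cs timestamp frequency):

```c
#include <stdint.h>
#include <stdlib.h>

/* Worst-case register read latency estimated from two query samples:
 * |(cpu2 - cpu1) - (cs2 - cs1)|, with the cs cycle delta converted to
 * ns using the frequency reported by the query. */
static int64_t delta_delta_ns(uint64_t cpu1, uint64_t cs1,
			      uint64_t cpu2, uint64_t cs2,
			      uint64_t cs_frequency_hz)
{
	int64_t cpu_delta = (int64_t)(cpu2 - cpu1);
	int64_t cs_delta_ns = (int64_t)((cs2 - cs1) * 1000000000ull /
					cs_frequency_hz);

	return llabs(cpu_delta - cs_delta_ns);
}
```

For example, 12000 cycles at 12 MHz is 1ms of engine time; if the cpu delta between the two queries came out at 1.025ms, the slop attributable to the reads is 25us.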
>>
>>
>> Going through 2 syscalls introduces a delay.
>>
>> I did some measurements and it appears to be on the order of 20~30us.
>>
>
> Have you tried multiple query items in the same call to the query 
> ioctl?  Does that make any difference?


Yeah, it was the average :/


-Lionel


>
>>
>> Whereas the 2 cpu timestamp capture with a single mmio read in
>> between should take below 2us.
>>
>> We're hoping to go as precise as possible with this :)
>
> I see. I will post an update.
>
> Can you also share how you intend to use the query result with 2 cpu 
> timestamps? I want to add that to the IGT.
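Roughly, the idea would be to take the midpoint of the two bracketing CPU timestamps as the CPU time at which the cycle count was sampled, and anchor the cross-domain mapping there. A sketch, against a hypothetical two-timestamp layout of the query struct (field names illustrative, not the final uAPI):

```c
#include <stdint.h>

/* Hypothetical variant of drm_i915_query_cs_cycles carrying both
 * bracketing CPU timestamps. */
struct cs_cycles_sample {
	uint64_t cs_cycles;		/* engine timestamp register value */
	uint64_t cs_frequency;		/* in Hz */
	uint64_t cpu_timestamp[2];	/* ns, before/after the lower-dword read */
};

/* Map a later engine cycle value onto the CPU clock, anchored at the
 * midpoint of the two CPU timestamps. */
static uint64_t cs_cycles_to_cpu_ns(const struct cs_cycles_sample *s,
				    uint64_t cs_cycles)
{
	uint64_t cpu_anchor = (s->cpu_timestamp[0] + s->cpu_timestamp[1]) / 2;
	uint64_t cs_delta_ns = (cs_cycles - s->cs_cycles) * 1000000000ull /
			       s->cs_frequency;

	return cpu_anchor + cs_delta_ns;
}
```

The spread between the two CPU timestamps also gives the tool a per-sample error bound, which is exactly what the IGT test could assert on.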
>
> Thanks,
> Umesh
>
>>
>>
>> -Lionel
>>
>>
>>>
>>> Thanks,
>>> Umesh
>>>
>>>>
>>>>
>>>>> +
>>>>> +    /**
>>>>> +     * Reference clock id for CPU timestamp. For definition, see
>>>>> +     * clock_gettime(2) and perf_event_open(2). Supported clock ids
>>>>> +     * are CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME,
>>>>> +     * CLOCK_BOOTTIME and CLOCK_TAI.
>>>>> +     */
>>>>> +    __s32 clockid;
>>>>> +
>>>>> +    /** Must be zero. */
>>>>> +    __u32 rsvd;
>>>>> +};
>>>>> +
>>>>>  /**
>>>>>   * struct drm_i915_query_engine_info
>>>>>   *
>>>>
>>>>
>>


