[Intel-gfx] [PATCH] i915/query: Correlate engine and cpu timestamps with better accuracy
Lionel Landwerlin
lionel.g.landwerlin at intel.com
Wed Mar 3 16:28:22 UTC 2021
On 03/03/2021 18:27, Umesh Nerlige Ramappa wrote:
> On Wed, Mar 03, 2021 at 11:21:39AM +0200, Lionel Landwerlin wrote:
>> On 03/03/2021 02:12, Umesh Nerlige Ramappa wrote:
>>> On Tue, Mar 02, 2021 at 10:35:19PM +0200, Lionel Landwerlin wrote:
>>>> Thanks a bunch for sharing this!
>>>>
>>>> On 02/03/2021 20:29, Umesh Nerlige Ramappa wrote:
>>>>> Perf measurements rely on CPU and engine timestamps to correlate
>>>>> events of interest across these time domains. Current mechanisms get
>>>>> these timestamps separately and the calculated delta between these
>>>>> timestamps lack enough accuracy.
>>>>>
>>>>> To improve the accuracy of these time measurements to within a few
>>>>> us,
>>>>> add a query that returns the engine and cpu timestamps captured as
>>>>> close to each other as possible.
>>>>>
>>>>> v2: (Tvrtko)
>>>>> - document clock reference used
>>>>> - return cpu timestamp always
>>>>> - capture cpu time just before lower dword of cs timestamp
>>>>>
>>>>> v3: (Chris)
>>>>> - use uncore-rpm
>>>>> - use __query_cs_timestamp helper
>>>>>
>>>>> v4: (Lionel)
>>>>> - Kernel perf subsytem allows users to specify the clock id to be
>>>>> used
>>>>> in perf_event_open. This clock id is used by the perf subsystem to
>>>>> return the appropriate cpu timestamp in perf events. Similarly, let
>>>>> the user pass the clockid to this query so that cpu timestamp
>>>>> corresponds to the clock id requested.
>>>>>
>>>>> v5: (Tvrtko)
>>>>> - Use normal ktime accessors instead of fast versions
>>>>> - Add more uApi documentation
>>>>>
>>>>> v6: (Lionel)
>>>>> - Move switch out of spinlock
>>>>>
>>>>> v7: (Chris)
>>>>> - cs_timestamp is a misnomer, use cs_cycles instead
>>>>> - return the cs cycle frequency as well in the query
>>>>>
>>>>> v8:
>>>>> - Add platform and engine specific checks
>>>>>
>>>>> Signed-off-by: Umesh Nerlige Ramappa
>>>>> <umesh.nerlige.ramappa at intel.com>
>>>>> ---
>>>>> drivers/gpu/drm/i915/i915_query.c | 140
>>>>> ++++++++++++++++++++++++++++++
>>>>> include/uapi/drm/i915_drm.h | 43 +++++++++
>>>>> 2 files changed, 183 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_query.c
>>>>> b/drivers/gpu/drm/i915/i915_query.c
>>>>> index fed337ad7b68..763f0f918065 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_query.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_query.c
>>>>> @@ -6,6 +6,8 @@
>>>>> #include <linux/nospec.h>
>>>>> +#include "gt/intel_engine_pm.h"
>>>>> +#include "gt/intel_engine_user.h"
>>>>> #include "i915_drv.h"
>>>>> #include "i915_perf.h"
>>>>> #include "i915_query.h"
>>>>> @@ -90,6 +92,143 @@ static int query_topology_info(struct
>>>>> drm_i915_private *dev_priv,
>>>>> return total_length;
>>>>> }
>>>>> +typedef u64 (*__ktime_func_t)(void);
>>>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id)
>>>>> +{
>>>>> + /*
>>>>> + * Use logic same as the perf subsystem to allow user to
>>>>> select the
>>>>> + * reference clock id to be used for timestamps.
>>>>> + */
>>>>> + switch (clk_id) {
>>>>> + case CLOCK_MONOTONIC:
>>>>> + return &ktime_get_ns;
>>>>> + case CLOCK_MONOTONIC_RAW:
>>>>> + return &ktime_get_raw_ns;
>>>>> + case CLOCK_REALTIME:
>>>>> + return &ktime_get_real_ns;
>>>>> + case CLOCK_BOOTTIME:
>>>>> + return &ktime_get_boottime_ns;
>>>>> + case CLOCK_TAI:
>>>>> + return &ktime_get_clocktai_ns;
>>>>> + default:
>>>>> + return NULL;
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static inline int
>>>>> +__read_timestamps(struct intel_uncore *uncore,
>>>>> + i915_reg_t lower_reg,
>>>>> + i915_reg_t upper_reg,
>>>>> + u64 *cs_ts,
>>>>> + u64 *cpu_ts,
>>>>> + __ktime_func_t cpu_clock)
>>>>> +{
>>>>> + u32 upper, lower, old_upper, loop = 0;
>>>>> +
>>>>> + upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>> + do {
>>>>> + *cpu_ts = cpu_clock();
>>>>> + lower = intel_uncore_read_fw(uncore, lower_reg);
>>>>> + old_upper = upper;
>>>>> + upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>> + } while (upper != old_upper && loop++ < 2);
>>>>
>>>>
>>>> With the 2 cpu timestamps things I mentioned below, this would be
>>>>
>>>>
>>>> do {
>>>>
>>>> *cpu_ts0 = cpu_clock();
>>>>
>>>> lower = intel_uncore_read_fw(uncore, lower_reg);
>>>>
>>>> *cpu_ts1 = cpu_clock();
>>>>
>>>> upper = intel_uncore_read_fw(uncore, upper_reg);
>>>>
>>>> } while (upper != old_upper && loop++ < 2);
>>>>
>>>>
>>>>> +
>>>>> + *cs_ts = (u64)upper << 32 | lower;
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +__query_cs_cycles(struct intel_engine_cs *engine,
>>>>> + u64 *cs_ts, u64 *cpu_ts,
>>>>> + __ktime_func_t cpu_clock)
>>>>> +{
>>>>> + struct intel_uncore *uncore = engine->uncore;
>>>>> + enum forcewake_domains fw_domains;
>>>>> + u32 base = engine->mmio_base;
>>>>> + intel_wakeref_t wakeref;
>>>>> + int ret;
>>>>> +
>>>>> + fw_domains = intel_uncore_forcewake_for_reg(uncore,
>>>>> + RING_TIMESTAMP(base),
>>>>> + FW_REG_READ);
>>>>> +
>>>>> + with_intel_runtime_pm(uncore->rpm, wakeref) {
>>>>> + spin_lock_irq(&uncore->lock);
>>>>> + intel_uncore_forcewake_get__locked(uncore, fw_domains);
>>>>> +
>>>>> + ret = __read_timestamps(uncore,
>>>>> + RING_TIMESTAMP(base),
>>>>> + RING_TIMESTAMP_UDW(base),
>>>>> + cs_ts,
>>>>> + cpu_ts,
>>>>> + cpu_clock);
>>>>> +
>>>>> + intel_uncore_forcewake_put__locked(uncore, fw_domains);
>>>>> + spin_unlock_irq(&uncore->lock);
>>>>> + }
>>>>> +
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +query_cs_cycles(struct drm_i915_private *i915,
>>>>> + struct drm_i915_query_item *query_item)
>>>>> +{
>>>>> + struct drm_i915_query_cs_cycles __user *query_ptr;
>>>>> + struct drm_i915_query_cs_cycles query;
>>>>> + struct intel_engine_cs *engine;
>>>>> + __ktime_func_t cpu_clock;
>>>>> + int ret;
>>>>> +
>>>>> + if (INTEL_GEN(i915) < 6)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + query_ptr = u64_to_user_ptr(query_item->data_ptr);
>>>>> + ret = copy_query_item(&query, sizeof(query), sizeof(query),
>>>>> query_item);
>>>>> + if (ret != 0)
>>>>> + return ret;
>>>>> +
>>>>> + if (query.flags)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (query.rsvd)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + cpu_clock = __clock_id_to_func(query.clockid);
>>>>> + if (!cpu_clock)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + engine = intel_engine_lookup_user(i915,
>>>>> + query.engine.engine_class,
>>>>> + query.engine.engine_instance);
>>>>> + if (!engine)
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (IS_GEN(i915, 6) &&
>>>>> + query.engine.engine_class != I915_ENGINE_CLASS_RENDER)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + query.cs_frequency = engine->gt->clock_frequency;
>>>>> + ret = __query_cs_cycles(engine,
>>>>> + &query.cs_cycles,
>>>>> + &query.cpu_timestamp,
>>>>> + cpu_clock);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> + if (put_user(query.cs_frequency, &query_ptr->cs_frequency))
>>>>> + return -EFAULT;
>>>>> +
>>>>> + if (put_user(query.cpu_timestamp, &query_ptr->cpu_timestamp))
>>>>> + return -EFAULT;
>>>>> +
>>>>> + if (put_user(query.cs_cycles, &query_ptr->cs_cycles))
>>>>> + return -EFAULT;
>>>>> +
>>>>> + return sizeof(query);
>>>>> +}
>>>>> +
>>>>> static int
>>>>> query_engine_info(struct drm_i915_private *i915,
>>>>> struct drm_i915_query_item *query_item)
>>>>> @@ -424,6 +563,7 @@ static int (* const i915_query_funcs[])(struct
>>>>> drm_i915_private *dev_priv,
>>>>> query_topology_info,
>>>>> query_engine_info,
>>>>> query_perf_config,
>>>>> + query_cs_cycles,
>>>>> };
>>>>> int i915_query_ioctl(struct drm_device *dev, void *data, struct
>>>>> drm_file *file)
>>>>> diff --git a/include/uapi/drm/i915_drm.h
>>>>> b/include/uapi/drm/i915_drm.h
>>>>> index 1987e2ea79a3..379ae6e7aeb0 100644
>>>>> --- a/include/uapi/drm/i915_drm.h
>>>>> +++ b/include/uapi/drm/i915_drm.h
>>>>> @@ -2176,6 +2176,10 @@ struct drm_i915_query_item {
>>>>> #define DRM_I915_QUERY_TOPOLOGY_INFO 1
>>>>> #define DRM_I915_QUERY_ENGINE_INFO 2
>>>>> #define DRM_I915_QUERY_PERF_CONFIG 3
>>>>> + /**
>>>>> + * Query Command Streamer timestamp register.
>>>>> + */
>>>>> +#define DRM_I915_QUERY_CS_CYCLES 4
>>>>> /* Must be kept compact -- no holes and well documented */
>>>>> /*
>>>>> @@ -2309,6 +2313,45 @@ struct drm_i915_engine_info {
>>>>> __u64 rsvd1[4];
>>>>> };
>>>>> +/**
>>>>> + * struct drm_i915_query_cs_cycles
>>>>> + *
>>>>> + * The query returns the command streamer cycles and the
>>>>> frequency that can be
>>>>> + * used to calculate the command streamer timestamp. In addition
>>>>> the query
>>>>> + * returns the cpu timestamp that indicates when the command
>>>>> streamer cycle
>>>>> + * count was captured.
>>>>> + */
>>>>> +struct drm_i915_query_cs_cycles {
>>>>> + /** Engine for which command streamer cycles is queried. */
>>>>> + struct i915_engine_class_instance engine;
>>>>> +
>>>>> + /** Must be zero. */
>>>>> + __u32 flags;
>>>>> +
>>>>> + /**
>>>>> + * Command streamer cycles as read from the command streamer
>>>>> + * register at 0x358 offset.
>>>>> + */
>>>>> + __u64 cs_cycles;
>>>>> +
>>>>> + /** Frequency of the cs cycles in Hz. */
>>>>> + __u64 cs_frequency;
>>>>> +
>>>>> + /** CPU timestamp in nanoseconds. */
>>>>> + __u64 cpu_timestamp;
>>>>
>>>>
>>>> Would it be possible to have : u64 cpu_timestamps[2];
>>>>
>>>> with cpu_timestamps[0] taken before & cpu_timestamps[1] taken after
>>>> the cs_cycles, so we can have an idea of how long the read takes.
>>>
>>> Possible, but I thought multiple queries would indirectly provide
>>> such information. If query1 returns cpu1 and cs1 time and query2
>>> returns cpu2 and cs2 times. Assuming neither overflowed,
>>>
>>> |((cpu2 - cpu1) - (cs1 - cs2))|
>>>
>>> should be the worst case time taken to read the register
>>> (essentially delta_delta in the IGT test). Thoughts?
>>
>>
>> Going through 2 syscalls introduces a delay.
>>
>> I did some measurements and it appears to be in the orders of 20~30us.
>>
>
> Have you tried multiple query items in the same call to the query
> ioctl? Does that make any difference?
Yeah, it was the average :/
-Lionel
>
>>
>> While doing the 2 cpu timestamp capture with a single mmio read in
>> between should be below 2us.
>>
>> We're hoping to go as precise as possible with this :)
>
> I see. I will post an update.
>
> Can you also share how you intend to use the query result with 2 cpu
> timestamps? I want to add that to the IGT.
>
> Thanks,
> Umesh
>
>>
>>
>> -Lionel
>>
>>
>>>
>>> Thanks,
>>> Umesh
>>>
>>>>
>>>>
>>>>> +
>>>>> + /**
>>>>> + * Reference clock id for CPU timestamp. For definition, see
>>>>> + * clock_gettime(2) and perf_event_open(2). Supported clock
>>>>> ids are
>>>>> + * CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME,
>>>>> CLOCK_BOOTTIME,
>>>>> + * CLOCK_TAI.
>>>>> + */
>>>>> + __s32 clockid;
>>>>> +
>>>>> + /** Must be zero. */
>>>>> + __u32 rsvd;
>>>>> +};
>>>>> +
>>>>> /**
>>>>> * struct drm_i915_query_engine_info
>>>>> *
>>>>
>>>>
>>
More information about the Intel-gfx
mailing list