[Intel-gfx] [PATCH 08/17] drm/i915/selftests: Add request throughput measurement to perf
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Mar 10 11:58:26 UTC 2020
On 10/03/2020 11:09, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-03-10 10:38:21)
>>
>> On 06/03/2020 13:38, Chris Wilson wrote:
>>> +static int perf_many(void *arg)
>>> +{
>>> + struct perf_parallel *p = arg;
>>> + struct intel_engine_cs *engine = p->engine;
>>> + struct intel_context *ce;
>>> + IGT_TIMEOUT(end_time);
>>> + unsigned long count;
>>> + int err = 0;
>>> + bool busy;
>>> +
>>> + ce = intel_context_create(engine);
>>> + if (IS_ERR(ce))
>>> + return PTR_ERR(ce);
>>> +
>>> + err = intel_context_pin(ce);
>>> + if (err) {
>>> + intel_context_put(ce);
>>> + return err;
>>> + }
>>> +
>>> + busy = false;
>>> + if (intel_engine_supports_stats(engine) &&
>>> + !intel_enable_engine_stats(engine)) {
>>> + p->busy = intel_engine_get_busy_time(engine);
>>> + busy = true;
>>> + }
>>> +
>>> + count = 0;
>>> + p->time = ktime_get();
>>> + do {
>>> + struct i915_request *rq;
>>> +
>>> + rq = i915_request_create(ce);
>>> + if (IS_ERR(rq)) {
>>> + err = PTR_ERR(rq);
>>> + break;
>>> + }
>>> +
>>> + i915_request_add(rq);
>>
>> Any concerns on ring size here and maybe managing the wait explicitly?
>
> No concern, the intention is to flood the ring. If we are able to wait
> on the ring, we have succeeded in submitting faster than the engine can
> retire. (Which might be another issue for us to resolve, as it may be
> our own interrupt latency that is then the bottleneck.)
>
> If we did a sync0, sync1, many; that could give us some more insight
> into the interrupt latency in comparison to engine latency.
>
>>
>>> + count++;
>>> + } while (!__igt_timeout(end_time, NULL));
>>> + p->time = ktime_sub(ktime_get(), p->time);
>>> +
>>> + if (busy) {
>>> + p->busy = ktime_sub(intel_engine_get_busy_time(engine),
>>> + p->busy);
>>> + intel_disable_engine_stats(engine);
>>> + }
>>> +
>>> + err = switch_to_kernel_sync(ce, err);
>>> + p->runtime = intel_context_get_total_runtime_ns(ce);
>>> + p->count = count;
>>> +
>>> + intel_context_unpin(ce);
>>> + intel_context_put(ce);
>>> + return err;
>>> +}
>>> +
>>> +static int perf_parallel_engines(void *arg)
>>> +{
>>> + struct drm_i915_private *i915 = arg;
>>> + static int (* const func[])(void *arg) = {
>>> + perf_sync,
>>> + perf_many,
>>> + NULL,
>>> + };
>>> + const unsigned int nengines = num_uabi_engines(i915);
>>> + struct intel_engine_cs *engine;
>>> + int (* const *fn)(void *arg);
>>> + struct pm_qos_request *qos;
>>> + struct {
>>> + struct perf_parallel p;
>>> + struct task_struct *tsk;
>>> + } *engines;
>>> + int err = 0;
>>> +
>>> + engines = kcalloc(nengines, sizeof(*engines), GFP_KERNEL);
>>> + if (!engines)
>>> + return -ENOMEM;
>>> +
>>> + qos = kzalloc(sizeof(*qos), GFP_KERNEL);
>>> + if (qos)
>>> + pm_qos_add_request(qos, PM_QOS_CPU_DMA_LATENCY, 0);
>>> +
>>> + for (fn = func; *fn; fn++) {
>>> + char name[KSYM_NAME_LEN];
>>> + struct igt_live_test t;
>>> + unsigned int idx;
>>> +
>>> + snprintf(name, sizeof(name), "%ps", *fn);
>>
>> Is this any better than just storing the name in local static array?
>
> It's easier for sure, and since the name is already in a static array,
> why not use it :)
It looks weird: it needs KSYM_NAME_LEN of stack space and the special
%ps format. But okay.
>
>>> + err = igt_live_test_begin(&t, i915, __func__, name);
>>> + if (err)
>>> + break;
>>> +
>>> + atomic_set(&i915->selftest.counter, nengines);
>>> +
>>> + idx = 0;
>>> + for_each_uabi_engine(engine, i915) {
>>
>> For a pure driver overhead test I would suggest this to be a gt live test.
>
> It's a request performance test, so sits above the gt. My thinking is
> that this is a more of a high level request/scheduler test than
> execlists/guc (though it depends on those backends).
Okay, yeah, it makes sense.
>
>>> + intel_engine_pm_get(engine);
>>> +
>>> + memset(&engines[idx].p, 0, sizeof(engines[idx].p));
>>> + engines[idx].p.engine = engine;
>>> +
>>> + engines[idx].tsk = kthread_run(*fn, &engines[idx].p,
>>> + "igt:%s", engine->name);
>>
>> Test will get affected by the host CPU core count. How about we only
>> measure num_cpu engines? Might be even more important with discrete.
>
> No. We want to be able to fill the GPU with the different processors.
> Comparing glk to kbl helps highlight any inefficiencies we have -- we
> have to be efficient enough that core count is simply not a critical
> factor to offset our submission overhead.
>
> So we can run the same test and see how it scaled with engines vs cpus
> just by running it on different machines and look for problems.
Normally you would expect one core per engine to be enough to saturate
the engine. I am afraid adding more combinations will make the test
results confusing to read. (Same GPU, same engine count, different CPU
core count.) How about two subtest variants? One with a 1:1 CPU core to
engine ratio, and another with all engines as here?
Or possibly:
1. 1 CPU core - 1 engine - purest latency/overhead
2. 1 CPU core - N engines (N = all engines) - more
3. N CPU cores - N engines (N = min(engines, cores)) - global lock
contention, stable setup
4. M CPU cores - N engines (N, M = max) - lock contention stress
5. N CPU cores - 1 engine (N = all cores) - more extreme lock contention
Regards,
Tvrtko