[Intel-gfx] [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Apr 25 12:10:34 UTC 2017


On 25/04/2017 12:35, Chris Wilson wrote:
> On Tue, Apr 25, 2017 at 12:13:04PM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>
>> Tool which emits batch buffers to engines with configurable
>> sequences, durations, contexts, dependencies and userspace waits.
>>
>> Unfinished but shows promise so sending out for early feedback.
>>
>> v2:
>>  * Load workload descriptors from files. (also -w)
>>  * Help text.
>>  * Calibration control if needed. (-t)
>>  * NORELOC | LUT to eb flags.
>>  * Added sample workload to wsim/workload1.
>>
>> v3:
>>  * Multiple parallel different workloads (-w -w ...).
>>  * Multi-context workloads.
>>  * Variable (random) batch length.
>>  * Load balancing (round robin and queue depth estimation).
>>  * Workloads delays and explicit sync steps.
>>  * Workload frequency (period) control.
>>
>> v4:
>>  * Fixed queue-depth estimation by creating separate batches
>>    per engine when qd load balancing is on.
>>  * Dropped separate -s cmd line option. It can turn itself on
>>    automatically when needed.
>>  * Keep a single status page and lie about the write hazard
>>    as suggested by Chris.
>>  * Use batch_start_offset for controlling the batch duration.
>>    (Chris)
>>  * Set status page object cache level. (Chris)
>>  * Moved workload description to a README.
>>  * Tidied example workloads.
>>  * Some other cleanups and refactorings.
>>
>> v5:
>>  * Master and background workloads (-W / -w).
>>  * Single batch per step is enough even when balancing. (Chris)
>>  * Use hars_petruska_f54_1_random IGT functions and seed to zero
>>    at start. (Chris)
>>  * Use WC cache domain when WC mapping. (Chris)
>>  * Keep seqnos 64-bytes apart in the status page. (Chris)
>>  * Add workload throttling and queue-depth throttling commands.
>>    (Chris)
>>
>> v6:
>>  * Added two more workloads.
>>  * Merged RT balancer from Chris.
>>
>> TODO list:
>
> * No reloc!
> * bb caching/reuse

Yeah I know, but have to progress the overall case as well and I am 
thinking it is getting close to good enough now. So now is the time to 
think of interesting workloads, and workload combinations.

>>  * Fence support.
>>  * Better error handling.
>>  * Less 1980's workload parsing.
>>  * More workloads.
>>  * Threads?
>>  * ... ?
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin at intel.com>
>> ---
>
>> +static enum intel_engine_id
>> +rt_balance(const struct workload_balancer *balancer,
>> +	   struct workload *wrk, struct w_step *w)
>> +{
>> +	enum intel_engine_id engine;
>> +	long qd[NUM_ENGINES];
>> +	unsigned int n;
>> +
>> +	igt_assert(w->engine == VCS);
>> +
>> +	/* Estimate the "speed" of the most recent batch
>> +	 *    (finish time - submit time)
>> +	 * and use that as an approximate for the total remaining time for
>> +	 * all batches on that engine. We try to keep the total remaining
>> +	 * balanced between the engines.
>> +	 */
>
> Next steps for this would be to move from an instantaneous speed to an
> average. I'm thinking something like an exponential decay moving average
> just to make the estimation more robust.

Do you think it would be OK to merge these two tools at this point and 
continue improving them in place?

Your balancer already looks like a solid step up from the queue-depth one. I 
checked today myself and, in what looks like a worst case of a VCS1 hog and 
a balancing workload running together, it gets the VCS2 utilisation to 
an impressive 85%.

As mentioned before those stats can now be collected easily with:

   trace.pl --trace gem_wsim ...; perf script | trace.pl

I need to start pinging the relevant people for help with creating 
relevant workloads and am also entertaining the idea of trying balancing 
via exporting the stats from i915 directly. Just to see if true vs 
estimated numbers would make a difference here.

>> +			if (qd_throttle > 0 && balancer && balancer->get_qd) {
>> +				unsigned int target;
>> +
>> +				for (target = wrk->nr_steps - 1; target > 0;
>> +				     target--) {
>
> I think this should skip other engines.
>
> if (target->engine != engine)
> 	continue;

If you say so. I don't have an opinion on it. Would it be useful to 
perhaps have both options - to throttle globally and per-engine? I could 
easily add two different workload commands for that.
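
As a rough sketch of what "both options" could mean, the snippet below counts outstanding steps either across all engines or only on one engine, and throttles when the count reaches the limit. It is a simplified standalone model, not the gem_wsim code: `struct sketch_step`, `count_in_flight()`, `should_throttle()` and the `in_flight` flag are all hypothetical, and only the newest-to-oldest scan and the engine check mirror the snippets quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical engine list, for this sketch only. */
enum sketch_engine { SK_RCS, SK_BCS, SK_VCS1, SK_VCS2, SK_VECS };

/* Hypothetical, simplified stand-in for a workload step. */
struct sketch_step {
	enum sketch_engine engine;
	bool in_flight;		/* submitted but not yet completed */
};

/*
 * Count outstanding steps, scanning from newest to oldest.
 * With per_engine set, steps on other engines are skipped.
 */
static unsigned int
count_in_flight(const struct sketch_step *steps, size_t nr_steps,
		enum sketch_engine engine, bool per_engine)
{
	unsigned int count = 0;
	size_t i;

	assert(steps || nr_steps == 0);

	for (i = nr_steps; i-- > 0; ) {
		if (per_engine && steps[i].engine != engine)
			continue;
		if (steps[i].in_flight)
			count++;
	}

	return count;
}

/* True when the chosen scope already has qd_throttle steps queued. */
static bool
should_throttle(const struct sketch_step *steps, size_t nr_steps,
		enum sketch_engine engine, bool per_engine,
		unsigned int qd_throttle)
{
	return qd_throttle > 0 &&
	       count_in_flight(steps, nr_steps, engine, per_engine) >=
	       qd_throttle;
}
```

With two separate workload commands, one would pass `per_engine = false` (global) and the other `per_engine = true`.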

Regards,

Tvrtko

