[Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Apr 7 08:53:05 UTC 2017
On 06/04/2017 09:55, Chris Wilson wrote:
> On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
[snip]
>>>> + j++;
>>>> + }
>>>> +
>>>> + bb_i = j++;
>>>> + w->duration.cur = w->duration.max;
>>>> + w->bb_sz = get_bb_sz(&w->duration);
>>>> + w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
>>>> + terminate_bb(w, seqnos, 0);
>>>> + if (seqnos) {
>>>> + w->reloc.presumed_offset = -1;
>>>> + w->reloc.target_handle = 1;
>>>> + w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
>>>> + w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
>>>
>>> Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
>>> to set write_domain here anyway.
>>
>> I think I copy-pasted this from another IGT. So you are saying to
>> cheat here as well and set zero for both domains?
>
> Technically the MI is outside of all the GPU cache domains we have :)
> Which you pick is immaterial, aside from understanding that
> (INSTRUCTION, INSTRUCTION) is special ;)
>
> If you were to drop EXEC_OBJECT_WRITE, you would also drop
> reloc.write_domain.
Okay, I will try the cheating approach then.
>>>> + }
>>>> +
>>>> + igt_assert(w->dependency <= 0);
>>>> + if (w->dependency) {
>>>> + int dep_idx = i + w->dependency;
>>>> +
>>>> + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
>>>> + igt_assert(wrk->steps[dep_idx].type == BATCH);
>>>> +
>>>> + w->obj[j].handle = w->obj[bb_i].handle;
>>>> + bb_i = j;
>>>> + w->obj[j - 1].handle =
>>>> + wrk->steps[dep_idx].obj[0].handle;
>>>> + j++;
>>>> + }
>>>> +
>>>> + if (seqnos) {
>>>> + w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
>>>> + w->obj[bb_i].relocation_count = 1;
>>>> + }
>>>> +
>>>> + w->eb.buffers_ptr = to_user_pointer(w->obj);
>>>> + w->eb.buffer_count = j;
>>>> + w->eb.rsvd1 = wrk->ctx_id[w->context];
>>>> +
>>>> + if (swap_vcs && engine == VCS1)
>>>> + engine = VCS2;
>>>> + else if (swap_vcs && engine == VCS2)
>>>> + engine = VCS1;
>>>> + w->eb.flags = eb_engine_map[engine];
>>>> + w->eb.flags |= I915_EXEC_HANDLE_LUT;
>>>> + if (!seqnos)
>>>> + w->eb.flags |= I915_EXEC_NO_RELOC;
>>>
>>> Doesn't look too hard to get the relocation right. Forcing relocations
>>> between batches is probably a good one to check (just to say don't do
>>> that)
>>
>> I am not following you here. Are you saying not to do relocations at
>> all? How do I make sure things stay fixed, and how do I even find out
>> where they are in the first pass?
>
> Depending on the workload, it may be informative to also do comparisons
> between NORELOC and always RELOC. Personally I would make sure we were
> using NORELOC as this should be a simulator/example.
How do I use NORELOC? I mean, I have to know where the objects will be
pinned, or be able to pin them first and know they will stay put. What
am I not understanding here?
>>>> +static void update_bb_seqno(struct w_step *w, uint32_t seqno)
>>>> +{
>>>> + unsigned long mmap_start, mmap_offset, mmap_len;
>>>> + void *ptr;
>>>> +
>>>> + mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
>>>> + mmap_offset = w->seqno_offset - mmap_start;
>>>> + mmap_len = sizeof(uint32_t) + mmap_offset;
>>>> +
>>>> + gem_set_domain(fd, w->bb_handle,
>>>> + I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
>>>> +
>>>> + ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
>>>> +
>>>> + *(uint32_t *)((char *)ptr + mmap_offset) = seqno;
>>>
>>> Uh oh. I hope this isn't called inside any loop. Note this is
>>> unsynchronized to the gpu so I wonder what this is for.
>>
>> To update the seqno inside the store_dword_imm. It is called every
>> time before a batch is executed, so I was wondering whether a gem_sync
>> should precede it. But then I was thinking it is problematic in
>> general if we queue up multiple identical batches before they get
>> executed. :( It sounds like I would need a separate batch for every
>> iteration for this to work correctly. But that sounds too costly, so
>> I don't know at the moment.
>
> mmap/munmap, especially munmap, is not free. The munmap will do a
> tlb_flush across all cores -- though maybe that's batched and the
> munmaps I do all tend to be large enough to trigger every time.
>
> Since you are using a CPU write, on !llc this will be clflushing
> every time. I would suggest stashing the gem_mmap__wc for updating the
> seqno between repeats.
Ok, I can try that approach.
> [snip]
>
>>> I need to study this a bit more...
>>
>> Yes please, especially the bit about how to get accurate seqnos
>> written out in each step without needing separate execbuf batches.
>>
>> I've heard recursive batches mentioned in the past so maybe each
>> iteration could have its own small batch which would jump to the
>> nop/delay one (shared between all iterations) and write the unique
>> seqno. No idea if that is possible/supported at the moment - I'll go
>> and dig a bit.
>
> You end up with the same problem of having the reloc change and need to
> update every cycle. You could use a fresh batch to rewrite the seqno
> values... However, now that you explained what you want, just keep the
> WC mmap.
Hm, I am not sure without researching that approach first.
But in general, is this correctly implementing your idea for queue-depth
estimation?
Regards,
Tvrtko