[Intel-gfx] [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Apr 7 08:53:05 UTC 2017
On 06/04/2017 09:55, Chris Wilson wrote:
> On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
[snip]
>>>> + j++;
>>>> + }
>>>> +
>>>> + bb_i = j++;
>>>> + w->duration.cur = w->duration.max;
>>>> + w->bb_sz = get_bb_sz(&w->duration);
>>>> + w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
>>>> + terminate_bb(w, seqnos, 0);
>>>> + if (seqnos) {
>>>> + w->reloc.presumed_offset = -1;
>>>> + w->reloc.target_handle = 1;
>>>> + w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
>>>> + w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
>>>
>>> Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
>>> to set write_domain here anyway.
>>
>> I think I copy-pasted this from another IGT. So you are saying to
>> cheat here as well and set zero for both domains?
>
> Technically the MI is outside of all the GPU cache domains we have :)
> Which you pick is immaterial, aside from understanding that
> (INSTRUCTION, INSTRUCTION) is special ;)
>
> If you were to drop EXEC_OBJECT_WRITE, you would also drop
> reloc.write_domain.
Okay, I will try the cheating approach then.
>>>> + }
>>>> +
>>>> + igt_assert(w->dependency <= 0);
>>>> + if (w->dependency) {
>>>> + int dep_idx = i + w->dependency;
>>>> +
>>>> + igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
>>>> + igt_assert(wrk->steps[dep_idx].type == BATCH);
>>>> +
>>>> + w->obj[j].handle = w->obj[bb_i].handle;
>>>> + bb_i = j;
>>>> + w->obj[j - 1].handle =
>>>> + wrk->steps[dep_idx].obj[0].handle;
>>>> + j++;
>>>> + }
>>>> +
>>>> + if (seqnos) {
>>>> + w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
>>>> + w->obj[bb_i].relocation_count = 1;
>>>> + }
>>>> +
>>>> + w->eb.buffers_ptr = to_user_pointer(w->obj);
>>>> + w->eb.buffer_count = j;
>>>> + w->eb.rsvd1 = wrk->ctx_id[w->context];
>>>> +
>>>> + if (swap_vcs && engine == VCS1)
>>>> + engine = VCS2;
>>>> + else if (swap_vcs && engine == VCS2)
>>>> + engine = VCS1;
>>>> + w->eb.flags = eb_engine_map[engine];
>>>> + w->eb.flags |= I915_EXEC_HANDLE_LUT;
>>>> + if (!seqnos)
>>>> + w->eb.flags |= I915_EXEC_NO_RELOC;
>>>
>>> Doesn't look too hard to get the relocation right. Forcing relocations
>>> between batches is probably a good one to check (just to say don't do
>>> that)
>>
>> I am not following you here. Are you saying not to do relocations at
>> all? How do I make sure things stay fixed, and how do I even find out
>> where they are in the first pass?
>
> Depending on the workload, it may be informative to also do comparisons
> between NORELOC and always RELOC. Personally I would make sure we were
> using NORELOC as this should be a simulator/example.
How do I use NORELOC? I mean, I have to know where the objects will be
pinned, or be able to pin them first and know they will stay put. What
am I not understanding here?
>>>> +static void update_bb_seqno(struct w_step *w, uint32_t seqno)
>>>> +{
>>>> + unsigned long mmap_start, mmap_offset, mmap_len;
>>>> + void *ptr;
>>>> +
>>>> + mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
>>>> + mmap_offset = w->seqno_offset - mmap_start;
>>>> + mmap_len = sizeof(uint32_t) + mmap_offset;
>>>> +
>>>> + gem_set_domain(fd, w->bb_handle,
>>>> + I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
>>>> +
>>>> + ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
>>>> +
>>>> + *(uint32_t *)((char *)ptr + mmap_offset) = seqno;
>>>
>>> Uh oh. I hope this isn't called inside any loop. Note this is
>>> unsynchronized to the gpu so I wonder what this is for.
>>
>> To update the seqno inside the store_dword_imm. It is called every
>> time before a batch is executed, so I was wondering whether a gem_sync
>> should precede it. But then I was thinking it is problematic in
>> general if we queue up multiple identical batches before they get
>> executed. :( It sounds like I would need a separate batch for every
>> iteration for this to work correctly. But that sounds too costly, so
>> I don't know at the moment.
>
> mmap/munmap, especially munmap, is not free. The munmap will do a
> tlb_flush across all cores -- though maybe that's batched and the
> munmaps I do all tend to be large enough to trigger every time.
>
> Since you are using a CPU write, on !llc this will be clflushing
> every time. I would suggest stashing the gem_mmap__wc for updating the
> seqno between repeats.
Ok, I can try that approach.
> [snip]
>
>>> I need to study this a bit more...
>>
>> Yes please, especially the bit about how to get accurate seqnos
>> written out in each step without needing separate execbuf batches.
>>
>> I've heard recursive batches mentioned in the past so maybe each
>> iteration could have its own small batch which would jump to the
>> nop/delay one (shared between all iterations) and write the unique
>> seqno. No idea if that is possible/supported at the moment - I'll go
>> and dig a bit.
>
> You end up with the same problem of having the reloc change and need to
> update every cycle. You could use a fresh batch to rewrite the seqno
> values... However, now that you explained what you want, just keep the
> WC mmap.
Hm, I am not sure without researching that approach first.
But in general, is this correctly implementing your idea for queue-depth
estimation?
Regards,
Tvrtko