[igt-dev] [PATCH i-g-t 4/6] igt/gem_exec_nop: Drip feed nops

Chris Wilson chris at chris-wilson.co.uk
Wed Jun 20 09:12:58 UTC 2018


Quoting Katarzyna Dec (2018-06-20 09:31:40)
> On Tue, Jun 19, 2018 at 11:49:18AM +0100, Chris Wilson wrote:
> Few questions below.
> 
> > Wait until the previous nop batch is running before submitting the next.
> > This prevents the kernel from batching up sequential requests into a
> > ringful, more strenuously exercising the "lite-restore" execution path.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >  tests/gem_exec_nop.c | 146 +++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 142 insertions(+), 4 deletions(-)
> > 
> > diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
> > index 50f0a3aad..0523b1c02 100644
> > --- a/tests/gem_exec_nop.c
> > +++ b/tests/gem_exec_nop.c
> > @@ -104,6 +104,129 @@ static double nop_on_ring(int fd, uint32_t handle, unsigned ring_id,
> >       return elapsed(&start, &now);
> >  }
> >  
> > +static void poll_ring(int fd, unsigned ring, const char *name, int timeout)
> > +{
> > +     const int gen = intel_gen(intel_get_drm_devid(fd));
> > +     const uint32_t MI_ARB_CHK = 0x5 << 23;
> > +     struct drm_i915_gem_execbuffer2 execbuf;
> > +     struct drm_i915_gem_exec_object2 obj;
> > +     struct drm_i915_gem_relocation_entry reloc[4], *r;
> > +     uint32_t *bbe[2], *state, *batch;
> > +     unsigned engines[16], nengine, flags;
> > +     struct timespec tv = {};
> > +     unsigned long cycles;
> > +     uint64_t elapsed;
> > +
> > +     flags = I915_EXEC_NO_RELOC;
> This flag means we will prepare relocations table for kernel?

No. It means that the contents of the batch buffer match
reloc.presumed_offset + reloc.delta, which in turn matches obj.offset.
Then, if obj.offset matches the final location, the kernel knows it
doesn't have to process the reloc[]. On the first pass the kernel will
have to patch things up, but after that we don't even have to check the
four reloc entries on every pass.

The goal is not to measure the reloc patching overhead, but how long it
takes to do a series of "lite-restores".

> > +     if (gen == 4 || gen == 5)
> > +             flags |= I915_EXEC_SECURE;
> > +
> > +     nengine = 0;
> > +     if (ring == ALL_ENGINES) {
> > +             for_each_physical_engine(fd, ring) {
> > +                     if (!gem_can_store_dword(fd, ring))
> > +                             continue;
> > +
> > +                     engines[nengine++] = ring;
> > +             }
> > +     } else {
> > +             gem_require_ring(fd, ring);
> > +             igt_require(gem_can_store_dword(fd, ring));
> > +             engines[nengine++] = ring;
> > +     }
> > +     igt_require(nengine);
> > +
> > +     memset(&obj, 0, sizeof(obj));
> > +     obj.handle = gem_create(fd, 4096);
> > +     obj.relocs_ptr = to_user_pointer(reloc);
> > +     obj.relocation_count = ARRAY_SIZE(reloc);
> > +
> > +     r = memset(reloc, 0, sizeof(reloc));
> > +     batch = gem_mmap__wc(fd, obj.handle, 0, 4096, PROT_WRITE);
> > +
> > +     for (unsigned int start_offset = 0;
> > +          start_offset <= 128;
> > +          start_offset += 128) {
> It looks like this loop will run only once. Why to use such 'strange'
> values and why we need loop here?

Twice: start_offset takes the values 0 and 128, building one batch per
iteration.

> > +             uint32_t *b = batch + start_offset / sizeof(*batch);
> I am curious why in b we add batch and below in r->offset we subtract it?

Just a generalised means of finding the byte offset from the start of
the bo.

> > +
> > +             r->target_handle = obj.handle;
> > +             r->offset = (b - batch + 1) * sizeof(uint32_t);
> 
> > +             r->delta = 4092;
> > +             r->read_domains = I915_GEM_DOMAIN_RENDER;
> > +
> > +             *b = MI_STORE_DWORD_IMM | (gen < 6 ? 1 << 22 : 0);
> > +             if (gen >= 8) {
> > +                     *++b = r->delta;
> > +                     *++b = 0;
> > +             } else if (gen >= 4) {
> > +                     r->offset += sizeof(uint32_t);
> > +                     *++b = 0;
> > +                     *++b = r->delta;
> > +             } else {
> > +                     *b -= 1;
> > +                     *++b = r->delta;
> > +             }
> > +             *++b = start_offset != 0;
> > +             r++;
> > +
> Could you explain why we need such 'hacky' batch settings?^^^

We flip the value written between 1/0 so we can wait for each batch to
start.

> > +             b = batch + (start_offset + 64) / sizeof(*batch);
> > +             bbe[start_offset != 0] = b;
> > +             *b++ = MI_ARB_CHK;
> > +
> > +             r->target_handle = obj.handle;
> > +             r->offset = (b - batch + 1) * sizeof(uint32_t);
> > +             r->read_domains = I915_GEM_DOMAIN_COMMAND;

> Why do we need to change domain from render to command?

It's or'ed into the object's read domains. It's also entirely
irrelevant, as the kernel only stores a write bit.

> > +             r->delta = start_offset + 64;
> > +             if (gen >= 8) {
> > +                     *b++ = MI_BATCH_BUFFER_START | 1 << 8 | 1;
> > +                     *b++ = r->delta;
> > +                     *b++ = 0;
> > +             } else if (gen >= 6) {
> > +                     *b++ = MI_BATCH_BUFFER_START | 1 << 8;
> > +                     *b++ = r->delta;
> > +             } else {
> > +                     *b++ = MI_BATCH_BUFFER_START | 2 << 6;
> > +                     if (gen < 4)
> > +                             r->delta |= 1;
> > +                     *b++ = r->delta;
> > +             }
> > +             r++;
> > +     }
> > +     igt_assert(r == reloc + ARRAY_SIZE(reloc));
> > +     state = batch + 1023;
> > +
> > +     memset(&execbuf, 0, sizeof(execbuf));
> > +     execbuf.buffers_ptr = to_user_pointer(&obj);
> If I understand correctly obj is 'containing' previously prepared batch, right?

Obj is the pair of batches, plus the status dword.
-Chris
