[PATCH i-g-t v4 1/1] tests/intel/xe_exec_capture: Add xe_exec_capture test
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Wed Nov 13 23:23:28 UTC 2024
On Tue, 2024-11-12 at 23:48 -0800, Teres Alexis, Alan Previn wrote:
> On Tue, 2024-10-22 at 09:33 -0700, Zhanjun Dong wrote:
> > Test with GuC reset, check if devcoredump register dump is within the
> > + sync[0].flags &= ~DRM_XE_SYNC_FLAG_SIGNAL;
> > + sync[1].flags |= DRM_XE_SYNC_FLAG_SIGNAL;
> > + sync[1].handle = syncobjs[e];
> > +
> > + exec.exec_queue_id = exec_queues[e];
> > + exec.address = exec_addr;
> > + if (e != i)
> > + syncobj_reset(fd, &syncobjs[e], 1);
> > + xe_exec(fd, &exec);
> > + }
> alan: this code is new to me, so to help me understand, can you confirm
> whether i got the above loop and the sync-waits below right?:
> 1. we send n_execs batches across n_exec_queues queues (on the same engine)
> 2. then, below, we wait on each one of those batches to start.
> 3. and finally we wait for the vm to unbind?
>
> However, i notice from the caller that you are only using a batch count of one
> and a queue count of one. That said, i am wondering: if the batch buffer is a
> spinner, and this igt test could potentially be running alone without any other
> workloads, how would a reset be triggered? i thought that if we only have one
> workload running with nothing else queued, the GuC won't have a reason to
> preempt the work. I also thought we don't have a heartbeat in xe...
> am i mistaken? how do we guarantee that an engine reset occurs?
>
> let's connect offline as i am a bit lost in some of this code.
alan: as per the offline follow-up and after consulting others, we now know why this
batch actually gets reset despite being the only batch on the only queue of the only
engine running at this point in the test for the entire card: it's the drm subsystem's
scheduler that has a job timeout and will request that the job be killed if it has not
completed within that time. We need to follow up on whether there is a way to configure
this timeout, because we don't want our test to be at the mercy of different distros or
customer systems that may carry different timeouts and make our test execution
inconsistent (a rough sketch of one possible knob is at the bottom of this mail).
I assume the first set of syncobj-waits guarantees that the task has started, so we
could even use a very short timeout, like just one second, since the workload is after
all just a spinner (a "store-dword, batch-buffer-start" loop that jumps back to itself).
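
alan: for reference (and not something in this patch), one way to make "the task has
started" explicit rather than inferred from the fences would be IGT's xe_spin helpers.
Below is a rough, untested sketch assuming the lib/xe/xe_spin.h interface
(xe_spin_init / xe_spin_wait_started) works the way i remember, and that the spinner
bo is already CPU-mapped at "spin" and GPU-mapped at "spin_addr" in the vm bound to
the exec queue; the function and parameter names are illustrative only:

/*
 * Rough sketch only (not part of this patch): observe explicitly that the
 * batch has started executing by polling the spinner's "started" dword,
 * instead of inferring it from the syncobjs.
 */
#include "igt.h"
#include "igt_syncobj.h"
#include "xe_drm.h"
#include "xe/xe_ioctl.h"
#include "xe/xe_spin.h"

static void run_one_spinner(int fd, uint32_t exec_queue,
			    struct xe_spin *spin, uint64_t spin_addr)
{
	struct xe_spin_opts spin_opts = { .addr = spin_addr, .preempt = false };
	struct drm_xe_sync sync = {
		.type = DRM_XE_SYNC_TYPE_SYNCOBJ,
		.flags = DRM_XE_SYNC_FLAG_SIGNAL,
		.handle = syncobj_create(fd, 0),
	};
	struct drm_xe_exec exec = {
		.num_batch_buffer = 1,
		.num_syncs = 1,
		.syncs = to_user_pointer(&sync),
		.exec_queue_id = exec_queue,
		.address = spin_addr,
	};

	/* emit the spinning batch into the CPU-mapped bo */
	xe_spin_init(spin, &spin_opts);
	xe_exec(fd, &exec);

	/* busy-wait on the spinner's "started" dword: the job is now on hw */
	xe_spin_wait_started(spin);

	/*
	 * The spinner never ends on its own, so this wait only returns once
	 * the drm scheduler's job timeout kicks in and the job is killed.
	 */
	igt_assert(syncobj_wait(fd, &sync.handle, 1, INT64_MAX, 0, NULL));
	syncobj_destroy(fd, sync.handle);
}

with something like the above, the only remaining variable is how long the scheduler
takes to kill the spinner, which is the timeout question discussed above.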
> > +
> > + for (i = 0; i < n_exec_queues && n_execs; i++)
> > + igt_assert(syncobj_wait(fd, &syncobjs[i], 1, INT64_MAX, 0,
> > + NULL));
> > + igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
> > +
> > + sync[0].flags |= DRM_XE_SYNC_FLAG_SIGNAL;
> > + xe_vm_unbind_async(fd, vm, 0, 0, addr, bo_size, sync, 1);
> > + igt_assert(syncobj_wait(fd, &sync[0].handle, 1, INT64_MAX, 0, NULL));
> > +
> > + syncobj_destroy(fd, sync[0].handle);
> > + for (i = 0; i < n_exec_queues; i++) {
> > + syncobj_destroy(fd, syncobjs[i]);
> > + xe_exec_queue_destroy(fd, exec_queues[i]);
> > + }
> > +
> > + munmap(data, bo_size);
> > + gem_close(fd, bo);
> > + xe_vm_destroy(fd, vm);
> > +}
> > +
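
alan: also, as a follow-up to the timeout question above: i believe the xe driver
exposes per-engine-class scheduling attributes in sysfs (the ones that
tests/intel/xe_sysfs_scheduler.c pokes at), so something along these lines could pin
the job timeout to a known value for the duration of the test. Rough sketch only; the
directory layout (tile/gt/engine-class) and the attribute name are assumptions from
memory and need to be verified, and the old value would have to be restored on exit:

/*
 * Rough sketch, not part of this patch: force a known drm scheduler job
 * timeout for one engine class so the spinner is killed after a fixed,
 * short time.  The "device/tile0/gt0/engines/<class>/job_timeout_ms"
 * layout is an assumption -- verify against xe_sysfs_scheduler before
 * relying on it.
 */
#include <stdio.h>
#include <unistd.h>

#include "igt.h"
#include "igt_sysfs.h"

static uint32_t force_job_timeout_ms(int xe_fd, const char *engine_class,
				     uint32_t timeout_ms)
{
	char attr[128];
	uint32_t old;
	int sysfs;

	sysfs = igt_sysfs_open(xe_fd);		/* /sys/class/drm/cardN */
	igt_assert_fd(sysfs);

	snprintf(attr, sizeof(attr),
		 "device/tile0/gt0/engines/%s/job_timeout_ms", engine_class);

	old = igt_sysfs_get_u32(sysfs, attr);	/* save for restore later */
	igt_assert(igt_sysfs_set_u32(sysfs, attr, timeout_ms));

	close(sysfs);
	return old;
}

usage would be something like old = force_job_timeout_ms(fd, "rcs", 1000) in the
fixture (engine-class directory name illustrative), with old written back in the
cleanup path.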