[PATCH v9] tests/intel/xe_exec_capture: Add xe_exec_capture test

Thu Dec 12 19:54:58 UTC 2024

Zhanjun, per offline chats with Kamil, it looks like we need to expand the
igt_fixture sections before and after the igt_subtest section: save
the per-engine timeouts in the initial fixture and restore them
in the later fixture, because a fixture section is not bypassed
when an assert fires. That's what I understood.
That said, we will need another rev of this.
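
Roughly what I have in mind, as a standalone sketch rather than real IGT
code. The array sysfs_timeout_ms only stands in for the per-engine
job_timeout_ms sysfs entries, and NUM_ENGINES, timeouts_save(),
timeouts_apply() and timeouts_restore() are made-up names for illustration.
In the test itself, the save and apply steps would sit in the igt_fixture
before the subtests and the restore step in the igt_fixture after them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_ENGINES         4    /* hypothetical engine count */
#define CAPTURE_JOB_TIMEOUT 2000 /* ms, same value as in the patch */

/* Stand-in for the per-engine job_timeout_ms sysfs entries (assumed values). */
static uint64_t sysfs_timeout_ms[NUM_ENGINES] = { 5000, 5000, 10000, 5000 };

/* Initial fixture: remember every engine's original timeout. */
static void timeouts_save(uint64_t saved[], size_t n)
{
	assert(n <= NUM_ENGINES);
	for (size_t i = 0; i < n; i++)
		saved[i] = sysfs_timeout_ms[i];
}

/* Initial fixture: shorten every engine's timeout for the test run. */
static void timeouts_apply(uint64_t timeout, size_t n)
{
	assert(n <= NUM_ENGINES);
	for (size_t i = 0; i < n; i++)
		sysfs_timeout_ms[i] = timeout;
}

/* Later fixture: put the saved values back, whatever the subtests did. */
static void timeouts_restore(const uint64_t saved[], size_t n)
{
	assert(n <= NUM_ENGINES);
	for (size_t i = 0; i < n; i++)
		sysfs_timeout_ms[i] = saved[i];
}
```

The point of splitting it this way is that the restore no longer lives in
the per-engine loop next to code that can assert, so a failing subtest
should not leave the shortened timeout behind.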

On Wed, 2024-12-11 at 14:08 -0800, Teres Alexis, Alan Previn wrote:
> Just re-RB-ing after the recent change that manually sets the engine job timeout before running the tests
> on each engine, in order to limit the execution time of this test:
> 
> Reviewed-by: Alan Previn <alan.previn.teres.alexis at intel.com>
> 
> 
> On Fri, 2024-12-06 at 14:59 -0800, Dong, Zhanjun wrote:
> > Submit cmds to the GPU that result in a GuC engine reset and check that
> > a devcoredump register dump is generated by the GuC and includes the
> > full register range.
> > 
> > Signed-off-by: Zhanjun Dong <zhanjun.dong at intel.com>
> > Cc: Alan Previn <alan.previn.teres.alexis at intel.com>
> > Cc: Kamil Konieczny <kamil.konieczny at linux.intel.com>
> > ---
> > Changes from prior revs:
> >  v9:-  Reduced job timeout to 2 seconds to speed up the test
> >        Added info print to show whether the test is running on a single GPU or multiple GPUs
> >  v8:-  Move change list below ---
> 
> 
> alan: I just reviewed the difference between the last two revs (diff of diffs
> farther below).
> With that change, we hope to address Kamil's concern by reducing the execution
> time dramatically. IIRC Zhanjun couldn't designate any single subtest to declare pass
> or fail without running multiple engines back to back, since the
> test needs to ensure that XE-KMD catches the correct guc-error-dump for the
> exact batch on the exact engine we expect, amid multiple back-to-back
> runs of different-batches-same-engine vs different-engines. (The test uses the ring
> buffer batch buffer address as a way to differentiate and determine this precisely.)
> 
> 
> 28a29
> > +#include "igt_sysfs.h"
> 37a39,40
> > +#define CAPTURE_JOB_TIMEOUT            2000
> > +#define JOB_TIMOUT_ENTRY               "job_timeout_ms"
> 83a87,109
> > +static u64
> > +xe_sysfs_get_job_timeout_ms(int fd, struct drm_xe_engine_class_instance *eci)
> > +{
> > +       int engine_fd = -1;
> > +       u64 ret;
> > +
> > +       engine_fd = xe_sysfs_engine_open(fd, eci->gt_id, eci->engine_class);
> > +       ret = igt_sysfs_get_u64(engine_fd, JOB_TIMOUT_ENTRY);
> > +       close(engine_fd);
> > +
> > +       return ret;
> > +}
> > +
> > +static void xe_sysfs_set_job_timeout_ms(int fd, struct drm_xe_engine_class_instance *eci,
> > +                                       u64 timeout)
> > +{
> > +       int engine_fd = -1;
> > +
> > +       engine_fd = xe_sysfs_engine_open(fd, eci->gt_id, eci->engine_class);
> > +       igt_sysfs_set_u64(engine_fd, JOB_TIMOUT_ENTRY, timeout);
> > +       close(engine_fd);
> > +}
> > +
> 
> ...
> 
> >         xe_for_each_engine(fd, hwe) {
> >                 /*
> >                  * To test devcoredump register data, the test batch address is
> >                  * compared with the dump. Address bits 40 to 46 act as a
> >                  * context id, which starts at a random number and increases by 1
> >                  * per engine. This way, the address is unique for each
> >                  * engine and starts at a random number on each run.
> >                  */
> >                 const u64 addr = BASE_ADDRESS | ((u64)(engine_cid++ % CID_ADDRESS_MASK) <<
> >                                                  ADDRESS_SHIFT);
> 413a440
> > +               u64 job_timeout = xe_sysfs_get_job_timeout_ms(fd, hwe);
> 417a445,447
> > +               /* Reduce timeout value to speed up the test */
> > +               xe_sysfs_set_job_timeout_ms(fd, hwe, CAPTURE_JOB_TIMEOUT);
> > +
> 419a450,452
> > +               /* Restore timeout value */
> > +               xe_sysfs_set_job_timeout_ms(fd, hwe, job_timeout);
> > +
> 460a494,495
> > +                       igt_info("Running test on multiple GPU\n");
> > +
> 473a509
> > +                       igt_info("Running test on single GPU\n");
>
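
For anyone following along, the per-engine address tagging in the
xe_for_each_engine loop quoted above can be sketched on its own as below.
The values of BASE_ADDRESS, CID_ADDRESS_MASK and ADDRESS_SHIFT are not
shown in this diff, so the ones here are assumptions picked to match the
"bit 40 to 46" comment, and batch_addr_for() / cid_from_addr() are
hypothetical helper names, not functions from the patch.

```c
#include <stdint.h>

/* Assumed values: the real definitions are not part of the quoted diff. */
#define BASE_ADDRESS     0x1a0000ULL
#define CID_ADDRESS_MASK 0x7F /* 7 bits, i.e. address bits 40..46 */
#define ADDRESS_SHIFT    40

/* Same expression as in the patch: tag the batch address with a context id. */
static uint64_t batch_addr_for(unsigned int engine_cid)
{
	return BASE_ADDRESS |
	       ((uint64_t)(engine_cid % CID_ADDRESS_MASK) << ADDRESS_SHIFT);
}

/* Recover the context id from an address, e.g. one read back from a dump. */
static unsigned int cid_from_addr(uint64_t addr)
{
	return (unsigned int)((addr >> ADDRESS_SHIFT) & CID_ADDRESS_MASK);
}
```

Note that the patch takes the context id modulo CID_ADDRESS_MASK rather
than AND-ing with it. Under the assumed value of 0x7F that yields ids
0..126, so two engines would only collide if their ids were 127 apart.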