[igt-dev] [RFC PATCH] i915/gem_tiled_blits: Switch from libdrm methods to igt lib

Chris Wilson chris at chris-wilson.co.uk
Wed Nov 13 19:58:26 UTC 2019


Quoting Vanshidhar Konda (2019-11-13 18:15:55)
> On Wed, Nov 13, 2019 at 09:51:01AM +0000, Chris Wilson wrote:
> >Quoting Vanshidhar Konda (2019-11-13 04:31:50)
> >> Switch the test from using libdrm methods to using methods provided by
> >> the igt library. Like some of the other gem_tiled* tests, this test also
> >> creates the batch buffer used to do the blitter copies. Also, the test
> >> avoids calling GET/SET_TILING IOCTLs - something that will not be
> >> supported on Gen12+.
> >>
> >> Signed-off-by: Vanshidhar Konda <vanshidhar.r.konda at intel.com>
> >> ---
> >>  tests/i915/gem_tiled_blits.c | 176 +++++++++++++++++++++++++----------
> >>  1 file changed, 126 insertions(+), 50 deletions(-)
> >>
> >> diff --git a/tests/i915/gem_tiled_blits.c b/tests/i915/gem_tiled_blits.c
> >> index df0699f3..b72ec600 100644
> >> --- a/tests/i915/gem_tiled_blits.c
> >> +++ b/tests/i915/gem_tiled_blits.c
> >> @@ -57,52 +57,135 @@
> >>  IGT_TEST_DESCRIPTION("Test doing many tiled blits, with a working set larger"
> >>                      " than the aperture size.");
> >>
> >> -static drm_intel_bufmgr *bufmgr;
> >> -struct intel_batchbuffer *batch;
> >>  static int width = 512, height = 512;
> >> +int device_gen;
> >>
> >> -static drm_intel_bo *
> >> -create_bo(uint32_t start_val)
> >> +static void
> >> +copy(int fd, uint32_t dst, uint32_t dst_tiling,
> >> +     uint32_t src, uint32_t src_tiling)
> >>  {
> >> -       drm_intel_bo *bo, *linear_bo;
> >> +       uint32_t batch[12];
> >
> >Be considerate and think in cachelines.
> 
> Thanks for pointing that out. I'll keep it in mind.
> 
> >
> >> +       struct drm_i915_gem_relocation_entry reloc[2];
> >> +       struct drm_i915_gem_exec_object2 obj[3];
> >> +       struct drm_i915_gem_execbuffer2 exec;
> >> +       int src_pitch, dst_pitch;
> >> +       int tile_height, tile_width;
> >> +       int i = 0;
> >> +
> >> +       src_pitch = 4096;
> >> +       dst_pitch = 4096;
> >> +       tile_width = 1024;
> >> +       tile_height = width*height*4/4096;
> >> +
> >> +       batch[i++] = XY_SRC_COPY_BLT_CMD |
> >> +                 XY_SRC_COPY_BLT_WRITE_ALPHA |
> >> +                 XY_SRC_COPY_BLT_WRITE_RGB;
> >> +       if (device_gen >= 8)
> >> +               batch[i - 1] |= 8;
> >> +       else
> >> +               batch[i - 1] |= 6;
> >> +
> >> +       if (device_gen >= 4 && src_tiling != I915_TILING_NONE) {
> >> +               src_pitch /= 4;
> >> +               batch[i - 1] |= XY_SRC_COPY_BLT_SRC_TILED;
> >> +       }
> >> +       if (device_gen >= 4 && dst_tiling != I915_TILING_NONE) {
> >> +               dst_pitch /= 4;
> >> +               batch[i - 1] |= XY_SRC_COPY_BLT_DST_TILED;
> >> +       }
> >> +
> >> +       batch[i++] = (3 << 24) | /* 32 bits */
> >> +                 (0xcc << 16) | /* copy ROP */
> >> +                 dst_pitch;
> >> +       batch[i++] = 0; /* dst x1,y1 */
> >> +       batch[i++] = (tile_height << 16) | tile_width; /* dst x2,y2 */
> >> +       batch[i++] = 0; /* dst reloc */
> >> +       if (device_gen >= 8)
> >> +               batch[i++] = 0;
> >> +       batch[i++] = 0; /* src x1,y1 */
> >> +       batch[i++] = src_pitch;
> >> +       batch[i++] = 0; /* src reloc */
> >> +       if (device_gen >= 8)
> >> +               batch[i++] = 0;
> >> +       batch[i++] = MI_BATCH_BUFFER_END;
> >> +       batch[i++] = MI_NOOP;
> >> +
> >> +       memset(reloc, 0, sizeof(reloc));
> >> +       reloc[0].target_handle = dst;
> >> +       reloc[0].delta = 0;
> >> +       reloc[0].offset = 4 * sizeof(batch[0]);
> >> +       reloc[0].presumed_offset = 0;
> >> +       reloc[0].read_domains = I915_GEM_DOMAIN_RENDER;
> >> +       reloc[0].write_domain = I915_GEM_DOMAIN_RENDER;
> >> +
> >> +       reloc[1].target_handle = src;
> >> +       reloc[1].delta = 0;
> >> +       reloc[1].offset = 7 * sizeof(batch[0]);
> >> +       if (device_gen >= 8)
> >> +               reloc[1].offset += sizeof(batch[0]);
> >> +       reloc[1].presumed_offset = 0;
> >> +       reloc[1].read_domains = I915_GEM_DOMAIN_RENDER;
> >> +       reloc[1].write_domain = 0;
> >
> >Hmm. Since the batch is fresh every time, relocations are not a gpu
> >stall. So horrible but not a test breaker. (Stalling for relocations
> >would very effectively nerf the test.)
> 
> I'm not sure I follow the comment here. Are you saying it would be
> better to keep track of the buffer offset after it has been "relocated"
> and then reuse that? Also, would it be better to reuse the batch buffer?
> I saw the same approach being taken in gem_linear_blits and thought it
> would suffice for testing.

The choice is either to reuse batches, and so avoid the penalty of
allocating and flushing fresh pages every time; or to force the system to
find room for another page every time. Both have their merits :)

Now, the problem I was referring to above is that since we are not
reusing batches, we cannot avoid relocations (as we don't know where the
buffers are). Relocations are slow, so best avoided; but worse, if you
relocate something busy, it may require a complete pipeline stall.
Nowadays (say the last few years), we use gpu relocations such that we
could just reuse one batch with the fixed copy and just modify the
addresses each time. Hmm, that might be interesting for a later variation.

So what I was worrying about was whether the relocations would stall
and, in doing so, make the test much more relaxed, as there would be no
in-flight requests making eviction hard.

> >> +       memset(obj, 0, sizeof(obj));
> >> +       obj[0].handle = dst;
> >> +       obj[1].handle = src;
> >> +       obj[2].handle = gem_create(fd, 4096);
> >> +       gem_write(fd, obj[2].handle, 0, batch, i * sizeof(batch[0]));
> >> +       obj[2].relocation_count = 2;
> >> +       obj[2].relocs_ptr = to_user_pointer(reloc);
> >> +
> >> +       memset(&exec, 0, sizeof(exec));
> >> +       exec.buffers_ptr = to_user_pointer(obj);
> >> +       exec.buffer_count = 3;
> >> +       exec.batch_len = i * sizeof(batch[0]);
> >> +       exec.flags = gem_has_blt(fd) ? I915_EXEC_BLT : 0;
> >> +
> >> +       gem_execbuf(fd, &exec);
> >> +       if (dst_tiling == I915_TILING_NONE)
> >> +               gem_sync(fd, obj[2].handle);
> >
> >Why is there a sync here? What bug are you trying to hide?
> 
> This is what I've been trying to figure out as well. I tried using
> gem_mmap__cpu() in check_bo() method like you mentioned. Without this
> gem_sync() the test fails regardless of gem_mmap__cpu() or
> gem_mmap__wc() in the check_bo() method. The data doesn't match the
> expected values and the assert in check_bo() method is triggered.

The issue is that gem_mmap__wc() is asynchronous and does not prepare
the data to be coherent. (Though since it's WC, it should be!) The trick
is that you have to tell the kernel you are about to access the pointer
through the WC domain, or manually do the gem_sync() + invalidation
yourself.
...

> >>  static void
> >> -check_bo(drm_intel_bo *bo, uint32_t val)
> >> +check_bo(int fd, uint32_t bo, uint32_t val)
> >>  {
> >> -       drm_intel_bo *linear_bo;
> >> +       uint32_t linear_bo;
> >>         uint32_t *linear;
> >>         int num_errors;
> >>         int i;
> >> +       const uint32_t buf_size = 1024 * 1024;
> >>
> >> -       linear_bo = drm_intel_bo_alloc(bufmgr, "linear dst", 1024 * 1024, 4096);
> >> +       linear_bo = gem_create(fd, buf_size);
> >>
> >> -       intel_copy_bo(batch, linear_bo, bo, width*height*4);
> >> +       copy(fd, linear_bo, I915_TILING_NONE, bo, I915_TILING_X);
> >>
> >> -       do_or_die(drm_intel_bo_map(linear_bo, 0));
> >> -       linear = linear_bo->virtual;
> >> +       linear = (uint32_t *)gem_mmap__wc(fd, linear_bo, 0, buf_size,
> >> +                                         PROT_READ);
> >
> >And this is the error above you were trying to hide.

So here, you need to have a
gem_set_domain(fd, linear_bo, I915_GEM_DOMAIN_WC, 0);
(replace DOMAIN_WC with the appropriate domain if you have to fall back).
-Chris

