[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

Wed Dec 11 01:58:18 CET 2013

[snip]

On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
> > 2) Coherency. I've found two types of coherency issues when reading the batch
> >    buffer from the CPU during execbuffer2. Looking for help with both issues.
> >     i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
> >        parser isn't properly waiting for the operation to complete before
> >        parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
> >        but that actually caused more failures.
> 
> This synchronization should happen when processing the relocations. The
> batch itself isn't modified by the gpu, we simply upload it using the
> blitter. So this going wrong indicates there's some issue somewhere ...

I looked at this again. The blitter copy uploads a batch full of noops and there
are no relocations associated with the failing batch, as far as I can tell. I
suspect that would explain why relocation handling wouldn't catch this. In any case,
I copied the relevant setup code from shmem pread, and it resolved the issue.

> 
> 
> >    ii. Second, on VLV, I've seen cache coherency issues when userspace writes
> >        the batch via pwrite fast path before calling execbuffer2. The parser
> >        reads stale data. This works fine on IVB and HSW, so I believe it's an
> >        LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
> >        synchronization is for this scenario.
> 
> Imo we take a good look at the optimized buffer read/write code from
> i915_gem_shmem_pread (for reading the userspace batch) and
> i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
> checking in-line with the reading this should also bring down the overhead
> massively compared to the current solution (those shmem read/write
> functions are fairly optimized).

So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, and in the
tests I'm running, it's worse than the vmap approach. This is still without the batch
copy, but I think it's relevant anyhow. I haven't done much in the way of analysis as to
why we're getting these results. At a high level, the vmap approach leads to simple
code with a few function calls and control structures. The per-page approach has
somewhat more complex logic around mapping the next page at the right time and checking
for commands that cross a page boundary. I'd guess that difference factors into it.

Due to the risk of regressions, I think it would be better not to delay getting
broader functional and performance testing by waiting to sort this out. I'd rather
stick with vmap for now and revisit that overhead once we have more concrete
performance data to work from.

I'll propose that I send an updated series later this week or next that addresses
feedback from the review, including better handling for secure and chained batches,
the sync fixes for gem_cpu_reloc, some of the additional tests you suggested,
and possibly an attempt at batch copy. If that goes well, could we see about
getting the patches into the hands of your QA team for further testing?

Brad

> 
> > 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
> >    in userspace batches without parsing them. The media driver uses 2nd-level
> >    batches, so a solution is required. I have some ideas but don't want to delay
> >    the review process for what I have so far. It may be that the 2nd-level
> >    parsing is only needed for VCS and the current code (plus rejecting BBS)
> >    would be sufficient for RCS.
> 
> Afaik only libva uses second-level batches, and only on the vcs. So I hope
> we can just submit those as unpriviledged batches if possible. If that's
> not possible it'll get fairly ugly I fear :(
> 
> > 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
> >    to avoid GPU modifications to buffers via EUs or commands in the batch, we
> >    should copy the userspace batch buffer to memory that userspace does not
> >    have access to, map it into GGTT, and execute that batch buffer. I have a
> >    sense of how to do this for 1st-level batches, but it would need changes to
> >    tie in with the 2nd-level batch parsing I think, so I've again held off.
> 
> Yeah, we need the copying for otherwise the parsing is fairly pointless.
> I've stumbled over some of your internally patches and had a quick look at
> them. Two high-level comments:
> 
> - Using the existing active buffer lru instead of manual pinning would
>   better integrate with the eviction code. For an example of in-kernel
>   objects (and not userspace objects) using this look at the hw context
>   code.
> - Imo we should tag all buffers as purgeable while they're in the cache.
>   That way the shrinker will automatically drop the backing storage if
>   memory runs low (and thanks to the active list lru only when the gpu has
>   stopped processing the batch). That way we can just keep on allocating
>   buffers if they're all busy without any concern for running out of
>   memory.
> 
> I'll try to read through your patch pile in the next few days, this is
> just the very-very high-level stuff that came to mind immediately.
> 
> Cheers, Daniel
> > 
> > Brad Volkin (22):
> >   drm/i915: Add data structures for command parser
> >   drm/i915: Initial command parser table definitions
> >   drm/i915: Hook command parser tables up to rings
> >   drm/i915: Add per-ring command length decode functions
> >   drm/i915: Implement command parsing
> >   drm/i915: Add a HAS_CMD_PARSER getparam
> >   drm/i915: Add support for rejecting commands during parsing
> >   drm/i915: Add support for checking register accesses
> >   drm/i915: Add support for rejecting commands via bitmasks
> >   drm/i915: Reject unsafe commands
> >   drm/i915: Add register whitelists for mesa
> >   drm/i915: Enable register whitelist checks
> >   drm/i915: Enable bit checking for some commands
> >   drm/i915: Enable PPGTT command parser checks
> >   drm/i915: Reject commands that would store to global HWS page
> >   drm/i915: Reject additional commands
> >   drm/i915: Add parser data for perf monitoring GL extensions
> >   drm/i915: Reject MI_ARB_ON_OFF on VECS
> >   drm/i915: Fix length handling for MFX_WAIT
> >   drm/i915: Fix MI_STORE_DWORD_IMM parser defintion
> >   drm/i915: Clean up command parser enable decision
> >   drm/i915: Enable command parsing by default
> > 
> >  drivers/gpu/drm/i915/Makefile              |   3 +-
> >  drivers/gpu/drm/i915/i915_cmd_parser.c     | 712 +++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_dma.c            |   3 +
> >  drivers/gpu/drm/i915/i915_drv.c            |   5 +
> >  drivers/gpu/drm/i915/i915_drv.h            |  96 ++++
> >  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  15 +
> >  drivers/gpu/drm/i915/i915_reg.h            |  66 +++
> >  drivers/gpu/drm/i915/intel_ringbuffer.c    |   2 +
> >  drivers/gpu/drm/i915/intel_ringbuffer.h    |  25 +
> >  include/uapi/drm/i915_drm.h                |   1 +
> >  10 files changed, 927 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/i915/i915_cmd_parser.c
> > 
> > -- 
> > 1.8.4.4
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch