[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

Wed Nov 27 02:32:32 CET 2013

On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote:
> On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
> > Hi Brad,
> > 
> > On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.volkin at intel.com wrote:
> > > From: Brad Volkin <bradley.d.volkin at intel.com>
> > > 
> > > Certain OpenGL features (e.g. transform feedback, performance monitoring)
> > > require userspace code to submit batches containing commands such as
> > > MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
> > > generations of the hardware will noop these commands in "unsecure" batches
> > > (which includes all userspace batches submitted via i915) even though the
> > > commands may be safe and represent the intended programming model of the device.
> > > 
> > > This series introduces a software command parser similar in operation to the
> > > command parsing done in hardware for unsecure batches. However, the software
> > > parser allows some operations that would be noop'd by hardware, if the parser
> > > determines the operation is safe, and submits the batch as "secure" to prevent
> > > hardware parsing. Currently the series implements this on IVB and HSW.
> > > 
> > > The series is divided into several phases:
> > > 
> > > patches 01-09: These implement infrastructure and the command parsing algorithm,
> > >                all behind a module parameter. I expect some discussion and
> > > 	       rework, but hopefully there's nothing too controversial.
> > > patches 10-17: These define the checks performed by the parser.
> > >                I expect much discussion :)
> > > patches 18-20: In a final pass over the command checks, I found some issues with
> > >                the definitions. They looked painful to rebase in, so I've added
> > > 	       them here.
> > > patches 21-22: These enable the parser by default. It runs on all batches except
> > >                those that set the I915_EXEC_SECURE flag in the execbuffer2 call.
> > 
> > I think long-term we should even scan secure batches. We'd need to allow
> > some registers which only the drm master (i.e. owner of the display
> > hardware) is allowed to do, e.g. for scanline waits. But once we have that
> > we should be able to port all current users of secure batches over to
> > scanned batches and so enforce this everywhere by default.
> > 
> > The other issue is that igt tests assume to be able to run some evil
> > tests, so maybe we don't actually want this.
> 
> Agreed. I thought we could handle this as a follow-up task once the basic stuff is
> in place, particularly given that we'd want to modify at least some users to test.
> I also wasn't sure if we would want the check to be root && master, as in the current
> secure flag, or just master.
> 
> W.r.t. the tests, I suppose we can just turn checking on for secure batches and see
> what happens.
> 
> > 
> > > There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
> > > basic and do not test all of the commands used by the parser on the assumption
> > > that I'm likely to make the same mistakes in both the parser and the test.
> > 
> > Yeah, I agree that just checking whether commands all go through (or not)
> > as expected adds very little value on top of the few tests you have done.
> > I think we should take a look at some corner cases which might trip up
> > your checker a bit though:
> > - I think we should check batchbuffer chaining and make sure it works on
> >   the vcs ring and not anywhere else (we can't ever break shipping libva
> >   which uses this).
> > - Some tests to trip up your parser should be done, like 3D commands that
> >   fall off the end of the batch bo. Or commands that span page boundaries.
> >   The later isn't an issue atm since you use vmap, but we should switch to
> >   per-page kmap since the vmap overhead is fairly horrible.
> 
> Good suggestions. I'll look into these.
Hi, Brad
      More inputs from libva about the batchbuffer chaining.

      Now the batchbuffer chaining is widely used in libva driver. This
is related with how the libva driver processes the image. For the
encoding purpose, it needs to be handled based on macroblock(16x16).And
every macroblock needs a group of GPU commands. So the GPU commands for
all the macroblocks will be constructed in the second-level batchbuffer.
The mode of batchbuffer chaining will bring the following benefits:
      a. The size of second-level batch buffer can be allocated based on
the size of handled image. For example: 1080p/720p/480p can use the
different size.
      b. The gpu commands in second-level batchbuffer can be constructed
by using GPU instead of CPU, which is helpful to improve the
performance. 

      At the same time both VCS and Render Ring are used in libva
driver. For example: The encoding will use VCS and RCS ring. Firstly the
RCS ring is used to execute GPU command for the motion vector/mode
prediction. And then the VCS Ring is used to execute the GPU command for
generating the bit-stream. So not only VCS ring uses the mode of
batchbuffer chaining, but also the Render Ring uses the mode of
batchbuffer chaining.
      
Thanks.
    Yakui
> 
> > 
> > > I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
> > > days ago), and generally used an Ubuntu 13.10 IVB system with the parser
> > > running. Aside from a failure described below, I don't think there are any
> > > regressions. That is, piglit claims some regressions, but from manually running
> > > the tests I think these are false positives. However, I could use help in
> > > getting broader testing, particularly around performance. In general, I see less
> > > than 3% performance impact on HSW, with more like 10% impact for pathological
> > > batch sizes. But we'll certainly want to run relevant benchmarks beyond what
> > > I've done.
> > 
> > Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
> > through the checker and bypassing it (with EXEC_SECURE) would be really
> > good. Maybe even some variable-sized commands (all the state setup stuff
> > should be useful for that) to keep things interesting. Some variation is
> > also important to have some good cache thrasing going on (since your check
> > tables are fairly large I think).
> 
> Ok. I'd be interested in some comment from the mesa and libva guys here for real world
> workloads, but a microbenchmark would be a good start.
> 
> Which "state setup stuff" are you referring to? Something specific in i-g-t or something
> more general?
> 
> > 
> > > At this point there are a couple of known issues and potential improvements.
> > > 
> > > 1) VLV. The parser is currently disabled for VLV. One type of check performed by
> > >    the parser is that commands which access memory do so via PPGTT. VLV does not
> > >    have PPGTT enabled at this time. I chose to implement the PPGTT checks via
> > >    generic bit checking infrastructure in the parser, so they are not easily
> > >    disabled for VLV. For now, I'm disabling parsing altogether in the hope that
> > >    PPGTT can be enabled for VLV in the near future.
> > 
> > We need ppgtt for the parser anyway, since otherwise userspace can submit
> > a self-modifying batch. Checking for that is impossible, so allowing sw
> > checked batches without the ppgtt/ggtt split would be a decent security
> > hole.
> > 
> > > 2) Coherency. I've found two types of coherency issues when reading the batch
> > >    buffer from the CPU during execbuffer2. Looking for help with both issues.
> > >     i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
> > >        parser isn't properly waiting for the operation to complete before
> > >        parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
> > >        but that actually caused more failures.
> > 
> > This synchronization should happen when processing the relocations. The
> > batch itself isn't modified by the gpu, we simply upload it using the
> > blitter. So this going wrong indicates there's some issue somewhere ...
> 
> Ok. I didn't debug too far. Putting a gem_sync() in the test between the upload and
> exec fixed the issue. Since I wasn't doing any explicit synchronization I assumed it
> was my issue.
> 
> > 
> > 
> > >    ii. Second, on VLV, I've seen cache coherency issues when userspace writes
> > >        the batch via pwrite fast path before calling execbuffer2. The parser
> > >        reads stale data. This works fine on IVB and HSW, so I believe it's an
> > >        LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
> > >        synchronization is for this scenario.
> > 
> > Imo we take a good look at the optimized buffer read/write code from
> > i915_gem_shmem_pread (for reading the userspace batch) and
> > i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
> > checking in-line with the reading this should also bring down the overhead
> > massively compared to the current solution (those shmem read/write
> > functions are fairly optimized).
> 
> I'll take a look. So far I'm not seeing the vmap as a bottleneck and I'm a bit concerned
> about the complexity of trying to do mapping/parsing per-page, but I agree it's worth a
> more detailed analysis.
> 
> > 
> > > 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
> > >    in userspace batches without parsing them. The media driver uses 2nd-level
> > >    batches, so a solution is required. I have some ideas but don't want to delay
> > >    the review process for what I have so far. It may be that the 2nd-level
> > >    parsing is only needed for VCS and the current code (plus rejecting BBS)
> > >    would be sufficient for RCS.
> > 
> > Afaik only libva uses second-level batches, and only on the vcs. So I hope
> 
> I would be very happy if that was the case. I couldn't easily tell from reading the libva
> code which ring they were submitting to. I definitely did not find uses in mesa or the ddx.
> 
> > we can just submit those as unpriviledged batches if possible. If that's
> > not possible it'll get fairly ugly I fear :(
> > 
> > > 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
> > >    to avoid GPU modifications to buffers via EUs or commands in the batch, we
> > >    should copy the userspace batch buffer to memory that userspace does not
> > >    have access to, map it into GGTT, and execute that batch buffer. I have a
> > >    sense of how to do this for 1st-level batches, but it would need changes to
> > >    tie in with the 2nd-level batch parsing I think, so I've again held off.
> > 
> > Yeah, we need the copying for otherwise the parsing is fairly pointless.
> > I've stumbled over some of your internally patches and had a quick look at
> > them. Two high-level comments:
> > 
> > - Using the existing active buffer lru instead of manual pinning would
> >   better integrate with the eviction code. For an example of in-kernel
> >   objects (and not userspace objects) using this look at the hw context
> >   code.
> > - Imo we should tag all buffers as purgeable while they're in the cache.
> >   That way the shrinker will automatically drop the backing storage if
> >   memory runs low (and thanks to the active list lru only when the gpu has
> >   stopped processing the batch). That way we can just keep on allocating
> >   buffers if they're all busy without any concern for running out of
> >   memory.
> 
> Ok, I'll look at the hw context code for buffer mgmt. For "purgeable", just via the
> madv field in the i915 gem object?
> 
> Also, there are a couple iterations of the work-in-progress patches. Do you prefer a
> cache per ring or a single cache shared by all rings?
> 
> Thanks,
> Brad
> 
> > 
> > I'll try to read through your patch pile in the next few days, this is
> > just the very-very high-level stuff that came to mind immediately.
> > 
> > Cheers, Daniel
> > > 
> > > Brad Volkin (22):
> > >   drm/i915: Add data structures for command parser
> > >   drm/i915: Initial command parser table definitions
> > >   drm/i915: Hook command parser tables up to rings
> > >   drm/i915: Add per-ring command length decode functions
> > >   drm/i915: Implement command parsing
> > >   drm/i915: Add a HAS_CMD_PARSER getparam
> > >   drm/i915: Add support for rejecting commands during parsing
> > >   drm/i915: Add support for checking register accesses
> > >   drm/i915: Add support for rejecting commands via bitmasks
> > >   drm/i915: Reject unsafe commands
> > >   drm/i915: Add register whitelists for mesa
> > >   drm/i915: Enable register whitelist checks
> > >   drm/i915: Enable bit checking for some commands
> > >   drm/i915: Enable PPGTT command parser checks
> > >   drm/i915: Reject commands that would store to global HWS page
> > >   drm/i915: Reject additional commands
> > >   drm/i915: Add parser data for perf monitoring GL extensions
> > >   drm/i915: Reject MI_ARB_ON_OFF on VECS
> > >   drm/i915: Fix length handling for MFX_WAIT
> > >   drm/i915: Fix MI_STORE_DWORD_IMM parser defintion
> > >   drm/i915: Clean up command parser enable decision
> > >   drm/i915: Enable command parsing by default
> > > 
> > >  drivers/gpu/drm/i915/Makefile              |   3 +-
> > >  drivers/gpu/drm/i915/i915_cmd_parser.c     | 712 +++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/i915/i915_dma.c            |   3 +
> > >  drivers/gpu/drm/i915/i915_drv.c            |   5 +
> > >  drivers/gpu/drm/i915/i915_drv.h            |  96 ++++
> > >  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  15 +
> > >  drivers/gpu/drm/i915/i915_reg.h            |  66 +++
> > >  drivers/gpu/drm/i915/intel_ringbuffer.c    |   2 +
> > >  drivers/gpu/drm/i915/intel_ringbuffer.h    |  25 +
> > >  include/uapi/drm/i915_drm.h                |   1 +
> > >  10 files changed, 927 insertions(+), 1 deletion(-)
> > >  create mode 100644 drivers/gpu/drm/i915/i915_cmd_parser.c
> > > 
> > > -- 
> > > 1.8.4.4
> > > 
> > > _______________________________________________
> > > Intel-gfx mailing list
> > > Intel-gfx at lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx