[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

Tue Nov 26 21:24:14 CET 2013

On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
> Hi Brad,
> 
> On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.volkin at intel.com wrote:
> > From: Brad Volkin <bradley.d.volkin at intel.com>
> > 
> > Certain OpenGL features (e.g. transform feedback, performance monitoring)
> > require userspace code to submit batches containing commands such as
> > MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
> > generations of the hardware will noop these commands in "unsecure" batches
> > (which includes all userspace batches submitted via i915) even though the
> > commands may be safe and represent the intended programming model of the device.
> > 
> > This series introduces a software command parser similar in operation to the
> > command parsing done in hardware for unsecure batches. However, the software
> > parser allows some operations that would be noop'd by hardware, if the parser
> > determines the operation is safe, and submits the batch as "secure" to prevent
> > hardware parsing. Currently the series implements this on IVB and HSW.
> > 
> > The series is divided into several phases:
> > 
> > patches 01-09: These implement infrastructure and the command parsing algorithm,
> >                all behind a module parameter. I expect some discussion and
> > 	       rework, but hopefully there's nothing too controversial.
> > patches 10-17: These define the checks performed by the parser.
> >                I expect much discussion :)
> > patches 18-20: In a final pass over the command checks, I found some issues with
> >                the definitions. They looked painful to rebase in, so I've added
> > 	       them here.
> > patches 21-22: These enable the parser by default. It runs on all batches except
> >                those that set the I915_EXEC_SECURE flag in the execbuffer2 call.
> 
> I think long-term we should even scan secure batches. We'd need to allow
> some registers which only the drm master (i.e. owner of the display
> hardware) is allowed to do, e.g. for scanline waits. But once we have that
> we should be able to port all current users of secure batches over to
> scanned batches and so enforce this everywhere by default.
> 
> The other issue is that igt tests assume to be able to run some evil
> tests, so maybe we don't actually want this.

Agreed. I thought we could handle this as a follow-up task once the basic stuff is
in place, particularly given that we'd want to modify at least some users to test.
I also wasn't sure if we would want the check to be root && master, as in the current
secure flag, or just master.

W.r.t. the tests, I suppose we can just turn checking on for secure batches and see
what happens.

> 
> > There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
> > basic and do not test all of the commands used by the parser on the assumption
> > that I'm likely to make the same mistakes in both the parser and the test.
> 
> Yeah, I agree that just checking whether commands all go through (or not)
> as expected adds very little value on top of the few tests you have done.
> I think we should take a look at some corner cases which might trip up
> your checker a bit though:
> - I think we should check batchbuffer chaining and make sure it works on
>   the vcs ring and not anywhere else (we can't ever break shipping libva
>   which uses this).
> - Some tests to trip up your parser should be done, like 3D commands that
>   fall off the end of the batch bo. Or commands that span page boundaries.
>   The later isn't an issue atm since you use vmap, but we should switch to
>   per-page kmap since the vmap overhead is fairly horrible.

Good suggestions. I'll look into these.

> 
> > I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
> > days ago), and generally used an Ubuntu 13.10 IVB system with the parser
> > running. Aside from a failure described below, I don't think there are any
> > regressions. That is, piglit claims some regressions, but from manually running
> > the tests I think these are false positives. However, I could use help in
> > getting broader testing, particularly around performance. In general, I see less
> > than 3% performance impact on HSW, with more like 10% impact for pathological
> > batch sizes. But we'll certainly want to run relevant benchmarks beyond what
> > I've done.
> 
> Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
> through the checker and bypassing it (with EXEC_SECURE) would be really
> good. Maybe even some variable-sized commands (all the state setup stuff
> should be useful for that) to keep things interesting. Some variation is
> also important to have some good cache thrasing going on (since your check
> tables are fairly large I think).

Ok. I'd be interested in some comment from the mesa and libva guys here for real world
workloads, but a microbenchmark would be a good start.

Which "state setup stuff" are you referring to? Something specific in i-g-t or something
more general?

> 
> > At this point there are a couple of known issues and potential improvements.
> > 
> > 1) VLV. The parser is currently disabled for VLV. One type of check performed by
> >    the parser is that commands which access memory do so via PPGTT. VLV does not
> >    have PPGTT enabled at this time. I chose to implement the PPGTT checks via
> >    generic bit checking infrastructure in the parser, so they are not easily
> >    disabled for VLV. For now, I'm disabling parsing altogether in the hope that
> >    PPGTT can be enabled for VLV in the near future.
> 
> We need ppgtt for the parser anyway, since otherwise userspace can submit
> a self-modifying batch. Checking for that is impossible, so allowing sw
> checked batches without the ppgtt/ggtt split would be a decent security
> hole.
> 
> > 2) Coherency. I've found two types of coherency issues when reading the batch
> >    buffer from the CPU during execbuffer2. Looking for help with both issues.
> >     i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
> >        parser isn't properly waiting for the operation to complete before
> >        parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
> >        but that actually caused more failures.
> 
> This synchronization should happen when processing the relocations. The
> batch itself isn't modified by the gpu, we simply upload it using the
> blitter. So this going wrong indicates there's some issue somewhere ...

Ok. I didn't debug too far. Putting a gem_sync() in the test between the upload and
exec fixed the issue. Since I wasn't doing any explicit synchronization I assumed it
was my issue.

> 
> 
> >    ii. Second, on VLV, I've seen cache coherency issues when userspace writes
> >        the batch via pwrite fast path before calling execbuffer2. The parser
> >        reads stale data. This works fine on IVB and HSW, so I believe it's an
> >        LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
> >        synchronization is for this scenario.
> 
> Imo we take a good look at the optimized buffer read/write code from
> i915_gem_shmem_pread (for reading the userspace batch) and
> i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
> checking in-line with the reading this should also bring down the overhead
> massively compared to the current solution (those shmem read/write
> functions are fairly optimized).

I'll take a look. So far I'm not seeing the vmap as a bottleneck and I'm a bit concerned
about the complexity of trying to do mapping/parsing per-page, but I agree it's worth a
more detailed analysis.

> 
> > 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
> >    in userspace batches without parsing them. The media driver uses 2nd-level
> >    batches, so a solution is required. I have some ideas but don't want to delay
> >    the review process for what I have so far. It may be that the 2nd-level
> >    parsing is only needed for VCS and the current code (plus rejecting BBS)
> >    would be sufficient for RCS.
> 
> Afaik only libva uses second-level batches, and only on the vcs. So I hope

I would be very happy if that was the case. I couldn't easily tell from reading the libva
code which ring they were submitting to. I definitely did not find uses in mesa or the ddx.

> we can just submit those as unpriviledged batches if possible. If that's
> not possible it'll get fairly ugly I fear :(
> 
> > 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
> >    to avoid GPU modifications to buffers via EUs or commands in the batch, we
> >    should copy the userspace batch buffer to memory that userspace does not
> >    have access to, map it into GGTT, and execute that batch buffer. I have a
> >    sense of how to do this for 1st-level batches, but it would need changes to
> >    tie in with the 2nd-level batch parsing I think, so I've again held off.
> 
> Yeah, we need the copying for otherwise the parsing is fairly pointless.
> I've stumbled over some of your internally patches and had a quick look at
> them. Two high-level comments:
> 
> - Using the existing active buffer lru instead of manual pinning would
>   better integrate with the eviction code. For an example of in-kernel
>   objects (and not userspace objects) using this look at the hw context
>   code.
> - Imo we should tag all buffers as purgeable while they're in the cache.
>   That way the shrinker will automatically drop the backing storage if
>   memory runs low (and thanks to the active list lru only when the gpu has
>   stopped processing the batch). That way we can just keep on allocating
>   buffers if they're all busy without any concern for running out of
>   memory.

Ok, I'll look at the hw context code for buffer mgmt. For "purgeable", just via the
madv field in the i915 gem object?

Also, there are a couple iterations of the work-in-progress patches. Do you prefer a
cache per ring or a single cache shared by all rings?

Thanks,
Brad

> 
> I'll try to read through your patch pile in the next few days, this is
> just the very-very high-level stuff that came to mind immediately.
> 
> Cheers, Daniel
> > 
> > Brad Volkin (22):
> >   drm/i915: Add data structures for command parser
> >   drm/i915: Initial command parser table definitions
> >   drm/i915: Hook command parser tables up to rings
> >   drm/i915: Add per-ring command length decode functions
> >   drm/i915: Implement command parsing
> >   drm/i915: Add a HAS_CMD_PARSER getparam
> >   drm/i915: Add support for rejecting commands during parsing
> >   drm/i915: Add support for checking register accesses
> >   drm/i915: Add support for rejecting commands via bitmasks
> >   drm/i915: Reject unsafe commands
> >   drm/i915: Add register whitelists for mesa
> >   drm/i915: Enable register whitelist checks
> >   drm/i915: Enable bit checking for some commands
> >   drm/i915: Enable PPGTT command parser checks
> >   drm/i915: Reject commands that would store to global HWS page
> >   drm/i915: Reject additional commands
> >   drm/i915: Add parser data for perf monitoring GL extensions
> >   drm/i915: Reject MI_ARB_ON_OFF on VECS
> >   drm/i915: Fix length handling for MFX_WAIT
> >   drm/i915: Fix MI_STORE_DWORD_IMM parser defintion
> >   drm/i915: Clean up command parser enable decision
> >   drm/i915: Enable command parsing by default
> > 
> >  drivers/gpu/drm/i915/Makefile              |   3 +-
> >  drivers/gpu/drm/i915/i915_cmd_parser.c     | 712 +++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_dma.c            |   3 +
> >  drivers/gpu/drm/i915/i915_drv.c            |   5 +
> >  drivers/gpu/drm/i915/i915_drv.h            |  96 ++++
> >  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  15 +
> >  drivers/gpu/drm/i915/i915_reg.h            |  66 +++
> >  drivers/gpu/drm/i915/intel_ringbuffer.c    |   2 +
> >  drivers/gpu/drm/i915/intel_ringbuffer.h    |  25 +
> >  include/uapi/drm/i915_drm.h                |   1 +
> >  10 files changed, 927 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/i915/i915_cmd_parser.c
> > 
> > -- 
> > 1.8.4.4
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch