[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

Tue Nov 26 20:35:38 CET 2013

Hi Brad,

On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.volkin at intel.com wrote:
> From: Brad Volkin <bradley.d.volkin at intel.com>
> 
> Certain OpenGL features (e.g. transform feedback, performance monitoring)
> require userspace code to submit batches containing commands such as
> MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
> generations of the hardware will noop these commands in "unsecure" batches
> (which includes all userspace batches submitted via i915) even though the
> commands may be safe and represent the intended programming model of the device.
> 
> This series introduces a software command parser similar in operation to the
> command parsing done in hardware for unsecure batches. However, the software
> parser allows some operations that would be noop'd by hardware, if the parser
> determines the operation is safe, and submits the batch as "secure" to prevent
> hardware parsing. Currently the series implements this on IVB and HSW.
> 
> The series is divided into several phases:
> 
> patches 01-09: These implement infrastructure and the command parsing algorithm,
>                all behind a module parameter. I expect some discussion and
> 	       rework, but hopefully there's nothing too controversial.
> patches 10-17: These define the checks performed by the parser.
>                I expect much discussion :)
> patches 18-20: In a final pass over the command checks, I found some issues with
>                the definitions. They looked painful to rebase in, so I've added
> 	       them here.
> patches 21-22: These enable the parser by default. It runs on all batches except
>                those that set the I915_EXEC_SECURE flag in the execbuffer2 call.

I think long-term we should even scan secure batches. We'd need to allow
some registers which only the drm master (i.e. owner of the display
hardware) is allowed to do, e.g. for scanline waits. But once we have that
we should be able to port all current users of secure batches over to
scanned batches and so enforce this everywhere by default.

The other issue is that igt tests assume to be able to run some evil
tests, so maybe we don't actually want this.

> There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
> basic and do not test all of the commands used by the parser on the assumption
> that I'm likely to make the same mistakes in both the parser and the test.

Yeah, I agree that just checking whether commands all go through (or not)
as expected adds very little value on top of the few tests you have done.
I think we should take a look at some corner cases which might trip up
your checker a bit though:
- I think we should check batchbuffer chaining and make sure it works on
  the vcs ring and not anywhere else (we can't ever break shipping libva
  which uses this).
- Some tests to trip up your parser should be done, like 3D commands that
  fall off the end of the batch bo. Or commands that span page boundaries.
  The later isn't an issue atm since you use vmap, but we should switch to
  per-page kmap since the vmap overhead is fairly horrible.

> I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
> days ago), and generally used an Ubuntu 13.10 IVB system with the parser
> running. Aside from a failure described below, I don't think there are any
> regressions. That is, piglit claims some regressions, but from manually running
> the tests I think these are false positives. However, I could use help in
> getting broader testing, particularly around performance. In general, I see less
> than 3% performance impact on HSW, with more like 10% impact for pathological
> batch sizes. But we'll certainly want to run relevant benchmarks beyond what
> I've done.

Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
through the checker and bypassing it (with EXEC_SECURE) would be really
good. Maybe even some variable-sized commands (all the state setup stuff
should be useful for that) to keep things interesting. Some variation is
also important to have some good cache thrasing going on (since your check
tables are fairly large I think).

> At this point there are a couple of known issues and potential improvements.
> 
> 1) VLV. The parser is currently disabled for VLV. One type of check performed by
>    the parser is that commands which access memory do so via PPGTT. VLV does not
>    have PPGTT enabled at this time. I chose to implement the PPGTT checks via
>    generic bit checking infrastructure in the parser, so they are not easily
>    disabled for VLV. For now, I'm disabling parsing altogether in the hope that
>    PPGTT can be enabled for VLV in the near future.

We need ppgtt for the parser anyway, since otherwise userspace can submit
a self-modifying batch. Checking for that is impossible, so allowing sw
checked batches without the ppgtt/ggtt split would be a decent security
hole.

> 2) Coherency. I've found two types of coherency issues when reading the batch
>    buffer from the CPU during execbuffer2. Looking for help with both issues.
>     i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
>        parser isn't properly waiting for the operation to complete before
>        parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
>        but that actually caused more failures.

This synchronization should happen when processing the relocations. The
batch itself isn't modified by the gpu, we simply upload it using the
blitter. So this going wrong indicates there's some issue somewhere ...

>    ii. Second, on VLV, I've seen cache coherency issues when userspace writes
>        the batch via pwrite fast path before calling execbuffer2. The parser
>        reads stale data. This works fine on IVB and HSW, so I believe it's an
>        LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
>        synchronization is for this scenario.

Imo we take a good look at the optimized buffer read/write code from
i915_gem_shmem_pread (for reading the userspace batch) and
i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
checking in-line with the reading this should also bring down the overhead
massively compared to the current solution (those shmem read/write
functions are fairly optimized).

> 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
>    in userspace batches without parsing them. The media driver uses 2nd-level
>    batches, so a solution is required. I have some ideas but don't want to delay
>    the review process for what I have so far. It may be that the 2nd-level
>    parsing is only needed for VCS and the current code (plus rejecting BBS)
>    would be sufficient for RCS.

Afaik only libva uses second-level batches, and only on the vcs. So I hope
we can just submit those as unpriviledged batches if possible. If that's
not possible it'll get fairly ugly I fear :(

> 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
>    to avoid GPU modifications to buffers via EUs or commands in the batch, we
>    should copy the userspace batch buffer to memory that userspace does not
>    have access to, map it into GGTT, and execute that batch buffer. I have a
>    sense of how to do this for 1st-level batches, but it would need changes to
>    tie in with the 2nd-level batch parsing I think, so I've again held off.

Yeah, we need the copying for otherwise the parsing is fairly pointless.
I've stumbled over some of your internally patches and had a quick look at
them. Two high-level comments:

- Using the existing active buffer lru instead of manual pinning would
  better integrate with the eviction code. For an example of in-kernel
  objects (and not userspace objects) using this look at the hw context
  code.
- Imo we should tag all buffers as purgeable while they're in the cache.
  That way the shrinker will automatically drop the backing storage if
  memory runs low (and thanks to the active list lru only when the gpu has
  stopped processing the batch). That way we can just keep on allocating
  buffers if they're all busy without any concern for running out of
  memory.

I'll try to read through your patch pile in the next few days, this is
just the very-very high-level stuff that came to mind immediately.

Cheers, Daniel
> 
> Brad Volkin (22):
>   drm/i915: Add data structures for command parser
>   drm/i915: Initial command parser table definitions
>   drm/i915: Hook command parser tables up to rings
>   drm/i915: Add per-ring command length decode functions
>   drm/i915: Implement command parsing
>   drm/i915: Add a HAS_CMD_PARSER getparam
>   drm/i915: Add support for rejecting commands during parsing
>   drm/i915: Add support for checking register accesses
>   drm/i915: Add support for rejecting commands via bitmasks
>   drm/i915: Reject unsafe commands
>   drm/i915: Add register whitelists for mesa
>   drm/i915: Enable register whitelist checks
>   drm/i915: Enable bit checking for some commands
>   drm/i915: Enable PPGTT command parser checks
>   drm/i915: Reject commands that would store to global HWS page
>   drm/i915: Reject additional commands
>   drm/i915: Add parser data for perf monitoring GL extensions
>   drm/i915: Reject MI_ARB_ON_OFF on VECS
>   drm/i915: Fix length handling for MFX_WAIT
>   drm/i915: Fix MI_STORE_DWORD_IMM parser defintion
>   drm/i915: Clean up command parser enable decision
>   drm/i915: Enable command parsing by default
> 
>  drivers/gpu/drm/i915/Makefile              |   3 +-
>  drivers/gpu/drm/i915/i915_cmd_parser.c     | 712 +++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_dma.c            |   3 +
>  drivers/gpu/drm/i915/i915_drv.c            |   5 +
>  drivers/gpu/drm/i915/i915_drv.h            |  96 ++++
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  15 +
>  drivers/gpu/drm/i915/i915_reg.h            |  66 +++
>  drivers/gpu/drm/i915/intel_ringbuffer.c    |   2 +
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |  25 +
>  include/uapi/drm/i915_drm.h                |   1 +
>  10 files changed, 927 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/gpu/drm/i915/i915_cmd_parser.c
> 
> -- 
> 1.8.4.4
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch