[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
Volkin, Bradley D
bradley.d.volkin at intel.com
Tue Nov 26 21:24:14 CET 2013
On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
> Hi Brad,
>
> On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.volkin at intel.com wrote:
> > From: Brad Volkin <bradley.d.volkin at intel.com>
> >
> > Certain OpenGL features (e.g. transform feedback, performance monitoring)
> > require userspace code to submit batches containing commands such as
> > MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
> > generations of the hardware will noop these commands in "unsecure" batches
> > (which includes all userspace batches submitted via i915) even though the
> > commands may be safe and represent the intended programming model of the device.
> >
> > This series introduces a software command parser similar in operation to the
> > command parsing done in hardware for unsecure batches. However, the software
> > parser allows some operations that would be noop'd by hardware, if the parser
> > determines the operation is safe, and submits the batch as "secure" to prevent
> > hardware parsing. Currently the series implements this on IVB and HSW.
> >
> > The series is divided into several phases:
> >
> > patches 01-09: These implement infrastructure and the command parsing algorithm,
> > all behind a module parameter. I expect some discussion and
> > rework, but hopefully there's nothing too controversial.
> > patches 10-17: These define the checks performed by the parser.
> > I expect much discussion :)
> > patches 18-20: In a final pass over the command checks, I found some issues with
> > the definitions. They looked painful to rebase in, so I've added
> > them here.
> > patches 21-22: These enable the parser by default. It runs on all batches except
> > those that set the I915_EXEC_SECURE flag in the execbuffer2 call.
>
> I think long-term we should even scan secure batches. We'd need to allow
> some registers which only the drm master (i.e. owner of the display
> hardware) is allowed to do, e.g. for scanline waits. But once we have that
> we should be able to port all current users of secure batches over to
> scanned batches and so enforce this everywhere by default.
>
> The other issue is that igt tests assume to be able to run some evil
> tests, so maybe we don't actually want this.
Agreed. I thought we could handle this as a follow-up task once the basic stuff is
in place, particularly given that we'd want to modify at least some users to test.
I also wasn't sure if we would want the check to be root && master, as in the current
secure flag, or just master.
W.r.t. the tests, I suppose we can just turn checking on for secure batches and see
what happens.
>
> > There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
> > basic and do not test all of the commands used by the parser on the assumption
> > that I'm likely to make the same mistakes in both the parser and the test.
>
> Yeah, I agree that just checking whether commands all go through (or not)
> as expected adds very little value on top of the few tests you have done.
> I think we should take a look at some corner cases which might trip up
> your checker a bit though:
> - I think we should check batchbuffer chaining and make sure it works on
> the vcs ring and not anywhere else (we can't ever break shipping libva
> which uses this).
> - Some tests to trip up your parser should be done, like 3D commands that
> fall off the end of the batch bo. Or commands that span page boundaries.
> The later isn't an issue atm since you use vmap, but we should switch to
> per-page kmap since the vmap overhead is fairly horrible.
Good suggestions. I'll look into these.
>
> > I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
> > days ago), and generally used an Ubuntu 13.10 IVB system with the parser
> > running. Aside from a failure described below, I don't think there are any
> > regressions. That is, piglit claims some regressions, but from manually running
> > the tests I think these are false positives. However, I could use help in
> > getting broader testing, particularly around performance. In general, I see less
> > than 3% performance impact on HSW, with more like 10% impact for pathological
> > batch sizes. But we'll certainly want to run relevant benchmarks beyond what
> > I've done.
>
> Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
> through the checker and bypassing it (with EXEC_SECURE) would be really
> good. Maybe even some variable-sized commands (all the state setup stuff
> should be useful for that) to keep things interesting. Some variation is
> also important to have some good cache thrasing going on (since your check
> tables are fairly large I think).
Ok. I'd be interested in some comment from the mesa and libva guys here for real world
workloads, but a microbenchmark would be a good start.
Which "state setup stuff" are you referring to? Something specific in i-g-t or something
more general?
>
> > At this point there are a couple of known issues and potential improvements.
> >
> > 1) VLV. The parser is currently disabled for VLV. One type of check performed by
> > the parser is that commands which access memory do so via PPGTT. VLV does not
> > have PPGTT enabled at this time. I chose to implement the PPGTT checks via
> > generic bit checking infrastructure in the parser, so they are not easily
> > disabled for VLV. For now, I'm disabling parsing altogether in the hope that
> > PPGTT can be enabled for VLV in the near future.
>
> We need ppgtt for the parser anyway, since otherwise userspace can submit
> a self-modifying batch. Checking for that is impossible, so allowing sw
> checked batches without the ppgtt/ggtt split would be a decent security
> hole.
>
> > 2) Coherency. I've found two types of coherency issues when reading the batch
> > buffer from the CPU during execbuffer2. Looking for help with both issues.
> > i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
> > parser isn't properly waiting for the operation to complete before
> > parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
> > but that actually caused more failures.
>
> This synchronization should happen when processing the relocations. The
> batch itself isn't modified by the gpu, we simply upload it using the
> blitter. So this going wrong indicates there's some issue somewhere ...
Ok. I didn't debug too far. Putting a gem_sync() in the test between the upload and
exec fixed the issue. Since I wasn't doing any explicit synchronization I assumed it
was my issue.
>
>
> > ii. Second, on VLV, I've seen cache coherency issues when userspace writes
> > the batch via pwrite fast path before calling execbuffer2. The parser
> > reads stale data. This works fine on IVB and HSW, so I believe it's an
> > LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
> > synchronization is for this scenario.
>
> Imo we take a good look at the optimized buffer read/write code from
> i915_gem_shmem_pread (for reading the userspace batch) and
> i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
> checking in-line with the reading this should also bring down the overhead
> massively compared to the current solution (those shmem read/write
> functions are fairly optimized).
I'll take a look. So far I'm not seeing the vmap as a bottleneck and I'm a bit concerned
about the complexity of trying to do mapping/parsing per-page, but I agree it's worth a
more detailed analysis.
>
> > 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
> > in userspace batches without parsing them. The media driver uses 2nd-level
> > batches, so a solution is required. I have some ideas but don't want to delay
> > the review process for what I have so far. It may be that the 2nd-level
> > parsing is only needed for VCS and the current code (plus rejecting BBS)
> > would be sufficient for RCS.
>
> Afaik only libva uses second-level batches, and only on the vcs. So I hope
I would be very happy if that was the case. I couldn't easily tell from reading the libva
code which ring they were submitting to. I definitely did not find uses in mesa or the ddx.
> we can just submit those as unpriviledged batches if possible. If that's
> not possible it'll get fairly ugly I fear :(
>
> > 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
> > to avoid GPU modifications to buffers via EUs or commands in the batch, we
> > should copy the userspace batch buffer to memory that userspace does not
> > have access to, map it into GGTT, and execute that batch buffer. I have a
> > sense of how to do this for 1st-level batches, but it would need changes to
> > tie in with the 2nd-level batch parsing I think, so I've again held off.
>
> Yeah, we need the copying for otherwise the parsing is fairly pointless.
> I've stumbled over some of your internally patches and had a quick look at
> them. Two high-level comments:
>
> - Using the existing active buffer lru instead of manual pinning would
> better integrate with the eviction code. For an example of in-kernel
> objects (and not userspace objects) using this look at the hw context
> code.
> - Imo we should tag all buffers as purgeable while they're in the cache.
> That way the shrinker will automatically drop the backing storage if
> memory runs low (and thanks to the active list lru only when the gpu has
> stopped processing the batch). That way we can just keep on allocating
> buffers if they're all busy without any concern for running out of
> memory.
Ok, I'll look at the hw context code for buffer mgmt. For "purgeable", just via the
madv field in the i915 gem object?
Also, there are a couple iterations of the work-in-progress patches. Do you prefer a
cache per ring or a single cache shared by all rings?
Thanks,
Brad
>
> I'll try to read through your patch pile in the next few days, this is
> just the very-very high-level stuff that came to mind immediately.
>
> Cheers, Daniel
> >
> > Brad Volkin (22):
> > drm/i915: Add data structures for command parser
> > drm/i915: Initial command parser table definitions
> > drm/i915: Hook command parser tables up to rings
> > drm/i915: Add per-ring command length decode functions
> > drm/i915: Implement command parsing
> > drm/i915: Add a HAS_CMD_PARSER getparam
> > drm/i915: Add support for rejecting commands during parsing
> > drm/i915: Add support for checking register accesses
> > drm/i915: Add support for rejecting commands via bitmasks
> > drm/i915: Reject unsafe commands
> > drm/i915: Add register whitelists for mesa
> > drm/i915: Enable register whitelist checks
> > drm/i915: Enable bit checking for some commands
> > drm/i915: Enable PPGTT command parser checks
> > drm/i915: Reject commands that would store to global HWS page
> > drm/i915: Reject additional commands
> > drm/i915: Add parser data for perf monitoring GL extensions
> > drm/i915: Reject MI_ARB_ON_OFF on VECS
> > drm/i915: Fix length handling for MFX_WAIT
> > drm/i915: Fix MI_STORE_DWORD_IMM parser defintion
> > drm/i915: Clean up command parser enable decision
> > drm/i915: Enable command parsing by default
> >
> > drivers/gpu/drm/i915/Makefile | 3 +-
> > drivers/gpu/drm/i915/i915_cmd_parser.c | 712 +++++++++++++++++++++++++++++
> > drivers/gpu/drm/i915/i915_dma.c | 3 +
> > drivers/gpu/drm/i915/i915_drv.c | 5 +
> > drivers/gpu/drm/i915/i915_drv.h | 96 ++++
> > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 15 +
> > drivers/gpu/drm/i915/i915_reg.h | 66 +++
> > drivers/gpu/drm/i915/intel_ringbuffer.c | 2 +
> > drivers/gpu/drm/i915/intel_ringbuffer.h | 25 +
> > include/uapi/drm/i915_drm.h | 1 +
> > 10 files changed, 927 insertions(+), 1 deletion(-)
> > create mode 100644 drivers/gpu/drm/i915/i915_cmd_parser.c
> >
> > --
> > 1.8.4.4
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
More information about the Intel-gfx
mailing list