[Intel-gfx] [RFC 2/4] drm/i915: Use batch pools with the command parser
Chris Wilson
chris at chris-wilson.co.uk
Wed Jun 18 20:11:46 CEST 2014
On Wed, Jun 18, 2014 at 10:49:05AM -0700, Volkin, Bradley D wrote:
> On Wed, Jun 18, 2014 at 09:52:52AM -0700, Chris Wilson wrote:
> > On Wed, Jun 18, 2014 at 09:36:14AM -0700, bradley.d.volkin at intel.com wrote:
> > > From: Brad Volkin <bradley.d.volkin at intel.com>
> > >
> > > This patch sets up all of the tracking and copying necessary to
> > > use batch pools with the command parser, but does not actually
> > > dispatch the copied (shadow) batch to the hardware yet. We still
> > > aren't quite ready to set the secure bit during dispatch.
> > >
> > > Note that performance takes a hit from the copy in some cases
> > > and will likely need some work. At a rough pass, the memcpy
> > > appears to be the bottleneck. Without having done a deeper
> > > analysis, two ideas that come to mind are:
> > > 1) Copy sections of the batch at a time, as they are reached
> > > by parsing. Might improve cache locality.
> > > 2) Copy only up to the userspace-supplied batch length and
> > > memset the rest of the buffer. Reduces the number of reads.
> > >
> > > Signed-off-by: Brad Volkin <bradley.d.volkin at intel.com>
> > > ---
> > > drivers/gpu/drm/i915/i915_cmd_parser.c | 75 ++++++++++++++++++++++------
> > > drivers/gpu/drm/i915/i915_drv.h | 7 ++-
> > > drivers/gpu/drm/i915/i915_gem.c | 8 ++-
> > > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 45 +++++++++++++++--
> > > drivers/gpu/drm/i915/i915_gem_render_state.c | 2 +-
> > > drivers/gpu/drm/i915/intel_ringbuffer.c | 12 +++++
> > > drivers/gpu/drm/i915/intel_ringbuffer.h | 7 +++
> > > 7 files changed, 134 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > index dea99d9..669afb0 100644
> > > --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > @@ -831,6 +831,53 @@ finish:
> > > return (u32*)addr;
> > > }
> > >
> > > +/* Returns a vmap'd pointer to dest_obj, which the caller must unmap */
> > > +static u32 *copy_batch(struct drm_i915_gem_object *dest_obj,
> > > + struct drm_i915_gem_object *src_obj)
> > > +{
> > > + int ret = 0;
> > > + int needs_clflush = 0;
> > > + u32 *src_addr, *dest_addr = NULL;
> > > +
> > > + ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
> > > + if (ret) {
> > > + DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > > + return ERR_PTR(ret);
> > > + }
> > > +
> > > + src_addr = vmap_batch(src_obj);
> > > + if (!src_addr) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > > + ret = -ENOMEM;
> > > + goto unpin_src;
> > > + }
> > > +
> > > + if (needs_clflush)
> > > + drm_clflush_virt_range((char *)src_addr, src_obj->base.size);
> > > +
> > > + ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
> > > + if (ret) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to set batch CPU domain\n");
> > > + goto unmap_src;
> > > + }
> > > +
> > > + dest_addr = vmap_batch(dest_obj);
> > > + if (!dest_addr) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to vmap shadow batch\n");
> > > + ret = -ENOMEM;
> > > + goto unmap_src;
> > > + }
> > > +
> > > + memcpy(dest_addr, src_addr, src_obj->base.size);
> > > +
> > > +unmap_src:
> > > + vunmap(src_addr);
> > > +unpin_src:
> > > + i915_gem_object_unpin_pages(src_obj);
> > > +
> > > + return ret ? ERR_PTR(ret) : dest_addr;
> > > +}
> >
> > src_obj->size? You should perhaps borrow a byt.
>
> Not sure I completely follow you here. The dest_obj is as big or bigger than
> the source, so I think copying only as much data as exists in the source
> object is right. I should be writing nops into any extra space though. And in
> parse_cmds, I should update the batch_end calculation. Were those the issues
> that you saw?
Just that generally src->size >> batch len, and the clflush will make it
twice as expensive on byt. (The object may be about 1000 times larger
than the batch at the extreme, libva *cough*.)
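For illustration, a rough sketch of what copying only the user-supplied
range could look like (not the actual patch; batch_start_offset/batch_len
are assumed to be plumbed through from the execbuffer args, everything
else mirrors copy_batch() above):

static u32 *copy_batch_range(struct drm_i915_gem_object *dest_obj,
			     struct drm_i915_gem_object *src_obj,
			     u32 batch_start_offset, u32 batch_len)
{
	int ret = 0;
	int needs_clflush = 0;
	u32 *src_addr, *dest_addr = NULL;

	ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
	if (ret)
		return ERR_PTR(ret);

	src_addr = vmap_batch(src_obj);
	if (!src_addr) {
		ret = -ENOMEM;
		goto unpin_src;
	}

	/* Flush only the bytes the parser will actually read. */
	if (needs_clflush)
		drm_clflush_virt_range((char *)src_addr + batch_start_offset,
				       batch_len);

	ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
	if (ret)
		goto unmap_src;

	dest_addr = vmap_batch(dest_obj);
	if (!dest_addr) {
		ret = -ENOMEM;
		goto unmap_src;
	}

	/* Copy the real batch, then pad the rest of the shadow.
	 * MI_NOOP is 0, so a plain zero-fill is sufficient. */
	memcpy(dest_addr, (char *)src_addr + batch_start_offset, batch_len);
	memset((char *)dest_addr + batch_len, 0,
	       dest_obj->base.size - batch_len);

unmap_src:
	vunmap(src_addr);
unpin_src:
	i915_gem_object_unpin_pages(src_obj);

	return ret ? ERR_PTR(ret) : dest_addr;
}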
> > > static int
> > > @@ -1087,6 +1088,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > struct drm_i915_private *dev_priv = dev->dev_private;
> > > struct eb_vmas *eb;
> > > struct drm_i915_gem_object *batch_obj;
> > > + struct drm_i915_gem_object *shadow_batch_obj = NULL;
> > > struct drm_clip_rect *cliprects = NULL;
> > > struct intel_engine_cs *ring;
> > > struct intel_context *ctx;
> > > @@ -1288,8 +1290,31 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
> > >
> > > if (i915_needs_cmd_parser(ring)) {
> > > + shadow_batch_obj =
> > > + i915_gem_batch_pool_get(ring->batch_pool,
> > > + batch_obj->base.size);
> > > + if (IS_ERR(shadow_batch_obj)) {
> > > + ret = PTR_ERR(shadow_batch_obj);
> > > + /* Don't try to clean up the obj in the error path */
> > > + shadow_batch_obj = NULL;
> > > +
> > > + /*
> > > + * If the pool is at capacity, retiring requests might
> > > + * return some buffers.
> > > + */
> > > + if (ret == -EAGAIN)
> > > + i915_gem_retire_requests_ring(ring);
> > > +
> > > + goto err;
> > > + }
> > > +
> > > + ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
> > > + if (ret)
> > > + goto err;
> > > +
> > > ret = i915_parse_cmds(ring,
> > > batch_obj,
> > > + shadow_batch_obj,
> > > args->batch_start_offset,
> > > file->is_master);
> >
> > You could just allocate the shadow inside parse_cmds and return it. Then
> > replace the conventional batch_obj with the new pointer and add it to
> > the eb list. Similarly, you do not need to track shadow_batch_obj
> > explicitly in the requests, the pool can do its own busy tracking
> > external to the core and reduce the invasiveness of the patch.
>
> Overall, I agree that this touched more places than I would have liked.
>
> I initially thought about replacing batch_obj and hooking into the eb list,
> but that seemed like it would involve trickier code w.r.t. ref counting,
> making up an exec_entry for the vma, etc. Not the end of the world, but
> it felt less obvious. I can look again though if you think it's worth it.
I think so. The request is already complicated enough and I think the
shadow batch can fit neatly inside the rules we already have with a
modicum of stitching here.
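Roughly, something like the sketch below on the execbuffer side -- with
i915_parse_cmds() hypothetically changed to allocate the shadow from the
pool and return it; the eb-list/exec_entry stitching you mention is only
noted in the comments:

	if (i915_needs_cmd_parser(ring)) {
		struct drm_i915_gem_object *shadow;

		/* Parser pulls a buffer from the pool, copies and
		 * validates the client batch into it, and hands back
		 * the copy. */
		shadow = i915_parse_cmds(ring, batch_obj,
					 args->batch_start_offset,
					 file->is_master);
		if (IS_ERR(shadow)) {
			ret = PTR_ERR(shadow);
			goto err;
		}

		/* Dispatch the validated copy instead of the client's
		 * batch. The real code would also add the shadow's vma
		 * to eb->vmas with its own exec_entry and reference. */
		batch_obj = shadow;
	}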
> What specifically are you thinking in terms of implementing busy tracking
> in the pool? The idea with adding the shadow object to the request was just
> to get it back in the pool and purgeable asap. I also thought it would limit
> some additional code given that we know buffers in the pool have had any
> pending work completed. Maybe the suggested approach would do a better job
> of those things though.
I was thinking that a pool is basically a list of bo, and you simply
query whether the oldest is still busy when we need a new bo. Which is
the same as how userspace implements its pool of active/inactive objects.
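Something like the sketch below, purely illustrative (the list/active
fields and the pool layout are assumptions, not the pool API from this
series):

static struct drm_i915_gem_object *
batch_pool_get(struct i915_gem_batch_pool *pool, size_t size)
{
	struct drm_i915_gem_object *obj;

	/* Oldest buffer sits at the head of the cache list. */
	obj = list_first_entry_or_null(&pool->cache_list,
				       struct drm_i915_gem_object,
				       batch_pool_list);
	if (obj && !obj->active && obj->base.size >= size) {
		/* GPU is done with it and it is big enough: recycle,
		 * moving it to the tail so the list stays in LRU order. */
		list_move_tail(&obj->batch_pool_list, &pool->cache_list);
		return obj;
	}

	/* Otherwise fall back to a fresh allocation. */
	return i915_gem_alloc_object(pool->dev, size);
}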
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > index e72017b..82aae29 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > @@ -188,6 +188,13 @@ struct intel_engine_cs {
> > > bool needs_cmd_parser;
> > >
> > > /*
> > > + * A pool of objects to use as shadow copies of client batch buffers
> > > + * when the command parser is enabled. Prevents the client from
> > > + * modifying the batch contents after software parsing.
> > > + */
> > > + struct i915_gem_batch_pool *batch_pool;
> >
> > Why are they tied to a ring?
>
> There was a question as to whether to make a global pool or per-ring
> pools way back when I first sent the parser series. Daniel expressed
> a slight preference for per-ring, but I don't object to a global pool.
> I suppose a potential benefit to per-ring would be if userspace uses
> different sized buffers on different rings, we'd more easily find a
> suitable buffer. But enhancing the pool management could fix that too.
Batch buffer sizes are not fixed per ring anyway. I guess Daniel thought
it might work out easier, but memory is a global resource and should be
tracked primarily at the device level. Userspace segregates its caches
based on size to speed up searches.
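For reference, the userspace-style trick is just one free list per size
class so a lookup only scans buffers that could satisfy the request,
e.g. (illustrative only, not proposing this exact helper):

	struct batch_pool {
		struct list_head buckets[16];	/* 4KiB << n size classes */
	};

	static struct list_head *
	pool_bucket(struct batch_pool *pool, size_t size)
	{
		int n = fls64(DIV_ROUND_UP(size, 4096)) - 1;

		if (size > (4096ULL << n))
			n++;	/* round up to the next size class */
		return &pool->buckets[min(n, 15)];
	}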
-Chris
--
Chris Wilson, Intel Open Source Technology Centre