[Intel-gfx] [RFC 2/4] drm/i915: Use batch pools with the command parser
Chris Wilson
chris at chris-wilson.co.uk
Wed Jun 18 20:11:46 CEST 2014
On Wed, Jun 18, 2014 at 10:49:05AM -0700, Volkin, Bradley D wrote:
> On Wed, Jun 18, 2014 at 09:52:52AM -0700, Chris Wilson wrote:
> > On Wed, Jun 18, 2014 at 09:36:14AM -0700, bradley.d.volkin at intel.com wrote:
> > > From: Brad Volkin <bradley.d.volkin at intel.com>
> > >
> > > This patch sets up all of the tracking and copying necessary to
> > > use batch pools with the command parser, but does not actually
> > > dispatch the copied (shadow) batch to the hardware yet. We still
> > > aren't quite ready to set the secure bit during dispatch.
> > >
> > > Note that performance takes a hit from the copy in some cases
> > > and will likely need some work. At a rough pass, the memcpy
> > > appears to be the bottleneck. Without having done a deeper
> > > analysis, two ideas that come to mind are:
> > > 1) Copy sections of the batch at a time, as they are reached
> > > by parsing. Might improve cache locality.
> > > 2) Copy only up to the userspace-supplied batch length and
> > > memset the rest of the buffer. Reduces the number of reads.
> > >
> > > Signed-off-by: Brad Volkin <bradley.d.volkin at intel.com>
> > > ---
> > > drivers/gpu/drm/i915/i915_cmd_parser.c | 75 ++++++++++++++++++++++------
> > > drivers/gpu/drm/i915/i915_drv.h | 7 ++-
> > > drivers/gpu/drm/i915/i915_gem.c | 8 ++-
> > > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 45 +++++++++++++++--
> > > drivers/gpu/drm/i915/i915_gem_render_state.c | 2 +-
> > > drivers/gpu/drm/i915/intel_ringbuffer.c | 12 +++++
> > > drivers/gpu/drm/i915/intel_ringbuffer.h | 7 +++
> > > 7 files changed, 134 insertions(+), 22 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > index dea99d9..669afb0 100644
> > > --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > @@ -831,6 +831,53 @@ finish:
> > > return (u32*)addr;
> > > }
> > >
> > > +/* Returns a vmap'd pointer to dest_obj, which the caller must unmap */
> > > +static u32 *copy_batch(struct drm_i915_gem_object *dest_obj,
> > > + struct drm_i915_gem_object *src_obj)
> > > +{
> > > + int ret = 0;
> > > + int needs_clflush = 0;
> > > + u32 *src_addr, *dest_addr = NULL;
> > > +
> > > + ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
> > > + if (ret) {
> > > + DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > > + return ERR_PTR(ret);
> > > + }
> > > +
> > > + src_addr = vmap_batch(src_obj);
> > > + if (!src_addr) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > > + ret = -ENOMEM;
> > > + goto unpin_src;
> > > + }
> > > +
> > > + if (needs_clflush)
> > > + drm_clflush_virt_range((char *)src_addr, src_obj->base.size);
> > > +
> > > + ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
> > > + if (ret) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to set batch CPU domain\n");
> > > + goto unmap_src;
> > > + }
> > > +
> > > + dest_addr = vmap_batch(dest_obj);
> > > + if (!dest_addr) {
> > > + DRM_DEBUG_DRIVER("CMD: Failed to vmap shadow batch\n");
> > > + ret = -ENOMEM;
> > > + goto unmap_src;
> > > + }
> > > +
> > > + memcpy(dest_addr, src_addr, src_obj->base.size);
> > > +
> > > +unmap_src:
> > > + vunmap(src_addr);
> > > +unpin_src:
> > > + i915_gem_object_unpin_pages(src_obj);
> > > +
> > > + return ret ? ERR_PTR(ret) : dest_addr;
> > > +}
> >
> > src_obj->size? You should perhaps borrow a byt.
>
> Not sure I completely follow you here. The dest_obj is as big or bigger than
> the source, so I think copying only as much data as exists in the source
> object is right. I should be writing nops into any extra space though. And in
> parse_cmds, I should update the batch_end calculation. Were those the issues
> that you saw?
Just that generally src->size >> batch len, and the clflush will make it
twice as expensive on byt. (The object may be about 1000 times larger
than the batch at the extreme, libva *cough*.)
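For illustration, a rough sketch of what copying only the user-supplied
range could look like (not the actual patch; batch_start_offset/batch_len
are assumed to be plumbed through from the execbuffer args, everything
else mirrors copy_batch() above):

static u32 *copy_batch_range(struct drm_i915_gem_object *dest_obj,
			     struct drm_i915_gem_object *src_obj,
			     u32 batch_start_offset, u32 batch_len)
{
	int ret = 0;
	int needs_clflush = 0;
	u32 *src_addr, *dest_addr = NULL;

	ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
	if (ret)
		return ERR_PTR(ret);

	src_addr = vmap_batch(src_obj);
	if (!src_addr) {
		ret = -ENOMEM;
		goto unpin_src;
	}

	/* Flush only the bytes the parser will actually read. */
	if (needs_clflush)
		drm_clflush_virt_range((char *)src_addr + batch_start_offset,
				       batch_len);

	ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
	if (ret)
		goto unmap_src;

	dest_addr = vmap_batch(dest_obj);
	if (!dest_addr) {
		ret = -ENOMEM;
		goto unmap_src;
	}

	/* Copy the real batch, then pad the rest of the shadow.
	 * MI_NOOP is 0, so a plain zero-fill is sufficient. */
	memcpy(dest_addr, (char *)src_addr + batch_start_offset, batch_len);
	memset((char *)dest_addr + batch_len, 0,
	       dest_obj->base.size - batch_len);

unmap_src:
	vunmap(src_addr);
unpin_src:
	i915_gem_object_unpin_pages(src_obj);

	return ret ? ERR_PTR(ret) : dest_addr;
}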
> > > static int
> > > @@ -1087,6 +1088,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > struct drm_i915_private *dev_priv = dev->dev_private;
> > > struct eb_vmas *eb;
> > > struct drm_i915_gem_object *batch_obj;
> > > + struct drm_i915_gem_object *shadow_batch_obj = NULL;
> > > struct drm_clip_rect *cliprects = NULL;
> > > struct intel_engine_cs *ring;
> > > struct intel_context *ctx;
> > > @@ -1288,8 +1290,31 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
> > >
> > > if (i915_needs_cmd_parser(ring)) {
> > > + shadow_batch_obj =
> > > + i915_gem_batch_pool_get(ring->batch_pool,
> > > + batch_obj->base.size);
> > > + if (IS_ERR(shadow_batch_obj)) {
> > > + ret = PTR_ERR(shadow_batch_obj);
> > > + /* Don't try to clean up the obj in the error path */
> > > + shadow_batch_obj = NULL;
> > > +
> > > + /*
> > > + * If the pool is at capacity, retiring requests might
> > > + * return some buffers.
> > > + */
> > > + if (ret == -EAGAIN)
> > > + i915_gem_retire_requests_ring(ring);
> > > +
> > > + goto err;
> > > + }
> > > +
> > > + ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
> > > + if (ret)
> > > + goto err;
> > > +
> > > ret = i915_parse_cmds(ring,
> > > batch_obj,
> > > + shadow_batch_obj,
> > > args->batch_start_offset,
> > > file->is_master);
> >
> > You could just allocate the shadow inside parse_cmds and return it. Then
> > replace the conventional batch_obj with the new pointer and add it to
> > the eb list. Similarly, you do not need to track shadow_batch_obj
> > explicitly in the requests, the pool can do its own busy tracking
> > external to the core and reduce the invasiveness of the patch.
>
> Overall, I agree that this touched more places than I would have liked.
>
> I initially thought about replacing batch_obj and hooking into the eb list,
> but that seemed like it would involve trickier code w.r.t. ref counting,
> making up an exec_entry for the vma, etc. Not the end of the world, but
> it felt less obvious. I can look again though if you think it's worth it.
I think so. The request is already complicated enough and I think the
shadow batch can fit neatly inside the rules we already have with a
modicum of stitching here.
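Roughly, something like the sketch below on the execbuffer side -- with
i915_parse_cmds() hypothetically changed to allocate the shadow from the
pool and return it; the eb-list/exec_entry stitching you mention is only
noted in the comments:

	if (i915_needs_cmd_parser(ring)) {
		struct drm_i915_gem_object *shadow;

		/* Parser pulls a buffer from the pool, copies and
		 * validates the client batch into it, and hands back
		 * the copy. */
		shadow = i915_parse_cmds(ring, batch_obj,
					 args->batch_start_offset,
					 file->is_master);
		if (IS_ERR(shadow)) {
			ret = PTR_ERR(shadow);
			goto err;
		}

		/* Dispatch the validated copy instead of the client's
		 * batch. The real code would also add the shadow's vma
		 * to eb->vmas with its own exec_entry and reference. */
		batch_obj = shadow;
	}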
> What specifically are you thinking in terms of implementing busy tracking
> in the pool? The idea with adding the shadow object to the request was just
> to get it back in the pool and purgeable asap. I also thought it would limit
> some additional code given that we know buffers in the pool have had any
> pending work completed. Maybe the suggested approach would do a better job
> of those things though.
I was thinking that a pool is basically a list of bo, and you simply
query whether the oldest is still busy when we need a new bo. Which is
the same as how userspace implements its pool of active/inactive objects.
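Something like the sketch below, purely illustrative (the list/active
fields and the pool layout are assumptions, not the pool API from this
series):

static struct drm_i915_gem_object *
batch_pool_get(struct i915_gem_batch_pool *pool, size_t size)
{
	struct drm_i915_gem_object *obj;

	/* Oldest buffer sits at the head of the cache list. */
	obj = list_first_entry_or_null(&pool->cache_list,
				       struct drm_i915_gem_object,
				       batch_pool_list);
	if (obj && !obj->active && obj->base.size >= size) {
		/* GPU is done with it and it is big enough: recycle,
		 * moving it to the tail so the list stays in LRU order. */
		list_move_tail(&obj->batch_pool_list, &pool->cache_list);
		return obj;
	}

	/* Otherwise fall back to a fresh allocation. */
	return i915_gem_alloc_object(pool->dev, size);
}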
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > index e72017b..82aae29 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > @@ -188,6 +188,13 @@ struct intel_engine_cs {
> > > bool needs_cmd_parser;
> > >
> > > /*
> > > + * A pool of objects to use as shadow copies of client batch buffers
> > > + * when the command parser is enabled. Prevents the client from
> > > + * modifying the batch contents after software parsing.
> > > + */
> > > + struct i915_gem_batch_pool *batch_pool;
> >
> > Why are they tied to a ring?
>
> There was a question as to whether to make a global pool or per-ring
> pools way back when I first sent the parser series. Daniel expressed
> a slight preference for per-ring, but I don't object to a global pool.
> I suppose a potential benefit to per-ring would be if userspace uses
> different sized buffers on different rings, we'd more easily find a
> suitable buffer. But enhancing the pool management could fix that too.
Batch buffer sizes are not fixed per ring anyway. I guess Daniel thought
it might work out easier, but memory is a global resource and should be
tracked primarily at the device level. Userspace segregates its caches
based on size to speed up searches.
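For reference, the userspace-style trick is just one free list per size
class so a lookup only scans buffers that could satisfy the request,
e.g. (illustrative only, not proposing this exact helper):

	struct batch_pool {
		struct list_head buckets[16];	/* 4KiB << n size classes */
	};

	static struct list_head *
	pool_bucket(struct batch_pool *pool, size_t size)
	{
		int n = fls64(DIV_ROUND_UP(size, 4096)) - 1;

		if (size > (4096ULL << n))
			n++;	/* round up to the next size class */
		return &pool->buckets[min(n, 15)];
	}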
-Chris
--
Chris Wilson, Intel Open Source Technology Centre