[Intel-gfx] [RFC 2/4] drm/i915: Use batch pools with the command parser

Wed Jun 18 19:49:05 CEST 2014

On Wed, Jun 18, 2014 at 09:52:52AM -0700, Chris Wilson wrote:
> On Wed, Jun 18, 2014 at 09:36:14AM -0700, bradley.d.volkin at intel.com wrote:
> > From: Brad Volkin <bradley.d.volkin at intel.com>
> > 
> > This patch sets up all of the tracking and copying necessary to
> > use batch pools with the command parser, but does not actually
> > dispatch the copied (shadow) batch to the hardware yet. We still
> > aren't quite ready to set the secure bit during dispatch.
> > 
> > Note that performance takes a hit from the copy in some cases
> > and will likely need some work. At a rough pass, the memcpy
> > appears to be the bottleneck. Without having done a deeper
> > analysis, two ideas that come to mind are:
> > 1) Copy sections of the batch at a time, as they are reached
> >    by parsing. Might improve cache locality.
> > 2) Copy only up to the userspace-supplied batch length and
> >    memset the rest of the buffer. Reduces the number of reads.
> > 
> > Signed-off-by: Brad Volkin <bradley.d.volkin at intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_cmd_parser.c       | 75 ++++++++++++++++++++++------
> >  drivers/gpu/drm/i915/i915_drv.h              |  7 ++-
> >  drivers/gpu/drm/i915/i915_gem.c              |  8 ++-
> >  drivers/gpu/drm/i915/i915_gem_execbuffer.c   | 45 +++++++++++++++--
> >  drivers/gpu/drm/i915/i915_gem_render_state.c |  2 +-
> >  drivers/gpu/drm/i915/intel_ringbuffer.c      | 12 +++++
> >  drivers/gpu/drm/i915/intel_ringbuffer.h      |  7 +++
> >  7 files changed, 134 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > index dea99d9..669afb0 100644
> > --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
> > +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > @@ -831,6 +831,53 @@ finish:
> >  	return (u32*)addr;
> >  }
> >  
> > +/* Returns a vmap'd pointer to dest_obj, which the caller must unmap */
> > +static u32 *copy_batch(struct drm_i915_gem_object *dest_obj,
> > +		       struct drm_i915_gem_object *src_obj)
> > +{
> > +	int ret = 0;
> > +	int needs_clflush = 0;
> > +	u32 *src_addr, *dest_addr = NULL;
> > +
> > +	ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
> > +	if (ret) {
> > +		DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	src_addr = vmap_batch(src_obj);
> > +	if (!src_addr) {
> > +		DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > +		ret = -ENOMEM;
> > +		goto unpin_src;
> > +	}
> > +
> > +	if (needs_clflush)
> > +		drm_clflush_virt_range((char *)src_addr, src_obj->base.size);
> > +
> > +	ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
> > +	if (ret) {
> > +		DRM_DEBUG_DRIVER("CMD: Failed to set batch CPU domain\n");
> > +		goto unmap_src;
> > +	}
> > +
> > +	dest_addr = vmap_batch(dest_obj);
> > +	if (!dest_addr) {
> > +		DRM_DEBUG_DRIVER("CMD: Failed to vmap shadow batch\n");
> > +		ret = -ENOMEM;
> > +		goto unmap_src;
> > +	}
> > +
> > +	memcpy(dest_addr, src_addr, src_obj->base.size);
> > +
> > +unmap_src:
> > +	vunmap(src_addr);
> > +unpin_src:
> > +	i915_gem_object_unpin_pages(src_obj);
> > +
> > +	return ret ? ERR_PTR(ret) : dest_addr;
> > +}
> 
> src_obj->size? You should perhaps borrow a byt.

Not sure I completely follow you here. The dest_obj is as big or bigger than
the source, so I think copying only as much data as exists in the source
object is right. I should be writing nops into any extra space though. And in
parse_cmds, I should update the batch_end calculation. Were those the issues
that you saw?

> 
> >  static int
> > @@ -1087,6 +1088,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> >  	struct drm_i915_private *dev_priv = dev->dev_private;
> >  	struct eb_vmas *eb;
> >  	struct drm_i915_gem_object *batch_obj;
> > +	struct drm_i915_gem_object *shadow_batch_obj = NULL;
> >  	struct drm_clip_rect *cliprects = NULL;
> >  	struct intel_engine_cs *ring;
> >  	struct intel_context *ctx;
> > @@ -1288,8 +1290,31 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> >  	batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
> >  
> >  	if (i915_needs_cmd_parser(ring)) {
> > +		shadow_batch_obj =
> > +			i915_gem_batch_pool_get(ring->batch_pool,
> > +						batch_obj->base.size);
> > +		if (IS_ERR(shadow_batch_obj)) {
> > +			ret = PTR_ERR(shadow_batch_obj);
> > +			/* Don't try to clean up the obj in the error path */
> > +			shadow_batch_obj = NULL;
> > +
> > +			/*
> > +			 * If the pool is at capcity, retiring requests might
> > +			 * return some buffers.
> > +			 */
> > +			if (ret == -EAGAIN)
> > +				i915_gem_retire_requests_ring(ring);
> > +
> > +			goto err;
> > +		}
> > +
> > +		ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
> > +		if (ret)
> > +			goto err;
> > +
> >  		ret = i915_parse_cmds(ring,
> >  				      batch_obj,
> > +				      shadow_batch_obj,
> >  				      args->batch_start_offset,
> >  				      file->is_master);
> 
> You could just allocate the shadow inside parse_cmds and return it. Then
> replace the conventional batch_boj with the new pointer and add it to
> the eb list. Similarly, you do not need to track shadow_batch_obj
> explicitly in the requests, the pool can do its own busy tracking
> external to the core and reduce the invasiveness of the patch.

Overall, I agree that this touched more places than I would have liked.

I initially thought about replacing batch_obj and hooking into the eb list,
but that seemed like it would involve trickier code w.r.t. ref counting,
making up an exec_entry for the vma, etc. Not the end of the world, but
it felt less obvious. I can look again though if you think it's worth it.

What specifically are you thinking in terms of implementing busy tracking
in the pool? The idea with adding the shadow object to the request was just
to get it back in the pool and purgeable asap. I also thought it would limit
some additional code given that we know buffers in the pool have had any
pending work completed. Maybe the suggested approach would do a better job
of those things though.

> 
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > index e72017b..82aae29 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > @@ -188,6 +188,13 @@ struct  intel_engine_cs {
> >  	bool needs_cmd_parser;
> >  
> >  	/*
> > +	 * A pool of objects to use as shadow copies of client batch buffers
> > +	 * when the command parser is enabled. Prevents the client from
> > +	 * modifying the batch contents after software parsing.
> > +	 */
> > +	struct i915_gem_batch_pool *batch_pool;
> 
> Why are they tied to a ring?

There was a question as to whether to make a global pool or per-ring
pools way back when I first sent the parser series. Daniel expressed
a slight preference for per-ring, but I don't object to a global pool.
I suppose a potential benefit to per-ring would be if userspace uses
different sized buffers on different rings, we'd more easily find a
suitable buffer. But enhancing the pool management could fix that too.

Brad

> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre