[Intel-gfx] [PATCH v3 2/5] drm/i915: Use batch pools with the command parser
Volkin, Bradley D
bradley.d.volkin at intel.com
Tue Nov 4 17:35:00 CET 2014
On Tue, Nov 04, 2014 at 02:17:59AM -0800, Daniel Vetter wrote:
> On Mon, Nov 03, 2014 at 11:19:42AM -0800, bradley.d.volkin at intel.com wrote:
> > From: Brad Volkin <bradley.d.volkin at intel.com>
> >
> > This patch sets up all of the tracking and copying necessary to
> > use batch pools with the command parser and dispatches the copied
> > (shadow) batch to the hardware.
> >
> > After this patch, the parser is in 'enabling' mode.
>
> Awesome. Ok, a few small comments and notes for polish on this one.
>
> >
> > Note that performance takes a hit from the copy in some cases
> > and will likely need some work. At a rough pass, the memcpy
> > appears to be the bottleneck. Without having done a deeper
> > analysis, two ideas that come to mind are:
> > 1) Copy sections of the batch at a time, as they are reached
> > by parsing. Might improve cache locality.
> > 2) Copy only up to the userspace-supplied batch length and
> > memset the rest of the buffer. Reduces the number of reads.
> >
> > v2:
> > - Remove setting the capacity of the pool
> > - One global pool instead of per-ring pools
> > - Replace batch_obj with shadow_batch_obj and hook into eb->vmas
> > - Memset any space in the shadow batch beyond what gets copied
>
> Hm, what is that memset for? Just appeasing the cmdparser?
>
> In that case why can't we just fail the scanning if we don't find the
> MI_BB_END within the limit - the hardware should never execute beyond
> that, so garbage beyond is ok. Or do I miss something.
Yeah, in retrospect this probably isn't needed. I think it was just some
extra paranoia to make sure the extra space is harmless.
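The hardware shouldn't execute past MI_BATCH_BUFFER_END, so if we drop the
memset the parser can just bound the scan by the userspace-supplied length
and fail when it never finds a BBE. Rough, untested sketch (batch_len here
would be args->batch_len plumbed into i915_parse_cmds(), which it doesn't
currently take):

	batch_end = cmd + (batch_len / sizeof(*batch_end));

	while (cmd < batch_end) {
		if (*cmd == MI_BATCH_BUFFER_END)
			break;

		/* ... existing per-command checks, cmd += length ... */
	}

	if (cmd >= batch_end) {
		DRM_DEBUG_DRIVER("CMD: Got to the end of the buffer w/o a BBE cmd!\n");
		ret = -EINVAL;
	}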
>
> > - Rebased on execlist prep refactoring
> >
> > v3:
> > - Rebase on chained batch handling
> > - Squash in setting the secure dispatch flag
> > - Add a note about the interaction w/secure dispatch pinning
> > - Check for request->batch_obj == NULL in i915_gem_free_request
> >
> > Signed-off-by: Brad Volkin <bradley.d.volkin at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_cmd_parser.c | 84 ++++++++++++++++++++++++------
> > drivers/gpu/drm/i915/i915_dma.c | 1 +
> > drivers/gpu/drm/i915/i915_drv.h | 8 +++
> > drivers/gpu/drm/i915/i915_gem.c | 10 ++++
> > drivers/gpu/drm/i915/i915_gem_execbuffer.c | 61 ++++++++++++++++++++--
> > 5 files changed, 143 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > index 809bb95..c8fe403 100644
> > --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
> > +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > @@ -838,6 +838,56 @@ finish:
> > return (u32*)addr;
> > }
> >
> > +/* Returns a vmap'd pointer to dest_obj, which the caller must unmap */
> > +static u32 *copy_batch(struct drm_i915_gem_object *dest_obj,
> > + struct drm_i915_gem_object *src_obj)
> > +{
> > + int ret = 0;
> > + int needs_clflush = 0;
> > + u32 *src_addr, *dest_addr = NULL;
> > +
> > + ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
> > + if (ret) {
> > + DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > + return ERR_PTR(ret);
> > + }
> > +
> > + src_addr = vmap_batch(src_obj);
> > + if (!src_addr) {
> > + DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > + ret = -ENOMEM;
> > + goto unpin_src;
> > + }
> > +
> > + if (needs_clflush)
> > + drm_clflush_virt_range((char *)src_addr, src_obj->base.size);
> > +
> > + ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
> > + if (ret) {
> > + DRM_DEBUG_DRIVER("CMD: Failed to set batch CPU domain\n");
> > + goto unmap_src;
> > + }
> > +
> > + dest_addr = vmap_batch(dest_obj);
> > + if (!dest_addr) {
> > + DRM_DEBUG_DRIVER("CMD: Failed to vmap shadow batch\n");
> > + ret = -ENOMEM;
> > + goto unmap_src;
> > + }
> > +
> > + memcpy(dest_addr, src_addr, src_obj->base.size);
> > + if (dest_obj->base.size > src_obj->base.size)
> > + memset((u8 *)dest_addr + src_obj->base.size, 0,
> > + dest_obj->base.size - src_obj->base.size);
>
> Hm, no in-line clflush and cached cpu mmaps. How slow is this on vlv?
I don't have a system to test on. Maybe we can get some QA perf testing on
all relevant platforms, since we'll want an overall sense of the perf impact anyhow.
Is there a specific change you would make here to avoid the issue?
>
> > +
> > +unmap_src:
> > + vunmap(src_addr);
> > +unpin_src:
> > + i915_gem_object_unpin_pages(src_obj);
> > +
> > + return ret ? ERR_PTR(ret) : dest_addr;
> > +}
> > +
> > /**
> > * i915_needs_cmd_parser() - should a given ring use software command parsing?
> > * @ring: the ring in question
> > @@ -954,6 +1004,7 @@ static bool check_cmd(const struct intel_engine_cs *ring,
> > * i915_parse_cmds() - parse a submitted batch buffer for privilege violations
> > * @ring: the ring on which the batch is to execute
> > * @batch_obj: the batch buffer in question
> > + * @shadow_batch_obj: copy of the batch buffer in question
> > * @batch_start_offset: byte offset in the batch at which execution starts
> > * @is_master: is the submitting process the drm master?
> > *
> > @@ -965,32 +1016,28 @@ static bool check_cmd(const struct intel_engine_cs *ring,
> > */
> > int i915_parse_cmds(struct intel_engine_cs *ring,
> > struct drm_i915_gem_object *batch_obj,
> > + struct drm_i915_gem_object *shadow_batch_obj,
> > u32 batch_start_offset,
> > bool is_master)
> > {
> > int ret = 0;
> > u32 *cmd, *batch_base, *batch_end;
> > struct drm_i915_cmd_descriptor default_desc = { 0 };
> > - int needs_clflush = 0;
> > bool oacontrol_set = false; /* OACONTROL tracking. See check_cmd() */
> >
> > - ret = i915_gem_obj_prepare_shmem_read(batch_obj, &needs_clflush);
> > - if (ret) {
> > - DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > - return ret;
> > + batch_base = copy_batch(shadow_batch_obj, batch_obj);
> > + if (IS_ERR(batch_base)) {
> > + DRM_DEBUG_DRIVER("CMD: Failed to copy batch\n");
> > + return PTR_ERR(batch_base);
> > }
> >
> > - batch_base = vmap_batch(batch_obj);
> > - if (!batch_base) {
> > - DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > - i915_gem_object_unpin_pages(batch_obj);
> > - return -ENOMEM;
> > - }
> > -
> > - if (needs_clflush)
> > - drm_clflush_virt_range((char *)batch_base, batch_obj->base.size);
> > -
> > cmd = batch_base + (batch_start_offset / sizeof(*cmd));
> > +
> > + /*
> > + * We use the source object's size because the shadow object is as
> > + * large or larger and copy_batch() will write MI_NOPs to the extra
> > + * space. Parsing should be faster in some cases this way.
> > + */
> > batch_end = cmd + (batch_obj->base.size / sizeof(*batch_end));
> >
> > while (cmd < batch_end) {
> > @@ -1052,7 +1099,12 @@ int i915_parse_cmds(struct intel_engine_cs *ring,
> >
> > vunmap(batch_base);
> >
> > - i915_gem_object_unpin_pages(batch_obj);
> > + if (!ret) {
> > + ret = i915_gem_object_set_to_gtt_domain(shadow_batch_obj,
> > + false);
>
> This smells like hacking around the domain tracking system. The batch is
> executed by the gpu, gtt domain is for cpu access. This might also be a
> reason why the shrinker wreaks havoc with you shadow patches. With the
> correct pending_read_domains move_to_active should take care of things
> mostly.
Ok, I'll look at this again. I was using the golden context batch buffer code
as a reference for this, so perhaps there's something to fix there as well.
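If I'm reading the suggestion right, that would mean dropping the
set_to_gtt_domain() call from i915_parse_cmds() entirely and just marking
the shadow object like any other batch once we swap it in, roughly
(untested):

	batch_obj = shadow_batch_obj;
	batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;

and then letting move_to_active() pick up the domain tracking once the
shadow vma is on eb->vmas.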
>
> > + if (ret)
> > + DRM_DEBUG_DRIVER("CMD: Failed to set batch GTT domain\n");
> > + }
> >
> > return ret;
> > }
> > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > index 9a73533..dea0f45 100644
> > --- a/drivers/gpu/drm/i915/i915_dma.c
> > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > @@ -1891,6 +1891,7 @@ int i915_driver_unload(struct drm_device *dev)
> >
> > mutex_lock(&dev->struct_mutex);
> > i915_gem_cleanup_ringbuffer(dev);
> > + i915_gem_batch_pool_fini(&dev_priv->mm.batch_pool);
> > i915_gem_context_fini(dev);
> > mutex_unlock(&dev->struct_mutex);
> > i915_gem_cleanup_stolen(dev);
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index fbf10cc..50d974d 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -1145,6 +1145,13 @@ struct i915_gem_mm {
> > */
> > struct list_head unbound_list;
> >
> > + /*
> > + * A pool of objects to use as shadow copies of client batch buffers
> > + * when the command parser is enabled. Prevents the client from
> > + * modifying the batch contents after software parsing.
> > + */
> > + struct i915_gem_batch_pool batch_pool;
> > +
> > /** Usable portion of the GTT for GEM */
> > unsigned long stolen_base; /* limited to low memory (32-bit) */
> >
> > @@ -2782,6 +2789,7 @@ void i915_cmd_parser_fini_ring(struct intel_engine_cs *ring);
> > bool i915_needs_cmd_parser(struct intel_engine_cs *ring);
> > int i915_parse_cmds(struct intel_engine_cs *ring,
> > struct drm_i915_gem_object *batch_obj,
> > + struct drm_i915_gem_object *shadow_batch_obj,
> > u32 batch_start_offset,
> > bool is_master);
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 4dbd7b9..c59100d 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -2500,6 +2500,14 @@ static void i915_gem_free_request(struct drm_i915_gem_request *request)
> > if (request->ctx)
> > i915_gem_context_unreference(request->ctx);
> >
> > + if (request->batch_obj &&
> > + !list_empty(&request->batch_obj->batch_pool_list)) {
> > + struct drm_i915_private *dev_priv = to_i915(request->ring->dev);
> > +
> > + i915_gem_batch_pool_put(&dev_priv->mm.batch_pool,
> > + request->batch_obj);
> > + }
>
> This looks a bit like reinvented active tracking. If we copy the libdrm
> cache design a bit more the cache would simply walk the active list when
> there's nothing in the inactive list and shovel all the objects with
> ->active.
>
> The advantage would be that we wouldn't leak special batch_pool_put code
> all over the place, keeping things nice&tidy.
Chris had also suggested looking at the libdrm cache with respect to this,
but I was concerned about the impact of the scheduler reordering requests.
I'd have to go back and look at libdrm, but it sounded like it was relying
on the FIFO nature of submissions. Could be wrong though.
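For reference, my reading of that approach is roughly this inside
i915_gem_batch_pool_get() (untested sketch; I'm assuming a pool->cache_list
of objects linked through batch_pool_list, per the earlier patch in this
series):

	struct drm_i915_gem_object *tmp, *obj = NULL;

	list_for_each_entry(tmp, &pool->cache_list, batch_pool_list) {
		/* Still in flight? Skip it rather than requiring an
		 * explicit batch_pool_put() when the request completes.
		 */
		if (tmp->active)
			continue;

		if (tmp->base.size >= size) {
			obj = tmp;
			break;
		}
	}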
Plus, we're hardly leaking batch_pool_put "all over the place" :). There are
exactly three call sites: this one for batches that execute and complete,
one for the chained batch fallback, and one in the execbuf error handling
path. So my inclination is really to leave it as is.
>
> > +
> > kfree(request);
> > }
> >
> > @@ -5019,6 +5027,8 @@ i915_gem_load(struct drm_device *dev)
> > dev_priv->mm.oom_notifier.notifier_call = i915_gem_shrinker_oom;
> > register_oom_notifier(&dev_priv->mm.oom_notifier);
> >
> > + i915_gem_batch_pool_init(dev, &dev_priv->mm.batch_pool);
> > +
> > mutex_init(&dev_priv->fb_tracking.lock);
> > }
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > index 4b7f5c1..c216ff2 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > @@ -1236,6 +1236,8 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > struct drm_i915_private *dev_priv = dev->dev_private;
> > struct eb_vmas *eb;
> > struct drm_i915_gem_object *batch_obj;
> > + struct drm_i915_gem_object *shadow_batch_obj = NULL;
> > + struct drm_i915_gem_exec_object2 shadow_exec_entry;
> > struct intel_engine_cs *ring;
> > struct intel_context *ctx;
> > struct i915_address_space *vm;
> > @@ -1361,22 +1363,63 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
>
> This line here to assign read domains is only required on the real batch
> obj. So I think this should be into the if maze below to make sure we only
> set it on the real batch (either the userspace one or the shadow copy).
Ok, I can put it after the whole parser block, after we've set
batch_obj = shadow_batch_obj;
>
> >
> > if (i915_needs_cmd_parser(ring)) {
> > + shadow_batch_obj =
> > + i915_gem_batch_pool_get(&dev_priv->mm.batch_pool,
> > + batch_obj->base.size);
> > + if (IS_ERR(shadow_batch_obj)) {
> > + ret = PTR_ERR(shadow_batch_obj);
> > + /* Don't try to clean up the obj in the error path */
> > + shadow_batch_obj = NULL;
> > + goto err;
> > + }
> > +
> > + ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
> > + if (ret)
> > + goto err;
> > +
> > ret = i915_parse_cmds(ring,
> > batch_obj,
> > + shadow_batch_obj,
> > args->batch_start_offset,
> > file->is_master);
> > if (ret) {
> > if (ret != -EACCES)
> > goto err;
> > + else {
> > + i915_gem_object_ggtt_unpin(shadow_batch_obj);
> > + i915_gem_batch_pool_put(&dev_priv->mm.batch_pool,
> > + shadow_batch_obj);
> > + }
> > } else {
> > + struct i915_vma *vma;
> > +
> > + memset(&shadow_exec_entry, 0,
> > + sizeof(shadow_exec_entry));
> > +
> > + vma = i915_gem_obj_to_ggtt(shadow_batch_obj);
> > + vma->exec_entry = &shadow_exec_entry;
> > + vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
> > + drm_gem_object_reference(&shadow_batch_obj->base);
> > + list_add_tail(&vma->exec_list, &eb->vmas);
> > +
> > + batch_obj = shadow_batch_obj;
> > +
> > /*
> > - * XXX: Actually do this when enabling batch copy...
> > + * Set the DISPATCH_SECURE bit to remove the NON_SECURE
> > + * bit from MI_BATCH_BUFFER_START commands issued in the
> > + * dispatch_execbuffer implementations. We specifically
> > + * don't want that set when the command parser is
> > + * enabled.
> > *
> > - * Set the DISPATCH_SECURE bit to remove the NON_SECURE bit
> > - * from MI_BATCH_BUFFER_START commands issued in the
> > - * dispatch_execbuffer implementations. We specifically don't
> > - * want that set when the command parser is enabled.
> > + * NB: By setting this flag, we're going to hit the
> > + * normal secure dispatch path below. That path will
> > + * do an extra pin of the shadow batch object and then
> > + * unpin after ->gt.do_execbuf(). At that point, the
> > + * object will have the single outstanding pin from
> > + * just before calling i915_parse_cmds(), and will get
> > + * unpinned via eb_destroy() and __EXEC_OBJECT_HAS_PIN.
> > */
>
> That sounds like this additional pin should be put into the cmd parser,
> just around the copy function.
I suppose we could do that. I had originally been trying to keep i915_parse_cmds()
to just mapping and parsing the batch, but I think it's already grown beyond that.
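If we do move it, I'd picture i915_parse_cmds() wrapping the copy something
like this (untested; the secure dispatch path would then do its own pin for
the actual execution, and the __EXEC_OBJECT_HAS_PIN bookkeeping would need
another look):

	ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
	if (ret) {
		DRM_DEBUG_DRIVER("CMD: Failed to pin shadow batch\n");
		return ret;
	}

	batch_base = copy_batch(shadow_batch_obj, batch_obj);
	i915_gem_object_ggtt_unpin(shadow_batch_obj);
	if (IS_ERR(batch_base)) {
		DRM_DEBUG_DRIVER("CMD: Failed to copy batch\n");
		return PTR_ERR(batch_base);
	}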
>
> Also do_execbuf is already way too big a function. I think it would be
> good to extract all the SECURE_DISPATCH/cmd parser/batch_bo special
> handling code into a new function for clarity. And then if possible not
> leak any of the cleanup code out.
The issue is all the potential failure points after this block. I'll look at it
again and see if there's a way to keep it clean.
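Something shaped roughly like this is what I'd aim for (untested sketch;
the name and the convention of handing the original batch_obj back on
-EACCES are just placeholders):

static struct drm_i915_gem_object *
i915_gem_execbuffer_parse(struct intel_engine_cs *ring,
			  struct drm_i915_gem_exec_object2 *shadow_exec_entry,
			  struct eb_vmas *eb,
			  struct drm_i915_gem_object *batch_obj,
			  u32 batch_start_offset,
			  bool is_master,
			  u32 *flags)
{
	struct drm_i915_private *dev_priv = to_i915(batch_obj->base.dev);
	struct drm_i915_gem_object *shadow_batch_obj;
	struct i915_vma *vma;
	int ret;

	shadow_batch_obj = i915_gem_batch_pool_get(&dev_priv->mm.batch_pool,
						   batch_obj->base.size);
	if (IS_ERR(shadow_batch_obj))
		return shadow_batch_obj;

	ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
	if (ret)
		goto err;

	ret = i915_parse_cmds(ring, batch_obj, shadow_batch_obj,
			      batch_start_offset, is_master);
	if (ret)
		goto err_unpin;

	memset(shadow_exec_entry, 0, sizeof(*shadow_exec_entry));

	vma = i915_gem_obj_to_ggtt(shadow_batch_obj);
	vma->exec_entry = shadow_exec_entry;
	vma->exec_entry->flags = __EXEC_OBJECT_HAS_PIN;
	drm_gem_object_reference(&shadow_batch_obj->base);
	list_add_tail(&vma->exec_list, &eb->vmas);

	/* See the NB above for why we take the secure dispatch path */
	*flags |= I915_DISPATCH_SECURE;

	return shadow_batch_obj;

err_unpin:
	i915_gem_object_ggtt_unpin(shadow_batch_obj);
err:
	i915_gem_batch_pool_put(&dev_priv->mm.batch_pool, shadow_batch_obj);
	return ret == -EACCES ? batch_obj : ERR_PTR(ret);
}

That would let do_execbuffer just swap batch_obj for the returned object and
keep all the cleanup inside the helper.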
>
>
> > + flags |= I915_DISPATCH_SECURE;
> > }
> > }
> >
> > @@ -1418,6 +1461,14 @@ err:
> > i915_gem_context_unreference(ctx);
> > eb_destroy(eb);
> >
> > + if (ret && ring && shadow_batch_obj) {
> > + if (i915_gem_obj_is_pinned(shadow_batch_obj))
> > + i915_gem_object_ggtt_unpin(shadow_batch_obj);
>
> Which would also clean up this code, which looks more like a hack to work
> around the pin_count check on final unref than anything solid and
> intentional ;-)
This is error-path cleanup for the window where we've allocated the shadow
batch and pinned it, but haven't yet hooked it into the vma list. Again,
I'll see if there's a cleaner way to do it, but I don't think it's that bad.
Brad
>
> > +
> > + i915_gem_batch_pool_put(&dev_priv->mm.batch_pool,
> > + shadow_batch_obj);
> > + }
> > +
> > mutex_unlock(&dev->struct_mutex);
> >
> > pre_mutex_err:
> > --
> > 1.9.1
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch