[Intel-gfx] [PATCH 5/5] drm/i915: Use the coarse mechanism based on drm fd to dispatch the BSD command on BDW GT3

Zhao Yakui yakui.zhao at intel.com
Thu Apr 10 10:04:22 CEST 2014


On Thu, 2014-04-10 at 00:48 -0600, Daniel Vetter wrote:
> On Thu, Apr 10, 2014 at 10:24:53AM +0800, Zhao Yakui wrote:
> > On Wed, 2014-04-09 at 08:34 -0600, Daniel Vetter wrote:
> > > On Wed, Apr 09, 2014 at 09:59:56AM +0800, Zhao Yakui wrote:
> > > > The BDW GT3 has two independent BSD rings that can be used to process
> > > > video commands. To keep things simple, this is transparent to the
> > > > user-space driver/middleware; instead, the kernel driver decides which
> > > > ring dispatches the BSD video command.
> > > > 
> > > > As each BSD ring is powerful enough, it is sufficient to dispatch BSD
> > > > video commands based on the drm fd. In that case different BSD rings
> > > > are used for video playback and encoding. At the same time, the coarse
> > > > dispatch mechanism helps avoid object synchronization between the BSD
> > > > rings.
> > > > 
> > > > Signed-off-by: Zhao Yakui <yakui.zhao at intel.com>
> > > 
> > > This looks way too complicated. First things first, please get rid of the
> > > atomic_t usage. If you don't have _massive_ comments explaining the memory
> > > barriers, you're most likely using linux kernel atomic_t wrong. They are
> > > fully unordered.
> > 
> > Thanks for the review.
> > 
> > For the atomic_t usage: I will remove it in the next version, as the
> > counter is already protected by the lock.
> > 
> > > 
> > > With that out of the way, this still looks a bit complicated really. Can't
> > > we just use a very simple static rule in gen8_dispatch_bsd_ring which
> > > hashes the pointer address of the file_priv? Just to get things going;
> > > once we have a clear need we can try to make things more intelligent. But
> > > in case of doubt I really prefer that we start with the dumbest possible
> > > approach first and add complexity, instead of starting with something
> > > really complex and simplifying it.
> > 
> > Do you mean that the file_priv is hashed and then mapped to BSD ring 0
> > or 1?
> 
> Yeah, that's the idea. Get the basic support in first, and make it fancy
> like you describe below as a second step. This has a few upsides:
> - We can concentrate on validating basic support in the first round
>   instead of potentially fighting a bug in the load balancer.
> - Discussions and performance testing for the load balancer won't hold up
>   the entire feature.
> - Like I've said this might not be required. Before we add more complexity
>   than just hashing the file_priv I want to see some benchmarks of
>   expected workloads that show that the load balancing is indeed a good
>   idea - for the case of a transcode server I guess we should have
>   sufficient in-flight operations that it won't really matter. Or at least
>   I hope so.
> 

OK, I understand your concerns. I can split it into two steps: the first adds
the basic support, and the second adds the optimization.

But I don't think that hashing the file_priv is a good idea. As there are
only two rings, it is possible that the hash value always maps to BSD ring 0.
In that case, when multiple video clips are played back, the performance
can't meet the requirement. (For example: a user can play back four 1080p
video clips concurrently when only one BSD ring is used. On BDW GT3 the
expectation is to play back eight 1080p clips concurrently. A poor hash will
map all the workloads to one BSD ring, and then the requirement can't be
met.)
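
For reference, the static rule I understand you to suggest would look roughly
like the sketch below. This is only a sketch: the helper name is made up, and
hash_ptr() just picks one bit from the file_priv pointer.

/*
 * Sketch only: statically map a drm fd to one of the two BSD rings by
 * hashing the file_priv pointer. The helper name is hypothetical.
 */
static int gen8_pick_bsd_ring_by_hash(struct drm_file *file)
{
	struct drm_i915_file_private *file_priv = file->driver_priv;

	/* hash_ptr() from <linux/hash.h>; one output bit selects VCS or VCS2 */
	return hash_ptr(file_priv, 1) ? VCS2 : VCS;
}

With only two rings and a handful of fds, nothing guarantees that such a
mapping spreads the fds evenly over VCS and VCS2.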

How about using a ping-pong mechanism for the file_priv instead? A new fd
will use BSD ring 0, the next file_priv will use BSD ring 1, then BSD ring 0
again, and so on. A rough sketch of what I mean is below.
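
This is only a sketch: the dev_priv->bsd_next_ring field and the helper name
don't exist yet and are assumptions, with the field protected by the bsd_lock
added in this patch.

/*
 * Sketch only: alternate new drm fds between the two BSD rings.
 * dev_priv->bsd_next_ring is a hypothetical field protected by bsd_lock.
 */
static int gen8_pick_bsd_ring_ping_pong(struct drm_device *dev,
					struct drm_file *file)
{
	struct drm_i915_private *dev_priv = dev->dev_private;
	struct drm_i915_file_private *file_priv = file->driver_priv;
	int ring_id;

	spin_lock(&dev_priv->bsd_lock);
	ring_id = dev_priv->bsd_next_ring;
	/* flip VCS <-> VCS2 for the next new fd */
	dev_priv->bsd_next_ring = (ring_id == VCS) ? VCS2 : VCS;
	spin_unlock(&dev_priv->bsd_lock);

	file_priv->bsd_ring = &dev_priv->ring[ring_id];
	return ring_id;
}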

Does this make sense to you?


> So maybe split this patch up into a first patch with the basic file_priv
> hash mapping and a second patch that adds the improved algorithm?
> 
> Cheers, Daniel
> 
> > The GT3 machine has two independent BSD rings. It would be better if the
> > kernel driver could balance the video workload between the two rings.
> > When a hashed file_priv is used to select the BSD ring, the balance
> > depends on the hash design. In some scenarios one ring can be very busy
> > while the other is very idle, and then the video playback/encoding
> > performance suffers. At the same time the hash mechanism only selects the
> > corresponding BSD ring when a drm_fd is opened; it doesn't rebalance the
> > video workload after some workloads finish.
> > 
> > The following is the basic idea in my patch (a counter is added per
> > ring; the bigger the counter, the higher the workload):
> >    a. When a new fd needs to dispatch a BSD video command, it selects the
> > ring with the lowest workload (lowest counter), and that ring's counter
> > is incremented.
> >    b. When the drm fd is closed (the workload is finished), the counter
> > of the ring used by its file_priv is decremented.
> >    c. When the drm fd has already selected a BSD ring in a previously
> > submitted command, it checks whether it is still using the ring with the
> > lowest workload (lowest counter). If not, it can be switched. The purpose
> > is to keep the workload balanced between the two BSD rings. For example:
> > a user plays back four video clips. BSD ring 0 is selected to play back
> > the two long clips and BSD ring 1 the two short clips. After the two
> > short clips finish, BSD ring 1 can be switched to play back one of the
> > long clips, so the load stays balanced.
> > 
> > What do you think?
> > 
> > > -Daniel
> > > 
> > > > ---
> > > >  drivers/gpu/drm/i915/i915_dma.c            |   14 ++++++
> > > >  drivers/gpu/drm/i915/i915_drv.h            |    3 ++
> > > >  drivers/gpu/drm/i915/i915_gem_execbuffer.c |   73 +++++++++++++++++++++++++++-
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +
> > > >  drivers/gpu/drm/i915/intel_ringbuffer.h    |    2 +
> > > >  5 files changed, 93 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> > > > index 0b38f88..8260463 100644
> > > > --- a/drivers/gpu/drm/i915/i915_dma.c
> > > > +++ b/drivers/gpu/drm/i915/i915_dma.c
> > > > @@ -1572,6 +1572,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
> > > >  	spin_lock_init(&dev_priv->backlight_lock);
> > > >  	spin_lock_init(&dev_priv->uncore.lock);
> > > >  	spin_lock_init(&dev_priv->mm.object_stat_lock);
> > > > +	spin_lock_init(&dev_priv->bsd_lock);
> > > >  	mutex_init(&dev_priv->dpio_lock);
> > > >  	mutex_init(&dev_priv->modeset_restore_lock);
> > > >  
> > > > @@ -1928,7 +1929,20 @@ void i915_driver_preclose(struct drm_device * dev, struct drm_file *file_priv)
> > > >  void i915_driver_postclose(struct drm_device *dev, struct drm_file *file)
> > > >  {
> > > >  	struct drm_i915_file_private *file_priv = file->driver_priv;
> > > > +	struct intel_ring_buffer *bsd_ring;
> > > > +	struct drm_i915_private *dev_priv = dev->dev_private;
> > > >  
> > > > +	if (file_priv && file_priv->bsd_ring) {
> > > > +		int cmd_counter;
> > > > +		bsd_ring = file_priv->bsd_ring;
> > > > +		file_priv->bsd_ring = NULL;
> > > > +		spin_lock(&dev_priv->bsd_lock);
> > > > +		cmd_counter = atomic_sub_return(1, &bsd_ring->bsd_cmd_counter);
> > > > +		if (cmd_counter < 0) {
> > > > +			atomic_set(&bsd_ring->bsd_cmd_counter, 0);
> > > > +		}
> > > > +		spin_unlock(&dev_priv->bsd_lock);
> > > > +	}
> > > >  	kfree(file_priv);
> > > >  }
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > > > index d77f4e0..128639c 100644
> > > > --- a/drivers/gpu/drm/i915/i915_drv.h
> > > > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > > > @@ -1457,6 +1457,8 @@ struct drm_i915_private {
> > > >  	struct i915_dri1_state dri1;
> > > >  	/* Old ums support infrastructure, same warning applies. */
> > > >  	struct i915_ums_state ums;
> > > > +	/* the lock for dispatch video commands on two BSD rings */
> > > > +	spinlock_t bsd_lock;
> > > >  };
> > > >  
> > > >  static inline struct drm_i915_private *to_i915(const struct drm_device *dev)
> > > > @@ -1664,6 +1666,7 @@ struct drm_i915_file_private {
> > > >  
> > > >  	struct i915_hw_context *private_default_ctx;
> > > >  	atomic_t rps_wait_boost;
> > > > +	struct  intel_ring_buffer *bsd_ring;
> > > >  };
> > > >  
> > > >  /*
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > > > index 3491402..75d8cc0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> > > > @@ -999,6 +999,70 @@ i915_reset_gen7_sol_offsets(struct drm_device *dev,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/**
> > > > + * Find one BSD ring to dispatch the corresponding BSD command.
> > > > + * The Ring ID is returned.
> > > > + */
> > > > +static int gen8_dispatch_bsd_ring(struct drm_device *dev,
> > > > +				  struct drm_file *file)
> > > > +{
> > > > +	struct drm_i915_private *dev_priv = dev->dev_private;
> > > > +	struct drm_i915_file_private *file_priv = file->driver_priv;
> > > > +	struct intel_ring_buffer *temp_ring, *bsd_ring;
> > > > +	int bsd_counter, temp_counter;
> > > > +
> > > > +	if (file_priv->bsd_ring) {
> > > > +		/* Check whether the load balance is required.*/
> > > > +		spin_lock(&dev_priv->bsd_lock);
> > > > +		bsd_counter = atomic_read(&(file_priv->bsd_ring->bsd_cmd_counter));
> > > > +		temp_ring = &dev_priv->ring[VCS];
> > > > +		temp_counter = atomic_read(&temp_ring->bsd_cmd_counter);
> > > > +		bsd_ring = &dev_priv->ring[VCS];
> > > > +
> > > > +		temp_ring = &dev_priv->ring[VCS2];
> > > > +		if (atomic_read(&temp_ring->bsd_cmd_counter) < temp_counter) {
> > > > +			temp_counter = atomic_read(&temp_ring->bsd_cmd_counter);
> > > > +			bsd_ring = temp_ring;
> > > > +		}
> > > > +		/*
> > > > +		 * If it is already the ring with the minimum load, it is
> > > > +		 * unnecessary to switch it.
> > > > +		 */
> > > > +		if (bsd_ring == file_priv->bsd_ring) {
> > > > +			spin_unlock(&dev_priv->bsd_lock);
> > > > +			return bsd_ring->id;
> > > > +		}
> > > > +		/*
> > > > +		 * If the load delta between current ring and target ring is
> > > > +		 * small, it is unnecessary to switch it.
> > > > +		 */
> > > > +		if ((bsd_counter - temp_counter) <= 1) {
> > > > +			spin_unlock(&dev_priv->bsd_lock);
> > > > +			return file_priv->bsd_ring->id;
> > > > +		}
> > > > +		/* balance the load between current ring and target ring */
> > > > +		atomic_dec(&file_priv->bsd_ring->bsd_cmd_counter);
> > > > +		atomic_inc(&bsd_ring->bsd_cmd_counter);
> > > > +		spin_unlock(&dev_priv->bsd_lock);
> > > > +		file_priv->bsd_ring = bsd_ring;
> > > > +		return bsd_ring->id;
> > > > +	} else {
> > > > +		spin_lock(&dev_priv->bsd_lock);
> > > > +		bsd_ring = &dev_priv->ring[VCS];
> > > > +		bsd_counter = atomic_read(&bsd_ring->bsd_cmd_counter);
> > > > +		temp_ring = &dev_priv->ring[VCS2];
> > > > +		temp_counter = atomic_read(&temp_ring->bsd_cmd_counter);
> > > > +		if (temp_counter < bsd_counter) {
> > > > +			bsd_ring = temp_ring;
> > > > +			bsd_counter = temp_counter;
> > > > +		}
> > > > +		atomic_inc(&bsd_ring->bsd_cmd_counter);
> > > > +		file_priv->bsd_ring = bsd_ring;
> > > > +		spin_unlock(&dev_priv->bsd_lock);
> > > > +		return bsd_ring->id;
> > > > +	}
> > > > +}
> > > > +
> > > >  static int
> > > >  i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > >  		       struct drm_file *file,
> > > > @@ -1043,7 +1107,14 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > > >  
> > > >  	if ((args->flags & I915_EXEC_RING_MASK) == I915_EXEC_DEFAULT)
> > > >  		ring = &dev_priv->ring[RCS];
> > > > -	else
> > > > +	else if ((args->flags & I915_EXEC_RING_MASK) == I915_EXEC_BSD) {
> > > > +		if (HAS_BSD2(dev)) {
> > > > +			int ring_id;
> > > > +			ring_id = gen8_dispatch_bsd_ring(dev, file);
> > > > +			ring = &dev_priv->ring[ring_id];
> > > > +		} else
> > > > +			ring = &dev_priv->ring[VCS];
> > > > +	} else
> > > >  		ring = &dev_priv->ring[(args->flags & I915_EXEC_RING_MASK) - 1];
> > > >  
> > > >  	if (!intel_ring_initialized(ring)) {
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > index 43e0227..852fc29 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > > > @@ -2123,6 +2123,7 @@ int intel_init_bsd_ring_buffer(struct drm_device *dev)
> > > >  		ring->dispatch_execbuffer = i965_dispatch_execbuffer;
> > > >  	}
> > > >  	ring->init = init_ring_common;
> > > > +	atomic_set(&ring->bsd_cmd_counter, 0);
> > > >  
> > > >  	return intel_init_ring_buffer(dev, ring);
> > > >  }
> > > > @@ -2170,6 +2171,7 @@ int intel_init_bsd2_ring_buffer(struct drm_device *dev)
> > > >  
> > > >  	ring->init = init_ring_common;
> > > >  
> > > > +	atomic_set(&ring->bsd_cmd_counter, 0);
> > > >  	return intel_init_ring_buffer(dev, ring);
> > > >  }
> > > >  
> > > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > > index 8ca4285..4f87b08 100644
> > > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > > @@ -196,6 +196,8 @@ struct  intel_ring_buffer {
> > > >  	 * to encode the command length in the header).
> > > >  	 */
> > > >  	u32 (*get_cmd_length_mask)(u32 cmd_header);
> > > > +
> > > > +	atomic_t bsd_cmd_counter;
> > > >  };
> > > >  
> > > >  static inline bool
> > > > -- 
> > > > 1.7.10.1
> > > > 
> > > > _______________________________________________
> > > > Intel-gfx mailing list
> > > > Intel-gfx at lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> > > 
> > 
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch