[Intel-gfx] [PATCH 3/3] [RFC] drm/i915: add interface to simulate gpu hangs

Daniel Vetter daniel at ffwll.ch
Sat Nov 5 23:30:42 CET 2011


On Sat, Nov 05, 2011 at 03:07:45PM -0700, Ben Widawsky wrote:
> On Wed,  2 Nov 2011 12:46:36 +0100
> Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> 
> > GPU reset is a very important piece of our infrastructure.
> > Unfortunately we only really test it by actually hanging the gpu,
> > which often has bad side-effects for the entire system. And the gpu
> > hang handling code is one of the rather complicated pieces of code we
> > have, consisting of
> > - hang detection
> > - error capture
> > - actual gpu reset
> > - reset of all the gem bookkeeping
> > - reinitialization of the entire gpu
> > 
> > This patch adds a debugfs file to selectively stop rings by ceasing to
> > update their hw tail pointers. This way we can exercise the gpu hang code
> > under controlled conditions without a dying gpu taking down the entire
> > system.
> > 
> > Patch motivated by me forgetting to properly reinitialize ppgtt after
> > a gpu reset.
> > 
> > Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> I like the concept here very much. One thing I thought might be
> interesting would be to allow hanging rings temporarily, instead of
> hanging them until reset as soon as the debugfs entry is written. If we
> could dramatically alter the time it takes to get commands into the
> ring, that would really exercise our buffer dependency tracking. The
> obvious target to me is to hang the media ring for, say, 33ms and see
> what happens to the system.
> 
> Also, maybe I missed something but could you explain how you use this
> differently than i915_wedged?

I fail at writing decent patch descriptions. The difference wrt
i915_wedged is the essence of the idea ;-) Let's try again:

The idea is to simulate a gpu hang as accurately as possible without
actually hanging the gpu (or touching it much in any way), because when
the gpu is dead for real it tends to take down the entire system. One way
to do that is to simply set the wedged state and manually call the
error_handler, like i915_wedged does. That leaves out all the hangcheck
detection code, though, and it runs the error_capture in a completely
different context.

So we need a way to stop the gpu without actually touching it and without
faking anything in our gem code. The trick used in this patch is to
simply cease to update the ringbuffer tail pointer: the gpu stops
processing commands, which hangcheck eventually notices because the gpu
head pointer no longer advances. Then the entire gpu hang machinery fires
off as if the gpu had died for real (but hopefully with a massively
increased chance that we survive the reset).
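
To make the usage concrete, stopping e.g. the render ring from userspace
would look roughly like this (untested sketch; the debugfs path and the
render ring being bit 0 in the mask are assumptions on my part):

/* Untested sketch: poke the new debugfs file to stop the render ring.
 * Assumes debugfs is mounted at /sys/kernel/debug, the device is drm
 * minor 0 and the render ring flag is bit 0. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/debug/dri/0/i915_ring_stop", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (write(fd, "0x1", 3) != 3)	/* render ring only, assuming bit 0 */
		perror("write");

	close(fd);

	return 0;
}

Any batch submitted afterwards just sits in the ring, and after a few
seconds hangcheck should declare the gpu hung and fire off the whole
reset machinery.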

> 
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c     |   63 +++++++++++++++++++++++++++++++
> >  drivers/gpu/drm/i915/i915_drv.c         |    3 +
> >  drivers/gpu/drm/i915/i915_drv.h         |    2 +
> >  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ++
> >  4 files changed, 72 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 8476441..9821a3b 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1392,6 +1392,64 @@ static const struct file_operations i915_wedged_fops = {
> >  };
> >  
> >  static ssize_t
> > +i915_ring_stop_read(struct file *filp,
> > +		    char __user *ubuf,
> > +		    size_t max,
> > +		    loff_t *ppos)
> > +{
> > +	struct drm_device *dev = filp->private_data;
> > +	drm_i915_private_t *dev_priv = dev->dev_private;
> > +	char buf[80];
> > +	int len;
> > +
> > +	len = snprintf(buf, sizeof(buf),
> > +		       "%d\n", dev_priv->stop_rings);
> > +
> > +	if (len > sizeof(buf))
> > +		len = sizeof(buf);
> > +
> > +	return simple_read_from_buffer(ubuf, max, ppos, buf, len);
> > +}
> 
> buf[80] seems a little excessive.

Oops, copy&paste. I'll shrink it a bit ...
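
I'll just reuse the write side's size; something like this (untested)
should do:

	char buf[20];
	int len;

	len = scnprintf(buf, sizeof(buf), "%u\n", dev_priv->stop_rings);

	return simple_read_from_buffer(ubuf, max, ppos, buf, len);

scnprintf can't return more than sizeof(buf) - 1, so the manual clamping
can go, too.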

> 
> > +
> > +static ssize_t
> > +i915_ring_stop_write(struct file *filp,
> > +		     const char __user *ubuf,
> > +		     size_t cnt,
> > +		     loff_t *ppos)
> > +{
> > +	struct drm_device *dev = filp->private_data;
> > +	struct drm_i915_private *dev_priv = dev->dev_private;
> > +	char buf[20];
> > +	int val = 0;
> > +
> > +	if (cnt > 0) {
> > +		if (cnt > sizeof(buf) - 1)
> > +			return -EINVAL;
> > +
> > +		if (copy_from_user(buf, ubuf, cnt))
> > +			return -EFAULT;
> > +		buf[cnt] = 0;
> > +
> > +		val = simple_strtoul(buf, NULL, 0);
> > +	}
> > +
> > +	DRM_DEBUG_DRIVER("Stopping rings %u\n", val);
> > +
> > +	mutex_lock(&dev->struct_mutex);
> > +	dev_priv->stop_rings = val;
> > +	mutex_unlock(&dev->struct_mutex);
> > +
> > +	return cnt;
> > +}
> > +
> 
> I think an atomic takes away the need for struct_mutex, unless you plan
> to do more.

atomic_t would require atomic ops on the read side. I don't care about
locking on the read side as long as the new value will eventually show up,
so I've opted for the unlocked (and fastest) option for reads.

> 
> > +static const struct file_operations i915_ring_stop_fops = {
> > +	.owner = THIS_MODULE,
> > +	.open = i915_debugfs_common_open,
> > +	.read = i915_ring_stop_read,
> > +	.write = i915_ring_stop_write,
> > +	.llseek = default_llseek,
> > +};
> > +static ssize_t
> >  i915_max_freq_read(struct file *filp,
> >  		   char __user *ubuf,
> >  		   size_t max,
> > @@ -1691,6 +1749,11 @@ int i915_debugfs_init(struct drm_minor *minor)
> >  				  &i915_cache_sharing_fops);
> >  	if (ret)
> >  		return ret;
> > +	ret = i915_debugfs_create(minor->debugfs_root, minor,
> > +				  "i915_ring_stop",
> > +				  &i915_ring_stop_fops);
> > +	if (ret)
> > +		return ret;
> >  
> >  	return drm_debugfs_create_files(i915_debugfs_list,
> >  					I915_DEBUGFS_ENTRIES,
> 
> I think the fact that you're stopping rings is an implementation detail,
> and the name doesn't sound nearly dangerous enough. I'd just call it
> i915_hang_gpu or something like that. Unless of course you plan
> something like what I mentioned at the top about being able to restart
> the ring.

Because our current hangcheck code only checks the render ring, I wanted
to be able to stop rings individually (so we can test hangcheck
extensions for the other rings). Adding an extension to restart the rings
sounds good; it would just require redoing the last tail write that got
dropped.
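
Completely untested, but I'd expect it to look roughly like this
(intel_ring_restart is a made-up name):

static void intel_ring_restart(struct intel_ring_buffer *ring)
{
	struct drm_i915_private *dev_priv = ring->dev->dev_private;

	/* Clear the stop flag, then redo the tail write that
	 * intel_ring_advance dropped while the ring was stopped -
	 * ring->tail still tracks everything we've emitted, the hw
	 * just never saw it. Must hold struct_mutex to avoid racing
	 * with new command submission. */
	dev_priv->stop_rings &= ~intel_ring_flag(ring);
	ring->write_tail(ring, ring->tail);
}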

> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index 548e04b..566cc1e 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -611,6 +611,9 @@ int i915_reset(struct drm_device *dev, u8 flags)
> >  	if (!mutex_trylock(&dev->struct_mutex))
> >  		return -EBUSY;
> >  
> > +	printk("reenabling rings\n");
> > +	dev_priv->stop_rings = 0;
> > +
> >  	i915_gem_reset(dev);
> >  
> >  	ret = -ENODEV;
> 
> Probably shouldn't be using printk, just to keep things consistent
> with everything else; use DRM_DEBUG_DRIVER instead.

Oops, debug leftover, I'll change this.
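
I.e. for v2:

	DRM_DEBUG_DRIVER("re-enabling rings\n");
	dev_priv->stop_rings = 0;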

> 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index bd98fb3..503ae8c 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -330,6 +330,8 @@ typedef struct drm_i915_private {
> >  	uint32_t last_instdone;
> >  	uint32_t last_instdone1;
> >  
> > +	unsigned int stop_rings;
> > +
> >  	unsigned long cfb_size;
> >  	unsigned int cfb_fb;
> >  	enum plane cfb_plane;
> 
> bool? Also I think you could consider treating this per ring.

It's a bitmask with one flag bit per ring, so it's already per-ring. See
below.
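
For illustration, some hypothetical convenience defines (the bit
assignments assume intel_ring_flag(ring) == 1 << ring->id, which I
haven't double-checked):

/* hypothetical defines for the debugfs bitmask; the bit positions are
 * an assumption and need to be checked against intel_ring_flag() */
#define STOP_RING_RENDER	(1 << 0)
#define STOP_RING_BSD		(1 << 1)
#define STOP_RING_BLT		(1 << 2)

Writing 0xffffffff stops everything regardless of the exact layout.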

> 
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index 3c30dba..ef7a1ca 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -1179,7 +1179,11 @@ int intel_ring_begin(struct intel_ring_buffer *ring,
> >  
> >  void intel_ring_advance(struct intel_ring_buffer *ring)
> >  {
> > +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> > +
> >  	ring->tail &= ring->size - 1;
> > +	if (dev_priv->stop_rings & intel_ring_flag(ring))
> > +		return;
> >  	ring->write_tail(ring, ring->tail);
> >  }
> >  
> 
> Maybe wrap the if with unlikely()?

Yeah.
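
I.e. the check in intel_ring_advance becomes:

	ring->tail &= ring->size - 1;
	if (unlikely(dev_priv->stop_rings & intel_ring_flag(ring)))
		return;
	ring->write_tail(ring, ring->tail);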

-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48


