[Intel-gfx] [PATCH 1/2] drm/i915: Only slightly increment hangcheck score if we succesfully kick a ring

Fri Jun 7 09:33:51 CEST 2013

On Thu, Jun 06, 2013 at 08:58:10PM -0700, Ben Widawsky wrote:
> On Thu, Jun 06, 2013 at 10:45:42AM +0100, Chris Wilson wrote:
> > +		if (ring->hangcheck.seqno == seqno) {
> > +			if (ring_idle(ring, seqno)) {
> > +				if (waitqueue_active(&ring->irq_queue)) {
> > +					/* Issue a wake-up to catch stuck h/w. */
> > +					DRM_ERROR("Hangcheck timer elapsed... %s idle\n",
> > +						  ring->name);
> > +					wake_up_all(&ring->irq_queue);
> > +					ring->hangcheck.score += HUNG;
> > +					busy_count++;
> > +				}
> 
> Correct me if I'm wrong, but we shouldn't be able to get here if the
> waitqueue isn't active, so perhaps a WARN for the else?

No, we will inspect unused rings whilst other rings are running. So we
expect to find rings that never advance the seqno and have no requests
pending.

> >  			} else {
> > -				ring->hangcheck.score = 0;
> > +				stuck[i] = ring->hangcheck.acthd == acthd;
> > +				if (stuck[i]) {
> > +					/* Every time we kick the ring, add a
> > +					 * small increment to the hangcheck
> > +					 * score so that we can catch a
> > +					 * batch that is repeatedly kicked.
> > +					 */
> > +					ring->hangcheck.score += KICK;
> > +					if (ring_hung(ring))
> > +						ring->hangcheck.score += HUNG;
> > +				}
> > +				busy_count++;
> >  			}
> > +		} else {
> > +			/* Gradually reduce the count so that we catch DoS
> > +			 * attempts across multiple batches.
> > +			 */
> > +			if (ring->hangcheck.score > 0)
> > +				ring->hangcheck.score--;
> >  		}
> 
> This else feels weird to me. I think we'd want to decrease at least by a
> multiple of KICK, or HUNG if seqno is actually moving. I'm also not
> certain I understand why we're worried about DoS. Did that conversation
> happen somewhere I can't find?

Mika said that the reason why he incremented hangcheck score whilst
kicking rings was in case someone submitted a batch with multiple
invalid WAITs. It is easy to extend that attack to an invalid WAIT per
batch and DoS the machine, if we forget about kicking after each batch.

I actually would prefer not to charge the ring at all unless we do kick
it (i.e. so we do not accidentally declare a machine hung whilst busy
but waiting on semapores). Time to respin again.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre