[Intel-gfx] [RFC] algorithm for handling bad cachelines

Wed Mar 28 23:15:27 CEST 2012

On Wed, 28 Mar 2012 02:59:26 -0700
Andi Kleen <andi at firstfloor.org> wrote:

> Ben Widawsky <ben at bwidawsk.net> writes:
> >
> > 1. Handle cache line going bad interrupt.
> > <After n number of these interrupts to the same line,>
> 
> Never use a global n without a timeout for corrected errors; you would
> need a leaky bucket with a suitable timeout.

As I understand electrons (which is not very well), parity errors happen
all the time and are transparently corrected by our HW. So I suppose
'n' is still interesting information, but your point is noted. It is
probably better to let userspace decide the value of n.

Take this with a grain of salt: the number of interrupts we will get
is speculative, as I haven't actually tried enabling this.
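
For reference, here is a minimal sketch of the kind of leaky bucket
suggested above, kept per cacheline. It is only an illustration: the
struct, the field names, the nanosecond timestamps, and the threshold
and leak interval are all assumptions, not anything the driver
implements today. Each error adds one to the bucket, and one error
drains out per leak interval, so only a burst of errors within a short
window marks the line as bad.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cacheline tracker; all names are illustrative. */
struct leaky_bucket {
	uint64_t last_ns;   /* time up to which leaking has been accounted */
	uint32_t level;     /* errors currently held in the bucket */
	uint32_t threshold; /* level at which the line is considered bad */
	uint64_t leak_ns;   /* one error drains out per this interval */
};

/*
 * Record one corrected-error event at time now_ns (assumed monotonic).
 * Returns true once the bucket has filled up to the threshold.
 */
static bool bucket_event(struct leaky_bucket *b, uint64_t now_ns)
{
	uint64_t leaked = 0;

	/* A zero interval is treated as "never leak" to avoid dividing by 0. */
	if (b->leak_ns != 0)
		leaked = (now_ns - b->last_ns) / b->leak_ns;

	if (leaked >= b->level) {
		/* Everything drained; start a fresh window at this event. */
		b->level = 0;
		b->last_ns = now_ns;
	} else if (leaked > 0) {
		/* Drain whole intervals only, keeping the partial remainder. */
		b->level -= (uint32_t)leaked;
		b->last_ns += leaked * b->leak_ns;
	}

	b->level++;
	return b->level >= b->threshold;
}
```

With a threshold of 3 and a leak interval of 1000 ns, three errors a
few nanoseconds apart trip the bucket, while three errors spaced 5000
ns apart never do. Userspace could still pick both values.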

> 
> > 2. send a uevent
> > 2.5 reset the GPU (docs tell us to)
> > <On module load>
> 
> Persistent lists on disk usually suffer from all kinds of problems,
> e.g. you need to detect when the board or CPU has changed.
> Also when the problem is temporary you do not really want
> to save such information permanently.
> 
> Usually it's better to rediscover such state each time and handle
> it again. Then you also don't need the uevent or complicated
> user interfaces.

It seems nice to have the information stored in non-volatile form. It
doesn't have to be used by the user, but assuming they want to enable the
option to actually detect these events, it's probably also beneficial to
supply the known bad cachelines at load time, since detecting one requires
a GPU reset. The reset both takes time and may do more damage (that is
based on past experience/products only, and I hope IVB can magnificently
recover from our bad GPU programming).
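
To make the trade-off concrete, here is a rough sketch of restoring a
saved bad-cacheline list at load time while guarding against the
"board or CPU has changed" problem raised above. Everything in it is
hypothetical: the record layout, the hw_id field (some identifier of
the hardware the list was recorded on), and the fixed list size are
assumptions for illustration, not an existing interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BAD_LINES 16 /* arbitrary limit for this sketch */

/* Hypothetical saved record; the layout is an assumption. */
struct bad_line_record {
	uint64_t hw_id;                /* hardware the list was recorded on */
	uint32_t count;                /* number of valid entries in lines[] */
	uint32_t lines[MAX_BAD_LINES]; /* known bad cacheline indices */
};

/*
 * Copy the saved bad lines into out[] only if they were recorded on the
 * hardware we are running on now. Returns the number of entries copied;
 * 0 means "nothing usable, rediscover from scratch".
 */
static size_t restore_bad_lines(const struct bad_line_record *saved,
				uint64_t current_hw_id,
				uint32_t *out, size_t out_len)
{
	size_t n;

	/* A stale list from different hardware is worse than no list. */
	if (saved == NULL || out == NULL || saved->hw_id != current_hw_id)
		return 0;

	/* Clamp against both a corrupt count and the caller's buffer. */
	n = saved->count;
	if (n > MAX_BAD_LINES)
		n = MAX_BAD_LINES;
	if (n > out_len)
		n = out_len;
	if (n == 0)
		return 0;

	memcpy(out, saved->lines, n * sizeof(out[0]));
	return n;
}
```

A mismatch simply falls back to rediscovering the state, so the saved
list only ever acts as a hint that avoids a reset.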

> 
> > Any feedback is highly appreciated. I couldn't really find much
> > precedent for doing this in other drivers, so pointers to similar
> > things would also be highly welcome.
> 
> http://mcelog.org
> 
> -Andi

Thanks.