[Intel-gfx] [RFC] algorithm for handling bad cachelines

Ben Widawsky ben at bwidawsk.net
Tue Mar 27 17:09:31 CEST 2012

On Tue, 27 Mar 2012 16:50:39 +0200
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Tue, Mar 27, 2012 at 07:19:43AM -0700, Ben Widawsky wrote:
> > I wanted to run this by folks before I start doing any actual work.
> > 
> > This is primarily for GPGPU, or perhaps *really* accurate rendering
> > requirements.
> > 
> > IVB+ has an interrupt to tell us when a cacheline seems to be going
> > bad. There is also a mechanism to remap the bad cachelines. The
> > implementation details aren't quite clear to me yet, but I'd like to
> > enable this feature for userspace.
> > 
> > Here is my current plan, but it involves filesystem access, so it's
> > probably going to get a lot of flames.
> > 
> > 1. Handle cache line going bad interrupt.
> > <After n of these interrupts to the same line,>
> > 2. send a uevent
> > 2.5 reset the GPU (docs tell us to)
> > <On module load>
> > 3. Read a module parameter containing the filesystem path of the
> > list of bad lines. It's not clear to me yet exactly what I need to
> > store, but it should be a relatively simple list.
> .... path in filesystem is no-go for kernel interface. So bad
> cachelines need to go into the module parameter itself. Or we add a
> sysfs interface and reset the gpu (because if my understanding is
> right, we can't disable cachelines once the gpu has used them).

I think we have to assume the list could get quite long. So long, in
fact, that I imagine the user may often want to reset it and try
his/her luck again with some lines.

Could you elaborate more on why it's a no-go? The module parameter
setting itself is limited to root. I was trying to clearly understand
exactly why this can't be done, and some of the lore behind why file
access in the kernel is such a bad thing (assuming the files being
accessed are set at module load time). I wouldn't want to go the route
of loading an arbitrary path, which seems like a terrible idea, though
it works for firmware blobs, and I half thought we could load this
like a firmware blob.

Anyway, assuming a gpu reset is sufficient to remap (docs only clearly
state reset works for disabling, iirc), then I would like to do that.
What is the appropriate interface for that? The dev node? Sysfs?

> > 4. Parse list on driver load, and handle as necessary.
> > 5. goto 1.
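
For illustration only, the "after n interrupts to the same line" bookkeeping
in steps 1 and 2 could look roughly like the userspace C sketch below. The
names (error_table, record_line_error), the table size, and the threshold of
3 are all assumptions made for the example, not anything from the hardware
docs or the driver. The caller would send the uevent and reset the GPU when
the function returns true.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED_LINES 64 /* assumed table size, purely illustrative */
#define ERROR_THRESHOLD   3  /* the "n" from step 1; assumed value */

/* One entry per cacheline that has reported at least one error. */
struct line_error {
	uint32_t line;
	unsigned int count;
};

struct error_table {
	struct line_error entries[MAX_TRACKED_LINES];
	size_t used;
};

/*
 * Record one error interrupt for 'line'. Returns true exactly once per
 * line: on the call that brings its count up to ERROR_THRESHOLD.
 */
static bool record_line_error(struct error_table *t, uint32_t line)
{
	size_t i;

	for (i = 0; i < t->used; i++) {
		if (t->entries[i].line == line) {
			t->entries[i].count++;
			return t->entries[i].count == ERROR_THRESHOLD;
		}
	}

	if (t->used == MAX_TRACKED_LINES)
		return false; /* table full: this sketch just drops the event */

	t->entries[t->used].line = line;
	t->entries[t->used].count = 1;
	t->used++;
	return ERROR_THRESHOLD == 1;
}
```

A real interrupt handler would also need locking and a policy for a full
table; both are left out here.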
> > 
> > Probably the biggest unanswered question is exactly when during HW
> > loading we have to finish remapping. If it can happen at any time
> > while the card is running, I don't need the filesystem stuff, but I
> > believe I need to remap the lines quite early in the device
> > bootstrap.
> I believe so, too ;-)
> > The only alternative I have is a huge comma separated string for a
> > module parameter, but I kind of like reading the file better.
> Well, you can't read a file from the kernel because we might init the
> driver without any userspace present (when the driver is built-in).

Userspace should still be present in this case, right? The kernel
command line should suffice, I think.
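
To make the comma-separated-string option concrete, here is a minimal
userspace C sketch of parsing such a list into an array of line indices. The
function name parse_bad_lines and the "12,40,1023" format are assumptions for
illustration; what actually needs to be stored per bad line is still unclear.
In the kernel itself, module_param_array() may already cover a list like this
without hand-written parsing.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a string such as "12,40,1023" into out[0..max-1].
 * Returns the number of entries parsed (0 for an empty string), or -1 if
 * the input is malformed, a value does not fit in 32 bits, or there are
 * more than 'max' entries.
 */
static int parse_bad_lines(const char *s, uint32_t *out, size_t max)
{
	size_t n = 0;

	while (*s != '\0') {
		char *end;
		unsigned long v;

		/* Require a digit so signs, spaces and empty fields fail. */
		if (*s < '0' || *s > '9')
			return -1;

		errno = 0;
		v = strtoul(s, &end, 10);
		if (errno != 0 || v > UINT32_MAX)
			return -1;
		if (n == max)
			return -1;
		out[n++] = (uint32_t)v;

		if (*end == ',') {
			s = end + 1;
			if (*s == '\0')
				return -1; /* trailing comma */
		} else if (*end == '\0') {
			s = end;
		} else {
			return -1; /* junk after the number */
		}
	}

	return (int)n;
}
```

For example, parse_bad_lines("12,40,1023", lines, 4) fills lines[] with 12,
40 and 1023 and returns 3, while "1,,2" or "1,2," is rejected with -1.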

> > Any feedback is highly appreciated. I couldn't really find much
> > precedent for doing this in other drivers, so pointers to similar
> > things would also be highly welcome.
> -Daniel


