[Intel-gfx] [RFC] algorithm for handling bad cachelines

Daniel Vetter daniel at ffwll.ch
Tue Mar 27 17:33:01 CEST 2012


On Tue, Mar 27, 2012 at 08:09:31AM -0700, Ben Widawsky wrote:
> On Tue, 27 Mar 2012 16:50:39 +0200
> Daniel Vetter <daniel at ffwll.ch> wrote:
> 
> > On Tue, Mar 27, 2012 at 07:19:43AM -0700, Ben Widawsky wrote:
> > > I wanted to run this by folks before I start doing any actual work.
> > > 
> > > This is primarily for GPGPU, or perhaps *really* accurate rendering
> > > requirements.
> > > 
> > > IVB+ has an interrupt to tell us when a cacheline seems to be going
> > > bad. There is also a mechanism to remap the bad cachelines. The
> > > implementation details aren't quite clear to me yet, but I'd like to
> > > enable this feature for userspace.
> > > 
> > > Here is my current plan, but it involves filesystem access, so it's
> > > probably going to get a lot of flames.
> > > 
> > > 1. Handle cache line going bad interrupt.
> > > <After n number of these interrupts to the same line,>
> > > 2. send a uevent
> > > 2.5 reset the GPU (docs tell us to)
> > > <On module load>
> > > 3. Read a module parameter giving the filesystem path of a file
> > > that lists the bad lines. It's not clear to me yet exactly what I
> > > need to store, but it should be a relatively simple list.
> > 
> > .... path in filesystem is no-go for kernel interface. So bad
> > cachelines need to go into the module parameter itself. Or we add a
> > sysfs interface and reset the gpu (because if my understanding is
> > right, we can't disable cachelines once the gpu has used them).
> 
> I think we have to assume the list could get quite long. So long in
> fact, I imagine the user may often want to reset it and try his/her
> luck again with some lines.
> 
> Could you elaborate more on why it's a no-go? The module parameter
> setting itself is limited to root. I was trying to clearly understand
> exactly why this can't be done, and some of the lore behind why file
> access in the kernel is such a bad thing (assuming the files being
> accessed are set at module load time). I wouldn't want to go the route
> of loading an arbitrary path - which seems like a terrible idea;
> though it works for firmware blobs, and I half thought we could load
> this like a firmware blob.
> 
> Anyway, assuming a gpu reset is sufficient to remap (docs only clearly
> state reset works for disabling, iirc) then I would like to do that.
> What is the appropriate interface for that? The dev node? Sysfs?

I personally prefer sysfs for this, although you might have some issues
with the one-value-per-file rule ... I guess a list of hex values is ok
though.
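
A minimal userspace sketch of parsing such a list of hex values, roughly
as it might arrive from a write to a sysfs file. The function name
parse_hex_list and the whitespace-separated format are assumptions made
for illustration, and kernel code would use the kernel's own string
helpers rather than strtoul():

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a whitespace-separated list of hex values
 * such as "0x1f 0x20 3a\n" into out[].
 * Returns the number of values stored, or -1 on malformed input or if
 * the list holds more than max values.
 */
static int parse_hex_list(const char *buf, unsigned long *out, int max)
{
	int n = 0;

	for (;;) {
		char *end;

		/* Skip separators between values. */
		while (isspace((unsigned char)*buf))
			buf++;
		if (*buf == '\0')
			return n;
		if (n == max)
			return -1;
		/* strtoul() would silently accept a sign; reject it here. */
		if (*buf == '-' || *buf == '+')
			return -1;

		errno = 0;
		out[n] = strtoul(buf, &end, 16);
		if (end == buf || errno == ERANGE)
			return -1;
		/* A value must be followed by whitespace or end of string. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			return -1;
		n++;
		buf = end;
	}
}
```

Parsing "0x1f 0x20 3a\n" with this stores 0x1f, 0x20 and 0x3a and
returns 3; anything that is not a hex value makes the whole write fail.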

> > > 4. Parse list on driver load, and handle as necessary.
> > > 5. goto 1.
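
The counting part of steps 1 and 2 above could be sketched roughly as
follows. This is plain C, not driver code: the table size, the threshold
value for "n" and all names are hypothetical, and the uevent and GPU
reset are left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED   64	/* hypothetical table size */
#define ERR_THRESHOLD 3		/* hypothetical value of "n" */

struct line_err {
	unsigned long line;	/* identifier of the failing cacheline */
	unsigned int count;	/* interrupts seen for this line so far */
};

struct err_table {
	struct line_err e[MAX_TRACKED];
	size_t used;
};

/*
 * Record one "cacheline going bad" interrupt for the given line.
 * Returns true exactly once per line, when its count reaches
 * ERR_THRESHOLD; the caller would then send the uevent and reset.
 */
static bool record_line_error(struct err_table *t, unsigned long line)
{
	size_t i;

	for (i = 0; i < t->used; i++)
		if (t->e[i].line == line)
			return ++t->e[i].count == ERR_THRESHOLD;

	if (t->used == MAX_TRACKED)
		return false;	/* table full: this sketch drops the event */

	t->e[t->used].line = line;
	t->e[t->used].count = 1;
	t->used++;
	return ERR_THRESHOLD == 1;
}
```

A linear table is only reasonable if the number of failing lines stays
small; if the list can get long, a different structure would be needed.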
> > > 
> > > Probably the biggest unanswered question is exactly when in the HW
> > > loading do we have to finish remapping. If it can happen at any time
> > > while the card is running, I don't need the filesystem stuff, but I
> > > believe I need to remap the lines quite early in the device
> > > bootstrap.
> > 
> > I believe so, too ;-)
> > 
> > > The only alternative I have is a huge comma separated string for a
> > > module parameter, but I kind of like reading the file better.
> > 
> > Well, you can't read a file from the kernel because we might init the
> > driver without any userspace present (when the driver is built-in).
> 
> Userspace should still be present in this case, right? The kernel
> command line should suffice, I think.

At some point later on, but only after the hw is initialized. But if you're
going the runtime interface route anyway, it doesn't matter.
-Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48


