i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

Fri Mar 16 09:11:54 PDT 2012

Guys,
 I don't know if these kinds of reports have been forwarded to you, but
there have apparently been several like this going around, with the
finger pointing at the i915 driver apparently clearing random memory.
Often the end result seems to be list corruption or a NULL pointer
dereference in the filesystem layer.

                         Linus

---------- Forwarded message ----------
From: Jiri Kosina <jkosina at suse.cz>
Date: Fri, Mar 16, 2012 at 8:25 AM
Subject: Re: Oops in ext3_block_to_path.isra.40+0x26/0x11b
To: George Spelvin <linux at horizon.com>
Cc: jack at suse.cz, linux-ext4 at vger.kernel.org,
linux-kernel at vger.kernel.org, torvalds at linux-foundation.org

On Fri, 16 Mar 2012, George Spelvin wrote:

> > I am not aware of anything, but I have a question -- George, did the
> > machine get suspended/resumed before this happened?
>
> I'm pretty sure it didn't.  It's a desktop machine, and I don't ever
> suspend it.  It had been up for a while, and /sys/power/state exists, and
> *maybe* I played with it sometime since boot, but the backup runs nightly
> and I *definitely* did not suspend it the day (or even several days)
> before the oops.
>
> (I tried to preserve the machine state, but processes started getting
> stuck in the kernel a few hours after the report, so I had to reboot it.)
>
> Jan Kara asked:
> >   And by any chance, do you use i915 driver? Because that one seems to
> > cause corruption - see: https://lkml.org/lkml/2012/3/9/217. I believe
> > Jiri's corruption is likely caused by that...
>
> Yes!  lspci -nn and abbreviated .config attached.  But, as mentioned,
> the machine hasn't been suspended...

So it might be the culprit. As the cause of the corruption is not yet
understood, it might be that a suspend/resume cycle is not a necessary
prerequisite for this to trigger; it might just make it more likely.

And the corruption has indeed been observed to be several writes of
0x00000000, so it could easily lead to NULL pointer dereferences all over
the place.

Are you able to reproduce the problem if you turn kernel modesetting off?
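
For anyone running that test, it may help to confirm that the boot really
had modesetting disabled before drawing conclusions. The following is only
a rough sketch, not something from the original report: it assumes KMS was
turned off on the kernel command line with either the generic "nomodeset"
parameter or the i915-specific "i915.modeset=0", and the helper name
kms_disabled is made up for illustration.

```python
def kms_disabled(cmdline: str) -> bool:
    """Return True if the given kernel command line asks for KMS to be off.

    Assumption: only the two commonly used switches are checked, the generic
    "nomodeset" and the i915 module parameter "i915.modeset=0".
    """
    params = cmdline.split()
    return "nomodeset" in params or "i915.modeset=0" in params


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print("KMS disabled on command line:", kms_disabled(f.read()))
```

This only inspects the boot command line; a module option set elsewhere
(for example in a modprobe configuration file) would not show up here.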

--
Jiri Kosina
SUSE Labs