[Intel-gfx] Panic after S3 resume and modeset with MST

Thu Mar 30 20:01:31 UTC 2017

On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
<snip>
> 
> Sure, if we get a proper stack dump, we can analyze it somehow.  You
> can use addr2line, or even check objdump output manually.
> But in this case, as already mentioned, it was impossible to get any
> sensible stack trace on my machine with 4.11-rc, so far,
> unfortunately.  So no material to read.

Huh? I thought that was what the file called "screenshot showing kernel
panic trace" on the bugzilla was (although that backtrace definitely
didn't look too relevant)... Anyway, if you're having trouble getting
even a stack trace, one of my coworkers here has taught me a trick
called divide and conquer.

The idea is pretty simple. Let's say we have a block of code like this
in the kernel:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	death();
	baz();
	zab();
}

And you know it's crashing inside this function on resume (e.g. it
could be in foo(), bar(), or that suspicious death() function), but you
have no way of getting a backtrace.

This is where the trick comes in: while you might not be able to get a
stack trace, you can probably at least tell the difference between the
machine rebooting immediately as a result of calling
emergency_restart() and it just hanging due to the bug.

So what you do is kind of like bisecting, except instead of testing
different commits you see what happens when you insert a call to
emergency_restart() and move it around:

- Try #1:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	emergency_restart();
	bar();
	death();
	baz();
	zab();
}

The machine immediately reboots, so everything up through foo() ran
fine and the problem is somewhere after the point where we inserted the
emergency_restart() call.

- Try #2:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	death();
	emergency_restart();
	baz();
	zab();
}

The machine hangs, so we know the problem's either in the call to bar()
or death().

- Try #3:

void some_resume_func() {
	cool_function_call();
	this_is_neat_too();

	foo();
	bar();
	emergency_restart();
	death();
	baz();
	zab();
}

The machine reboots immediately this time, so bar() is fine too, which
means that the problem has to be occurring inside the suspicious
death() function. Of course, if we want to keep debugging further we
can go into the death() function itself and try the same thing to
figure out which line inside it is causing the issue.
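
If it helps to see the search itself written down, here's a rough
userspace sketch of it. Nothing in it comes from the kernel, and every
name is made up for illustration. Each probe stands for one
rebuild-and-reproduce cycle with emergency_restart() placed right before
step pos, and simulated_probe() just fakes the machine's answer. It
assumes the crash is deterministic and sits somewhere in the range being
searched:

```c
#include <stddef.h>

/*
 * One probe = one test cycle with emergency_restart() inserted right
 * before step `pos`. Returns 1 if the machine rebooted (the restart was
 * reached), 0 if it hung (some step before `pos` crashed first).
 */
typedef int (*probe_fn)(size_t pos, void *ctx);

/* Hypothetical stand-in for a real machine: ctx points at the index of
 * the step that crashes. */
static int simulated_probe(size_t pos, void *ctx)
{
	size_t crash_at = *(const size_t *)ctx;

	/* The restart before `pos` is reached only if steps 0..pos-1 survive. */
	return pos <= crash_at;
}

/* Returns the index of the crashing step among steps [0, n). */
static size_t find_crashing_step(size_t n, probe_fn probe, void *ctx)
{
	size_t lo = 0; /* a restart placed before step lo is reached */
	size_t hi = n; /* a restart placed before step hi is never reached */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (probe(mid, ctx))
			lo = mid; /* rebooted: everything before mid is fine */
		else
			hi = mid; /* hung: the crash is in a step before mid */
	}
	return lo;
}
```

Numbering the five calls from the example foo() = 0 through zab() = 4
and putting the crash in death() (index 2), find_crashing_step(5,
simulated_probe, &crash) returns 2 after two probes. On real hardware
you are the probe function, of course.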

So do the same thing, except around wherever it looks like this crash
might be happening. From:

https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5

It sounds like this happens on hotplugging, so the place to start this
would probably be i915_hotplug_work_func(). Keep going down the call
stack there and you should eventually find the culprit.

The only complication I foresee here is that you'll have to write a
little bit of additional debugging code so that
i915_hotplug_work_func() doesn't actually call emergency_restart()
until right before the moment where the crash happens. This shouldn't
be too difficult: you could, for example, add a module parameter to
i915 that enables the calls to emergency_restart(), and flip it right
before the final step of reproducing the bug. If you have any trouble
with this part, feel free to let me know and I'll hack together a quick
patch you can use.
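
For what it's worth, the module parameter part could look roughly like
the following. This is an untested sketch, not a real i915 patch:
debug_restart, debug_restart_armed and debug_restart_checkpoint() are
names I made up, and the #else branch only holds userspace stand-ins so
the gating logic can be compiled and poked at outside the kernel.
module_param_named(), MODULE_PARM_DESC() and emergency_restart() are the
regular kernel interfaces:

```c
#ifdef __KERNEL__
#include <linux/moduleparam.h>
#include <linux/reboot.h>
#else
/* Userspace stand-ins (assumption: only for trying the logic out). */
#include <stdbool.h>
#define module_param_named(name, var, type, perm) struct userspace_stub
#define MODULE_PARM_DESC(name, desc) struct userspace_stub
static int restart_calls;
static void emergency_restart(void) { restart_calls++; }
#endif

/* Hypothetical knob: off by default, so the checkpoints do nothing
 * until it is switched on at runtime. */
static bool debug_restart_armed;
module_param_named(debug_restart, debug_restart_armed, bool, 0644);
MODULE_PARM_DESC(debug_restart,
		 "Reboot at debug checkpoints (hypothetical debugging aid)");

/* Call this wherever you would otherwise put a bare emergency_restart(). */
static void debug_restart_checkpoint(void)
{
	if (debug_restart_armed)
		emergency_restart();
}
```

Assuming this ends up in a file that is built into the i915 module, the
knob should show up as /sys/module/i915/parameters/debug_restart, so you
would get everything set up, write 1 to it, and only then do the final
hotplug step.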

Lemme know if this helps at all :).

> 
> That is, the problem isn't how to translate it, but how to get it.
> Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
> give any output since kdump already failed...
> 
> 
> thanks,
> 
> Takashi