[Intel-gfx] Panic after S3 resume and modeset with MST

Thu Mar 30 20:27:12 UTC 2017

On Thu, 30 Mar 2017 22:01:31 +0200,
Lyude Paul wrote:
> 
> On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
> <snip>
> > 
> > Sure, if we get a proper stack dump, we can analyze it somehow.  You
> > can use addr2line, or even check objdump output manually.
> > But in this case, as already mentioned, it was impossible to get any
> > sensible stack trace on my machine with 4.11-rc, so far,
> > unfortunately.  So no material to read.
> 
> huh? I thought that was what the file called "screenshot showing kernel
> panic trace" on the bugzilla was (although that backtrace definitely
> didn't look too relevant)...

It's not from my machine, and it's not from 4.11-rc.  It's a
screenshot taken on a 4.4.x openSUSE kernel with the backport of your
fix.  So it might be of some help, but the stack trace there is merely
a red herring.

The reason I couldn't get such a screenshot is that VT switching is
broken on 4.11 in multiple ways.  One VT bug got fixed in 4.11-rc4,
but another still remains....

> anyway if you are having trouble getting
> just a stack trace though, one of my coworkers here has taught me a
> trick called divide and conquer.
> 
> The idea is pretty simple. Let's say we have a block of code like this
> in the kernel
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	death();
> 	baz();
> 	zab();
> }
> 
> And you know it's crashing inside this function on resume (e.g. it
> could be in foo(), bar(), or that suspicious death() function) but you
> have no way of getting a back trace.
> 
> This is where the trick comes in: while you might not be able to get a
> stack trace, you can probably at least tell the difference between when
> the machine reboots immediately as a result of calling
> emergency_restart(), and whether it's just hanging due to the bug.
> 
> So what you do is kind of like bisecting, except instead of testing
> different commits you see what happens when you insert a call to
> emergency_restart() and move it around:
> 
> - Try #1:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	emergency_restart();
> 	bar();
> 	death();
> 	baz();
> 	zab();
> }
> 
> The machine immediately reboots, so the problem is below where we
> inserted the emergency_restart() call.
> 
> - Try #2:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	death();
> 	emergency_restart();
> 	baz();
> 	zab();
> }
> 
> The machine hangs, so we know the problem's either in the call to bar()
> or death().
> 
> - Try #3:
> 
> void some_resume_func() {
> 	cool_function_call();
> 	this_is_neat_too();
> 
> 	foo();
> 	bar();
> 	emergency_restart();
> 	death();
> 	baz();
> 	zab();
> }
> 
> The machine reboots immediately this time, which means that the problem
> has to be occurring inside the suspicious death() function. Of course,
> if we want to keep debugging further we can go into the death()
> function itself and try the same thing to figure out which line inside
> it is causing the issue.

Heh, divide-and-conquer is also the strategy by which I arrived at my
patch :)  I isolated the possible cause (the call to
intel_dp_mst_resume()), split it up, and luckily it worked on the
first shot.
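
For illustration, the search over probe positions can be modeled in
userspace C.  This is only a sketch under assumptions: find_bad_step(),
probe_fn and simulated_probe() are hypothetical names, the hang is
assumed to be reproducible, and in practice each "probe" is a rebuild
plus a reproduction run with someone watching whether the box reboots
or hangs.

```c
#include <stddef.h>

/*
 * Hypothetical probe: "run the first `pos` steps, then call
 * emergency_restart()".  Returns 1 if the machine rebooted at once
 * (those steps are fine), 0 if it hung before reaching the call.
 */
typedef int (*probe_fn)(size_t pos);

/*
 * Return the 0-based index of the step that hangs.  Assumes the hang is
 * reproducible and that the unmodified function (all nsteps steps) hangs.
 */
size_t find_bad_step(size_t nsteps, probe_fn probe)
{
	size_t lo = 0;       /* restart before any step: always reboots */
	size_t hi = nsteps;  /* restart after every step: never reached */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (probe(mid))
			lo = mid;  /* the first mid steps are innocent */
		else
			hi = mid;  /* the culprit is among the first mid steps */
	}
	return lo;
}

/* Simulated machine: step 2 of foo, bar, death, baz, zab hangs. */
size_t simulated_bad_step = 2;

int simulated_probe(size_t pos)
{
	return pos <= simulated_bad_step;
}
```

With the five calls from the example numbered 0 to 4,
find_bad_step(5, simulated_probe) returns 2, i.e. death().  Halving the
range needs roughly log2(n) test runs instead of one per line.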

> So you would do the same thing, except around wherever it looks like
> this crash might be happening. From:
> 
> https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5
> 
> It sounds like this happens on hotplugging, so the place to start this
> would probably be i915_hotplug_work_func(). Keep going down the call
> stack there and you should eventually find the culprit.
> 
> The only complication I foresee here is that you'll have to write a
> little bit of additional debugging code so that
> i915_hotplug_work_func() doesn't actually call emergency_restart()
> until right before the moment where the crash happens. This shouldn't
> be too difficult, you could do something like add a module parameter to
> i915 that you change right before the final step of reproducing the bug
> that enables the calls to emergency_restart(). If you have any trouble
> with this part, feel free to let me know and I'll hack together a quick
> patch you can use.

Right, that's the most difficult part: to reproduce the crash, we
need multiple suspend/resume and dock/undock cycles, so the code path
may be executed multiple times.  And tracking i915_hotplug_work_func()
in the way you suggested isn't so trivial, as it's full of indirect
calls...
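
Just to make the gating idea concrete, here is a rough userspace model
of a probe point that stays inert until it is armed.  All names are
made up for illustration; i915 has no such option.  In a kernel build
the flag would presumably be a writable module parameter and the
stand-in function would be the real emergency_restart().

```c
#include <stdbool.h>

/*
 * In the kernel this could be something like
 *   static bool debug_restart_armed;
 *   module_param(debug_restart_armed, bool, 0644);
 * The parameter name is hypothetical.
 */
bool debug_restart_armed;

/* Counts how often the stand-in restart fired (model only). */
unsigned int restart_calls;

/* Stand-in for emergency_restart() so that the sketch builds in userspace. */
static void fake_emergency_restart(void)
{
	restart_calls++;
}

/* Place this at the probe position; it does nothing until armed. */
void debug_restart_point(void)
{
	if (debug_restart_armed)
		fake_emergency_restart();
}
```

The code path can then run harmlessly through all the preparatory
suspend/resume and dock/undock cycles, and the flag gets flipped only
right before the final step.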

A trick I often used instead is to put additional (very long) delays
between the suspected code lines, mark each point via trace_ or normal
printk, and track which point we managed to reach.  Then you don't
need frequent reboots but just a few long runs.  Of course, it can't
be used in irq context, but for the work, it's OK.
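
A minimal userspace model of that marker-plus-delay idea, with
hypothetical names (debug_marker(), debug_delay_secs, last_marker); in
the kernel the print would be a printk() or a trace call and the sleep
something like msleep(), which is why it only fits process context:

```c
#include <stdio.h>
#include <unistd.h>

/* Delay after each marker; set it very long (e.g. 30) when hunting. */
unsigned int debug_delay_secs;

/* Last marker that was reached (model only, for inspection). */
int last_marker = -1;

/* Announce that we are about to run `what`, then wait. */
void debug_marker(int id, const char *what)
{
	last_marker = id;
	fprintf(stderr, "marker %d: before %s\n", id, what);
	fflush(stderr);  /* make sure the line is out before we sleep */
	if (debug_delay_secs)
		sleep(debug_delay_secs);
}
```

Calling debug_marker(1, "foo"), debug_marker(2, "bar") and so on before
each suspected line means the last marker that shows up before the
crash names the failing step, and the long pauses leave time to read
it.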

Maybe I'll give it a try, but likely later next week; I'll be very
busy with other tasks tomorrow, sorry.

thanks,

Takashi

> 
> Lemme know if this helps at all :).
> 
> > 
> > That is, the problem isn't how to translate it, but how to get it.
> > Normal ways didn't work.  Maybe I can try AMT, but I doubt that it'll
> > give any output since kdump already failed...
> > 
> > 
> > thanks,
> > 
> > Takashi
>