[Intel-gfx] Panic after S3 resume and modeset with MST
Takashi Iwai
tiwai at suse.de
Thu Mar 30 20:27:12 UTC 2017
On Thu, 30 Mar 2017 22:01:31 +0200,
Lyude Paul wrote:
>
> On Thu, 2017-03-30 at 20:50 +0200, Takashi Iwai wrote:
> <snip>
> >
> > Sure, if we get a proper stack dump, we can analyze it somehow. You
> > can use addr2line, or even check objdump output manually.
> > But in this case, as already mentioned, I have so far been unable to
> > get any sensible stack trace on my machine with 4.11-rc,
> > unfortunately. So there is no material to read.
>
> huh? I thought that was what the file called "screenshot showing kernel
> panic trace" on the bugzilla was (although that backtrace definitely
> didn't look too relevant)...
It's not from my machine, and it's not from 4.11-rc. It's a
screenshot taken on a 4.4.x openSUSE kernel with the backport of your
fix. So it might be of some help, but the stack trace there is merely
a red herring.
The reason I couldn't get such a screenshot is that VT switching is
broken on 4.11 in multiple ways. One VT bug got fixed in 4.11-rc4,
but another still remains....
> Anyway, if you are having trouble just getting a stack trace, one of
> my coworkers here has taught me a trick called divide and conquer.
>
> The idea is pretty simple. Let's say we have a block of code like this
> in the kernel
>
> void some_resume_func() {
> cool_function_call();
> this_is_neat_too();
>
> foo();
> bar();
> death();
> baz();
> zab();
> }
>
> And you know it's crashing inside this function on resume (e.g. it
> could be in foo(), bar(), or that suspicious death() function) but you
> have no way of getting a back trace.
>
> This is where the trick comes in: while you might not be able to get a
> stack trace, you can probably at least tell the difference between when
> the machine reboots immediately as a result of calling
> emergency_restart(), and whether it's just hanging due to the bug.
>
> So what you do is kind of like bisecting, except instead of testing
> different commits you see what happens when you insert a call to
> emergency_restart() and move it around:
>
> - Try #1:
>
> void some_resume_func() {
> cool_function_call();
> this_is_neat_too();
>
> foo();
> emergency_restart();
> bar();
> death();
> baz();
> zab();
> }
>
> The machine immediately reboots, so the problem is after the point
> where we inserted the emergency_restart() call.
>
> - Try #2:
>
> void some_resume_func() {
> cool_function_call();
> this_is_neat_too();
>
> foo();
> bar();
> death();
> emergency_restart();
> baz();
> zab();
> }
>
> The machine hangs, so we know the problem's either in the call to bar()
> or death().
>
> - Try #3:
>
> void some_resume_func() {
> cool_function_call();
> this_is_neat_too();
>
> foo();
> bar();
> emergency_restart();
> death();
> baz();
> zab();
> }
>
> The machine reboots immediately this time, which means that the problem
> has to be occurring inside the suspicious death() function. Of course,
> if we want to keep debugging further we can go into the death()
> function itself and try the same thing to figure out which line inside
> it is causing the issue.
Heh, divide-and-conquer is also how I arrived at my patch :) I took
the possible cause (the call to intel_dp_mst_resume()), split it up,
and luckily it worked on the first try.
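
As a rough illustration only (nothing here comes from the actual
patches in this thread), the marker-moving procedure quoted above
amounts to a binary search over marker positions. Below is a minimal
userspace sketch in C, assuming exactly one step hangs and the failure
is deterministic. The names find_hanging_step(), rebooted_fn and
simulated_rebooted() are hypothetical; the oracle stands in for a human
rebuilding the kernel and watching whether the machine reboots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical oracle: returns true if the machine rebooted immediately
 * when the emergency_restart() marker was placed right before step `pos`
 * (so steps 0..pos-1 all completed), false if it hung instead.
 */
typedef bool (*rebooted_fn)(size_t pos, void *ctx);

/*
 * Binary search for the index of the single hanging step among `nsteps`
 * steps. Invariant: the hanging step lies in the range [lo, hi).
 */
size_t find_hanging_step(size_t nsteps, rebooted_fn rebooted, void *ctx)
{
    size_t lo = 0;
    size_t hi = nsteps;

    assert(nsteps > 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (rebooted(mid, ctx))
            lo = mid;   /* steps before mid are fine */
        else
            hi = mid;   /* the hang is somewhere before mid */
    }
    return lo;
}

/* Simulated oracle for trying the sketch: ctx points at the culprit index. */
bool simulated_rebooted(size_t pos, void *ctx)
{
    size_t culprit = *(size_t *)ctx;

    /* The marker is only reached if the culprit has not run yet. */
    return culprit >= pos;
}
```

With the five calls foo(), bar(), death(), baz(), zab() numbered 0 to 4
and death() as the culprit, the search needs two marker placements,
against three in the hand-picked sequence quoted above.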
> So you'd do the same thing, except around wherever it looks like
> this crash might be happening. From:
>
> https://bugzilla.suse.com/show_bug.cgi?id=1029634#c5
>
> It sounds like this happens on hotplugging, so the place to start this
> would probably be i915_hotplug_work_func(). Keep going down the call
> stack there and you should eventually find the culprit.
>
> The only complication I foresee here is that you'll have to write a
> little bit of additional debugging code so that
> i915_hotplug_work_func() doesn't actually call emergency_restart()
> until right before the moment where the crash happens. This shouldn't
> be too difficult, you could do something like add a module parameter to
> i915 that you change right before the final step of reproducing the bug
> that enables the calls to emergency_restart(). If you have any trouble
> with this part, feel free to let me know and I'll hack together a quick
> patch you can use.
Right, that's the most difficult part; to reproduce the crash, we
need multiple suspend/resume and dock/undock cycles, so the code path
may be executed multiple times. And tracking i915_hotplug_work_func()
in the way you suggested isn't trivial, as it's full of indirect
calls...
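
For reference, the "arm it only before the final step" idea quoted
above could look roughly like the following. This is a userspace
sketch in C with hypothetical names throughout (restart_armed,
restart_checkpoint, debug_checkpoint(), hypothetical_work_func()). In
a real i915 debugging patch the flag would presumably be exposed via
module_param() and the stub replaced by emergency_restart(); neither
is done here, so that the sketch can be run anywhere.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for module parameters that would be flipped at runtime. */
bool restart_armed = false;     /* stays false during the warm-up cycles */
int restart_checkpoint = -1;    /* which numbered checkpoint should fire */

/* Records where the restart would have fired; -1 means it never did. */
int restart_fired_at = -1;

/* Stub used instead of emergency_restart() so the sketch returns. */
void fake_emergency_restart(int id)
{
    restart_fired_at = id;
}

/* Place one of these between the suspected calls. */
void debug_checkpoint(int id)
{
    assert(id >= 0);
    if (restart_armed && id == restart_checkpoint)
        fake_emergency_restart(id);
}

/* Hypothetical code path that runs on every hotplug event. */
void hypothetical_work_func(void)
{
    debug_checkpoint(1);
    /* first suspected call would go here */
    debug_checkpoint(2);
    /* second suspected call would go here */
    debug_checkpoint(3);
}
```

The path can then run harmlessly through any number of suspend/resume
and dock/undock cycles, and only the run after arming triggers the
restart. Moving the marker becomes a matter of changing
restart_checkpoint rather than rebuilding.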
A trick I've often used instead is to put additional (very long)
delays between the suspected code lines, marking each point via trace_
or normal printk, and to track which point we manage to reach. Then
you don't need frequent reboots but just a few long runs. Of course,
it can't be used in irq context, but for the work, it's OK.
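
A minimal userspace sketch of this delay-and-marker trick, in C and
under stated assumptions: marker_then_wait() and last_marker are
hypothetical names, fprintf() stands in for printk or a tracepoint,
and sleep() stands in for a kernel sleep, which is why the approach
only fits contexts that are allowed to sleep.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Last marker reached; kept only so the sketch can be checked. */
int last_marker = -1;

/*
 * Log a numbered marker, flush it, then pause long enough that an
 * observer can tell which marker was the last one before the hang.
 */
void marker_then_wait(int id, unsigned int delay_sec)
{
    assert(id >= 0);
    last_marker = id;
    fprintf(stderr, "debug marker %d reached\n", id);
    fflush(stderr);
    sleep(delay_sec);
}
```

Calls such as marker_then_wait(1, 30) would go between the suspected
lines; if the machine dies well within the pause after marker 2, the
code following that marker is the likely suspect.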
Maybe I'll give it a try, but likely later next week; I'll be very
busy with other tasks tomorrow, sorry.
thanks,
Takashi
>
> Lemme know if this helps at all :).
>
> >
> > That is, the problem isn't how to translate it, but how to get it.
> > Normal ways didn't work. Maybe I can try AMT, but I doubt that it'll
> > give any output since kdump already failed...
> >
> >
> > thanks,
> >
> > Takashi
>