[Nouveau] Tracking down severe regression in 5.3-rc4/5.4 for TU116 - assistance needed

Ilia Mirkin imirkin at alum.mit.edu
Thu Dec 19 20:38:10 UTC 2019


Let's add Mika and Rafael, as they were responsible for that commit.
Mika/Rafael - any ideas? The commit in question is

0617bdede5114a0002298b12cd0ca2b0cfd0395d

Marcin -- would be nice if you could confirm that taking a recent
kernel + "git revert 0617bdede5114a0002298b12cd0ca2b0cfd0395d" works
well for you.

On Thu, Dec 19, 2019 at 3:27 PM Marcin Zajączkowski <mszpak at wp.pl> wrote:
>
> On 2019-12-16 19:45, Ilia Mirkin wrote:
> > The obvious candidate based on a quick scan is
> > 0acf5676dc0ffe0683543a20d5ecbd112af5b8ee -- it merges a fix that
> > messes with PCI stuff, and there lie dragons. You could try building
> > that commit, and if things still work, then I have no idea (and you've
>
> Nice shot Ilia!
>
> I managed to build kernel from suspected bd112af5b8ee and it fails

Took me a while, but this is the end of the hash. Normally you list
the start of the hash (and that's what all the git tools accept). In
this case this is commit

0acf5676dc0ffe0683543a20d5ecbd112af5b8ee

> miserably (as previously described). The build from the previous commit
> 86a04561920b works fine.

e577dc152e232c78e5774e4c9b5486a04561920b

>
> > narrowed the range). Also I'd recommend ensuring that the good kernel
> > is really good and the bad kernel is really bad -- boot them a few
> > times.
>
> Well, this problem is reproducible in 100% in newer kernels. I see the
> errors on boot logs and after login to Gnome Shell the first execution
> of xrandr (or opening a lid) hangs the system (the graphic card). On the
> other side I haven't seen that problem in any earlier kernel. Therefore,
> the situation is rather clear in my case. Nevertheless, I will stay with
> that self-build good kernel (5.3.0-0.rc3 + git) to check it further.
>
>
> How would you see it, Ilia? Is there anything in nouveau that needs to
> be adjusted to that changes or rather those changes break something in
> nouveau that would be best to fix/revert them (and it would be good to
> let the committer know about the problem)?
>
> Marcin
>
>
>
> > On Mon, Dec 16, 2019 at 12:42 PM Marcin Zajączkowski <mszpak at wp.pl> wrote:
> >>
> >> On 2019-12-16 18:08, Ilia Mirkin wrote:
> >>> Hi Marcin,
> >>>
> >>> You should do a git bisect rather than guessing about commits. I
> >>> suspect that searching for "kernel git bisect fedora" should prove
> >>> instructive if you're not sure how to do this.
> >>
> >> Thanks for your suggestion. I realize that I can do it at the Git level
> >> and it is the ultimate way to go. However, building the kernel version
> >> from sources takes some time (in addition to a regular time needed to
> >> install/restart/verify which I already experienced narrowing down to a
> >> "just" ~250 commits).
> >>
> >> Therefore, I would be really thankful for a suggestion which commits
> >> could be good to check first - having 2, 4 is better than 8-10 (assuming
> >> someone is right :) ).
> >>
> >> Marcin
> >>
> >>
> >>
> >>> On Mon, Dec 16, 2019 at 11:42 AM Marcin Zajączkowski <mszpak at wp.pl> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I've encountered a severe regression in TU116 (probably also TU117)
> >>>> introduced in 5.3-rc4 (valid also for recent 5.4.2) [1]. The system
> >>>> usually hangs on the subsequent graphic mode related operation (calling
> >>>> xrandr after login is enough) with the following error:
> >>>>
> >>>>> kernel: nouveau 0000:01:00.0: fifo: SCHED_ERROR 08 []
> >>>> ...
> >>>>> kernel: nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
> >>>>> kernel: nouveau 0000:01:00.0: i2c: aux 0007: begin idle timeout ffffffff
> >>>>> kernel: nouveau 0000:01:00.0: tmr: stalled at ffffffffffffffff
> >>>>> kernel: ------------[ cut here ]------------
> >>>>> kernel: nouveau 0000:01:00.0: timeout
> >>>>> kernel: WARNING: CPU: 10 PID: 384 at drivers/gpu/drm/nouveau/nvkm/subdev/bar/g84.c:35 g84_bar_flush+0xcf/> 0xe0 [nouveau]
> >>>>
> >>>> (detailed log in a corresponding issue - [1])
> >>>>
> >>>> With earlier kernels there was no hardware acceleration for NVidia GTX
> >>>> 1660 Ti, but at least I could use nouveau to disable it (to save
> >>>> battery, trees and lower temperature) or even have an external output
> >>>> (with Wayland). Now, the system is unusable with nouveau :(.
> >>>>
> >>>> I spent some time trying to narrow the scope using on the existing
> >>>> kernel builds for Fedora. I was able to determine that the problem was
> >>>> introduced between 5.3.0-0.rc3.git1.1 (commit 33920f1ec5bf - works fine)
> >>>> and 5.3.0-0.rc4.git0.1 (tag v5.3-rc4 - fails with errors).
> >>>>
> >>>> It's just a few days (7-11 Aug) and "only" around 250 commits. I went
> >>>> through them, but (based on the commits name) I haven't seen any nouveau
> >>>> related changes and in general no very suspected drm related changes.
> >>>>
> >>>>> git log 33920f1ec5bf..v5.3-rc4 --stat
> >>>>
> >>>>
> >>>> Maybe some of more nouveau/drm-experienced developers could take a look
> >>>> at that to determine which commit could break it (to make it easier to
> >>>> find out what should be fixed to prevent that regression)?
> >>>>
> >>>>
> >>>> [1] -
> >>>> https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/516
> >>>>
> >>>> Thanks in advance
> >>>> Marcin


More information about the Nouveau mailing list