[Nouveau] [Bug 40886] New: Improve our lockup detection, reporting and recovery

bugzilla-daemon at freedesktop.org
Wed Sep 14 12:38:55 PDT 2011


https://bugs.freedesktop.org/show_bug.cgi?id=40886

           Summary: Improve our lockup detection, reporting and recovery
           Product: xorg
           Version: unspecified
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Driver/nouveau
        AssignedTo: nouveau at lists.freedesktop.org
        ReportedBy: martin.peres at ensi-bourges.fr
         QAContact: xorg-team at lists.x.org


At the moment, all we output in the kernel logs are the errors that the GPU
reports. However, these are usually of little help.

To improve the quality of bug reports, we also need to output meaningful
register values and try to work out roughly where the problem is. If possible,
an error code should be generated to help merge duplicate bug reports into a
single meaningful one.

This task is *very* suitable for students who want to learn about nouveau. If
you are considering taking it on, please be aware that you will have a lot of
documentation to read and will also need to ask the Nouveau developers many
questions. The actual implementation should be quite small.

The Ubuntu xorg team suggested that we improve our bug "reportability". Here
is what they have available for the intel driver, which we could try to copy.

# Jesse Barnes on ubuntu-devel at lists.ubuntu.com:
#   You'll get three events, one when the error is detected, one before
#   the reset and one after.  Each has a different environment variable set;
#   the initial error has ERROR=1, the pre-reset event has RESET=1 and the
#   post-reset event has ERROR=0.

# Disable freeze hook.
SUBSYSTEM=="drm", ACTION=="change", ENV{ERROR}=="1", RUN+="/usr/share/apport/apport-gpu-error-intel.py"
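
As a rough sketch of how a script launched by such a rule could tell the
three events apart, assuming the event properties reach the script through
its environment. The function name classify_gpu_event and the returned labels
are made up for illustration; only the ERROR/RESET values come from the
description quoted above.

```python
import os


def classify_gpu_event(env=None):
    """Map the uevent variables quoted above to a (hypothetical) label."""
    env = os.environ if env is None else env
    # RESET=1 marks the event sent just before the GPU reset.
    if env.get("RESET") == "1":
        return "pre-reset"
    # ERROR=1 marks the initial error detection.
    if env.get("ERROR") == "1":
        return "error-detected"
    # ERROR=0 marks the event sent after the reset.
    if env.get("ERROR") == "0":
        return "post-reset"
    return "unknown"
```

A handler would typically collect data only for the "error-detected" case,
which is what the ENV{ERROR}=="1" match in the rule already selects.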

The python script copies dmesg, Xorg.0.log, and
/sys/kernel/debug/dri/0/i915_error_state. The last of these is an
intel-specific error dump they use to help diagnose bugs. We also capture a
variety of other data and files, but those three seem to be mostly what the
devs want.
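
A minimal sketch of what such a collection step might look like. This is not
the actual apport script: the function name collect_gpu_error_files, the
/var/log/Xorg.0.log location and the dmesg.txt output name are assumptions
made for illustration.

```python
import subprocess
from pathlib import Path

# The error_state path comes from the description above; the Xorg log
# location is an assumed default, not something stated there.
DEFAULT_SOURCES = {
    "Xorg.0.log": "/var/log/Xorg.0.log",
    "i915_error_state": "/sys/kernel/debug/dri/0/i915_error_state",
}


def collect_gpu_error_files(dest_dir, sources=DEFAULT_SOURCES, run_dmesg=True):
    """Copy each readable source into dest_dir and return the names saved."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, src in sources.items():
        try:
            (dest / name).write_bytes(Path(src).read_bytes())
            saved.append(name)
        except OSError:
            # Missing or unreadable file: skip it and keep going.
            continue
    if run_dmesg:
        try:
            result = subprocess.run(["dmesg"], capture_output=True, check=True)
            (dest / "dmesg.txt").write_bytes(result.stdout)
            saved.append("dmesg.txt")
        except (OSError, subprocess.CalledProcessError):
            # dmesg may be absent or restricted; the other files still count.
            pass
    return saved
```

Skipping unreadable sources rather than aborting means a partial report is
still produced when, for example, debugfs is not mounted.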

We extract a couple of error codes from the error_state file to use as a
way of automatically detecting dupes.
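
To make the idea concrete, here is a hedged sketch of how a duplicate key
could be derived from such a dump. It assumes the dump contains lines of the
form "NAME: 0xVALUE"; the register names EIR and IPEHR are picked only as
examples, since the text above does not say which codes are extracted, and
the function name error_signature is hypothetical.

```python
import re

# Assumed register names, for illustration only.
SIGNATURE_FIELDS = ("EIR", "IPEHR")

# Matches lines such as "EIR: 0x00000010", with optional indentation.
_FIELD_RE = re.compile(r"^\s*(\w+):\s*(0x[0-9a-fA-F]+)\s*$")


def error_signature(dump_text, fields=SIGNATURE_FIELDS):
    """Build a short duplicate-detection key from 'NAME: 0xVALUE' lines."""
    found = {}
    for line in dump_text.splitlines():
        match = _FIELD_RE.match(line)
        # Keep only the first occurrence of each wanted field.
        if match and match.group(1) in fields and match.group(1) not in found:
            found[match.group(1)] = match.group(2).lower()
    # Fields that never appear are marked with "?" so the key keeps its shape.
    return " ".join(f"{name}={found.get(name, '?')}" for name in fields)
```

Two reports whose dumps yield the same key would then be candidates for
merging; a nouveau equivalent would need its own choice of registers.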

Here are a few examples of the results of all this:

  - https://bugs.freedesktop.org/show_bug.cgi?id=35854
  - https://bugs.freedesktop.org/show_bug.cgi?id=34014
  - https://bugs.freedesktop.org/show_bug.cgi?id=34307

-- 
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

