<html>
    <head>
      <base href="https://bugs.freedesktop.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - gnome-shell stuck because of amdgpu driver 5.3 kernel"
   href="https://bugs.freedesktop.org/show_bug.cgi?id=111689">111689</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>gnome-shell stuck because of amdgpu driver 5.3 kernel
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>DRI
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>XOrg git
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>Other
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>not set
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>not set
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>DRM/AMDgpu
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>dri-devel@lists.freedesktop.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>mikhail.v.gavrilov@gmail.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=145368" name="attach_145368" title="dmesg">attachment 145368</a> <a href="attachment.cgi?id=145368&action=edit" title="dmesg">[details]</a></span>
dmesg

During the 5.3 RC5 cycle I happened to notice the following: I left gnome-shell
unlocked at noon, and when I returned in the evening the monitor had not gone to
sleep and was still showing the open GNOME Activities view. At first I thought
that some application was preventing the system from going to sleep. But when I
tried to move the mouse, I realized that the system had hung. So I connected via
ssh and tried to investigate the problem. I did not see anything strange in the
kernel logs. My last idea before trying to kill the gnome-shell process was to
dump the tasks that are in uninterruptible (blocked) state.

After [Alt + PrnScr + W] I saw this:

[32840.701909] sysrq: Show Blocked State
[32840.701976]   task                        PC stack   pid father
[32840.702407] gnome-shell     D11240  1900   1830 0x00000000
[32840.702438] Call Trace:
[32840.702446]  ? __schedule+0x352/0x900
[32840.702453]  schedule+0x3a/0xb0
[32840.702457]  schedule_timeout+0x289/0x3c0
[32840.702461]  ? find_held_lock+0x32/0x90
[32840.702464]  ? find_held_lock+0x32/0x90
[32840.702469]  ? mark_held_locks+0x50/0x80
[32840.702473]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[32840.702478]  dma_fence_default_wait+0x1f5/0x340
[32840.702482]  ? dma_fence_free+0x20/0x20
[32840.702487]  dma_fence_wait_timeout+0x182/0x1e0
[32840.702533]  amdgpu_fence_wait_empty+0xe7/0x210 [amdgpu]
[32840.702577]  amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
[32840.702641]  dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
[32840.702705]  dce12_update_clocks+0xd8/0x110 [amdgpu]
[32840.702766]  dc_commit_state+0x414/0x590 [amdgpu]
[32840.702834]  amdgpu_dm_atomic_commit_tail+0xd1e/0x1cf0 [amdgpu]
[32840.702840]  ? reacquire_held_locks+0xed/0x210
[32840.702848]  ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
[32840.702853]  ? find_held_lock+0x32/0x90
[32840.702855]  ? find_held_lock+0x32/0x90
[32840.702860]  ? __lock_acquire+0x247/0x1910
[32840.702867]  ? find_held_lock+0x32/0x90
[32840.702871]  ? mark_held_locks+0x50/0x80
[32840.702874]  ? _raw_spin_unlock_irq+0x29/0x40
[32840.702877]  ? lockdep_hardirqs_on+0xf0/0x180
[32840.702881]  ? _raw_spin_unlock_irq+0x29/0x40
[32840.702884]  ? wait_for_completion_timeout+0x75/0x190
[32840.702895]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702902]  commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702909]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[32840.702922]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[32840.702936]  set_property_atomic+0xcc/0x140 [drm]
[32840.702955]  drm_mode_obj_set_property_ioctl+0xcb/0x1c0 [drm]
[32840.702968]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.702978]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[32840.702990]  drm_ioctl+0x208/0x390 [drm]
[32840.703003]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.703007]  ? sched_clock_cpu+0xc/0xc0
[32840.703012]  ? lockdep_hardirqs_on+0xf0/0x180
[32840.703053]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[32840.703058]  do_vfs_ioctl+0x411/0x750
[32840.703065]  ksys_ioctl+0x5e/0x90
[32840.703069]  __x64_sys_ioctl+0x16/0x20
[32840.703072]  do_syscall_64+0x5c/0xb0
[32840.703076]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[32840.703079] RIP: 0033:0x7f8bcab0f00b
[32840.703084] Code: Bad RIP value.
[32840.703086] RSP: 002b:00007ffe76c62338 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[32840.703089] RAX: ffffffffffffffda RBX: 00007ffe76c62370 RCX: 00007f8bcab0f00b
[32840.703092] RDX: 00007ffe76c62370 RSI: 00000000c01864ba RDI: 0000000000000009
[32840.703094] RBP: 00000000c01864ba R08: 0000000000000003 R09: 00000000c0c0c0c0
[32840.703096] R10: 000056476c86a018 R11: 0000000000000246 R12: 000056476c8ad940
[32840.703098] R13: 0000000000000009 R14: 0000000000000002 R15: 0000000000000003
[root@localhost ~]#
[root@localhost ~]# ps aux | grep gnome-shell
mikhail     1900  0.3  1.1 6447496 378696 tty2   Dl+  Aug24   2:10 /usr/bin/gnome-shell
mikhail     2099  0.0  0.0 519984 23392 ?        Ssl  Aug24   0:00 /usr/libexec/gnome-shell-calendar-server
mikhail    12214  0.0  0.0 399484 29660 pts/2    Sl+  Aug24   0:00 /usr/bin/python3 /usr/bin/chrome-gnome-shell chrome-extension://gphhapmejobijbbhgpjhcjognlahblep/
root       22957  0.0  0.0 216120  2456 pts/10   S+   03:59   0:00 grep --color=auto gnome-shell

After that, I tried to kill the gnome-shell process with signal 9 (SIGKILL),
but the process did not terminate even after several attempts.
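
The ps output above shows gnome-shell with STAT "Dl+". A leading "D" means
uninterruptible sleep, and a task in that state normally does not act on
signals until it wakes up, which would be consistent with signal 9 having no
effect here. The following is only a minimal sketch for picking such rows out
of "ps aux" output. The function name uninterruptible() is hypothetical, and
the sample rows are copied from the output above with the wrapped command
joined back onto its row.

```python
def uninterruptible(ps_lines):
    """Return (pid, command) for every row whose STAT column starts with 'D'."""
    found = []
    for line in ps_lines:
        # "ps aux" prints 11 columns; COMMAND is last and may contain spaces,
        # so split at most 10 times to keep it in one piece.
        fields = line.split(None, 10)
        if len(fields) == 11 and fields[7].startswith("D"):
            found.append((int(fields[1]), fields[10]))
    return found


sample = [
    "mikhail     1900  0.3  1.1 6447496 378696 tty2   Dl+  Aug24   2:10 /usr/bin/gnome-shell",
    "mikhail     2099  0.0  0.0 519984 23392 ?        Ssl  Aug24   0:00 /usr/libexec/gnome-shell-calendar-server",
]
print(uninterruptible(sample))  # prints [(1900, '/usr/bin/gnome-shell')]
```

On a live system the same rows could be taken from the output of "ps aux"
instead of the pasted sample.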

Only [Alt + PrnScr + B] helped to reboot the hung system.
I am writing here because I hope some amdgpu hackers can look at the trace and
understand what is happening.
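
For anyone reading the trace above: entries prefixed with "?" are addresses the
kernel found on the stack but could not confirm as part of the call chain, so
the unmarked entries are the ones to follow. The following is only a rough
sketch for filtering a pasted trace down to those entries. The names FRAME and
reliable_frames() are hypothetical, the regular expression assumes the
"[timestamp]  symbol+0xoff/0xsize [module]" layout seen in this report, and the
sample lines are copied from the trace above.

```python
import re

# One stack-frame line of a kernel call trace as printed in dmesg.
# Group 1: optional "? " marker, group 2: symbol, group 4: optional module name.
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(\s+\[(\w+)\])?\s*$"
)


def reliable_frames(lines):
    """Return (symbol, module) for every frame that is not marked with '?'."""
    frames = []
    for line in lines:
        m = FRAME.match(line)
        if m is None or m.group(1):
            continue  # not a frame line, or a frame marked as unreliable
        frames.append((m.group(2), m.group(4) or "kernel"))
    return frames


sample = [
    "[32840.702446]  ? __schedule+0x352/0x900",
    "[32840.702453]  schedule+0x3a/0xb0",
    "[32840.702457]  schedule_timeout+0x289/0x3c0",
    "[32840.702461]  ? find_held_lock+0x32/0x90",
    "[32840.702478]  dma_fence_default_wait+0x1f5/0x340",
    "[32840.702533]  amdgpu_fence_wait_empty+0xe7/0x210 [amdgpu]",
]
for symbol, module in reliable_frames(sample):
    print(symbol, module)
```

Applied to the full trace, this would leave the chain from the ioctl entry down
to dma_fence_default_wait, where the task is waiting.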

On the dri-devel mailing list, Hillf Danton proposed two patches [1].
I tested both patches on top of 5.3 RC and did not see any problems with them.
But because the initial problem occurred by chance, I can't confirm
that it was definitely fixed by the patches.

Then Daniel Vetter proposed adding debug messages to the patch [2] to see which
fence is stuck.

After all the changes, I regularly see a trace in the kernel logs that appears
only when the computer is locked and the monitor is in power save mode.

Christian König suspects that the problem is simply that PM turns off a block
before all work on that block is done [3]. He suggested filing a bug report
because it is very difficult to gather all the information from the mail thread.

[1] <a href="https://lists.freedesktop.org/archives/dri-devel/2019-August/232853.html">https://lists.freedesktop.org/archives/dri-devel/2019-August/232853.html</a>
[2] <a href="https://lists.freedesktop.org/archives/dri-devel/2019-September/234321.html">https://lists.freedesktop.org/archives/dri-devel/2019-September/234321.html</a>
[3] <a href="https://lists.freedesktop.org/archives/dri-devel/2019-September/234821.html">https://lists.freedesktop.org/archives/dri-devel/2019-September/234821.html</a></pre>
        </div>
      </p>


    </body>
</html>