[Intel-gfx] [PATCH] drm/i915: Don't mark an execlists context-switch when idle

Tue Apr 11 11:29:46 UTC 2017

On Tue, Apr 11, 2017 at 10:18:56AM +0100, Chris Wilson wrote:
> If we *know* that the engine is idle, i.e. we have not more contexts in
> lift, we can skip any spurious CSB idle interrupts. These spurious
> interrupts seem to arrive long after we assert that the engines are
> completely idle, triggering later assertions:
> 
> [  178.896646] intel_engine_is_idle(bcs): interrupt not handled, irq_posted=2
> [  178.896655] ------------[ cut here ]------------
> [  178.896658] kernel BUG at drivers/gpu/drm/i915/intel_engine_cs.c:226!
> [  178.896661] invalid opcode: 0000 [#1] SMP
> [  178.896663] Modules linked in: i915(E) x86_pkg_temp_thermal(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) intel_gtt(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) aesni_intel(E) prime_numbers(E) evdev(E) aes_x86_64(E) drm(E) crypto_simd(E) cryptd(E) glue_helper(E) mei_me(E) mei(E) lpc_ich(E) efivars(E) mfd_core(E) battery(E) video(E) acpi_pad(E) button(E) tpm_tis(E) tpm_tis_core(E) tpm(E) autofs4(E) i2c_i801(E) fan(E) thermal(E) i2c_designware_platform(E) i2c_designware_core(E)
> [  178.896694] CPU: 1 PID: 522 Comm: gem_exec_whispe Tainted: G            E   4.11.0-rc5+ #14
> [  178.896702] task: ffff88040aba8d40 task.stack: ffffc900003f0000
> [  178.896722] RIP: 0010:intel_engine_init_global_seqno+0x1db/0x1f0 [i915]
> [  178.896725] RSP: 0018:ffffc900003f3ab0 EFLAGS: 00010246
> [  178.896728] RAX: 0000000000000000 RBX: ffff88040af54000 RCX: 0000000000000000
> [  178.896731] RDX: ffff88041ec933e0 RSI: ffff88041ec8cc48 RDI: ffff88041ec8cc48
> [  178.896734] RBP: ffffc900003f3ac8 R08: 0000000000000000 R09: 000000000000047d
> [  178.896736] R10: 0000000000000040 R11: ffff88040b344f80 R12: 0000000000000000
> [  178.896739] R13: ffff88040bce0000 R14: ffff88040bce52d8 R15: ffff88040bce0000
> [  178.896742] FS:  00007f2cccc2d8c0(0000) GS:ffff88041ec80000(0000) knlGS:0000000000000000
> [  178.896746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  178.896749] CR2: 00007f41ddd8f000 CR3: 000000040bb03000 CR4: 00000000001406e0
> [  178.896752] Call Trace:
> [  178.896768]  reset_all_global_seqno.part.33+0x4e/0xd0 [i915]
> [  178.896782]  i915_gem_request_alloc+0x304/0x330 [i915]
> [  178.896795]  i915_gem_do_execbuffer+0x8a1/0x17d0 [i915]
> [  178.896799]  ? remove_wait_queue+0x48/0x50
> [  178.896812]  ? i915_wait_request+0x300/0x590 [i915]
> [  178.896816]  ? wake_up_q+0x70/0x70
> [  178.896819]  ? refcount_dec_and_test+0x11/0x20
> [  178.896823]  ? reservation_object_add_excl_fence+0xa5/0x100
> [  178.896835]  i915_gem_execbuffer2+0xab/0x1f0 [i915]
> [  178.896844]  drm_ioctl+0x1e6/0x460 [drm]
> [  178.896858]  ? i915_gem_execbuffer+0x260/0x260 [i915]
> [  178.896862]  ? dput+0xcf/0x250
> [  178.896866]  ? full_proxy_release+0x66/0x80
> [  178.896869]  ? mntput+0x1f/0x30
> [  178.896872]  do_vfs_ioctl+0x8f/0x5b0
> [  178.896875]  ? ____fput+0x9/0x10
> [  178.896878]  ? task_work_run+0x80/0xa0
> [  178.896881]  SyS_ioctl+0x3c/0x70
> [  178.896885]  entry_SYSCALL_64_fastpath+0x17/0x98
> [  178.896888] RIP: 0033:0x7f2ccb455ca7
> [  178.896890] RSP: 002b:00007ffcabec72d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  178.896894] RAX: ffffffffffffffda RBX: 000055f897a44b90 RCX: 00007f2ccb455ca7
> [  178.896897] RDX: 00007ffcabec74a0 RSI: 0000000040406469 RDI: 0000000000000003
> [  178.896900] RBP: 00007f2ccb70a440 R08: 00007f2ccb70d0a4 R09: 0000000000000000
> [  178.896903] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  178.896905] R13: 000055f89782d71a R14: 00007ffcabecf838 R15: 0000000000000003
> [  178.896908] Code: 00 31 d2 4c 89 ef 8d 70 48 41 ff 95 f8 06 00 00 e9 68 fe ff ff be 0f 00 00 00 48 c7 c7 48 dc 37 a0 e8 fa 33 d6 e0 e9 0b ff ff ff <0f> 0b 0f 0b 0f 0b 0f 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00
> 
> On the other hand, by ignoring the interrupt do we risk running out of
> space in CSB ring? Testing for a few hours suggests not, i.e. that we
> only seem to get the odd delayed CSB idle notification.

The alternative is not to report upon engine->irq_posted from within
intel_engine_is_idle() but that loses us an important safety check for
e.g. suspend.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre