[Intel-gfx] [PATCH 2/8] drm/i915: Complete the fences as they are cancelled due to wedging

Tue Dec 4 10:30:15 UTC 2018

On 03/12/2018 17:36, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-12-03 17:11:59)
>>
>> On 03/12/2018 11:36, Chris Wilson wrote:
>>> We inspect the requests under the assumption that they will be marked as
>>> completed when they are removed from the queue. Currently however, in the
>>> process of wedging the requests will be removed from the queue before they
>>> are completed, so rearrange the code to complete the fences before the
>>> locks are dropped.
>>>
>>> <1>[  354.473346] BUG: unable to handle kernel NULL pointer dereference at 0000000000000250
>>> <6>[  354.473363] PGD 0 P4D 0
>>> <4>[  354.473370] Oops: 0000 [#1] PREEMPT SMP PTI
>>> <4>[  354.473380] CPU: 0 PID: 4470 Comm: gem_eio Tainted: G     U            4.20.0-rc4-CI-CI_DRM_5216+ #1
>>> <4>[  354.473393] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
>>> <4>[  354.473480] RIP: 0010:__i915_schedule+0x311/0x5e0 [i915]
>>> <4>[  354.473490] Code: 49 89 44 24 20 4d 89 4c 24 28 4d 89 29 44 39 b3 a0 04 00 00 7d 3a 41 8b 44 24 78 85 c0 74 13 48 8b 93 78 04 00 00 48 83 e2 fc <39> 82 50 02 00 00 79 1e 44 89 b3 a0 04 00 00 48 8d bb d0 03 00 00
>>
>> This confuses me, isn't the code segment usually at the end?
> 
> *shrug* It was cut and paste.
> 
>> And then
>> you have another after the call trace which doesn't match
>> __i915_schedule. Anyway, _this_ code seems to be this part:
>>
>>                   if (node_to_request(node)->global_seqno &&
>>    90d:   8b 43 78                mov    eax,DWORD PTR [rbx+0x78]
>>    910:   85 c0                   test   eax,eax
>>    912:   74 13                   je     927 <__i915_schedule+0x317>
>>
>>                   i915_seqno_passed(port_request(engine->execlists.port)->global_seqno,
>>    914:   49 8b 97 c0 04 00 00    mov    rdx,QWORD PTR [r15+0x4c0]
>>    91b:   48 83 e2 fc             and    rdx,0xfffffffffffffffc
>>                   if (node_to_request(node)->global_seqno &&
>>    91f:   39 82 50 02 00 00       cmp    DWORD PTR [rdx+0x250],eax
>>    925:   79 1e                   jns    945 <__i915_schedule+0x335>
>>
>>> <4>[  354.473515] RSP: 0018:ffffc900001bba90 EFLAGS: 00010046
>>> <4>[  354.473524] RAX: 0000000000000003 RBX: ffff8882624c8008 RCX: f34a737800000000
>>> <4>[  354.473535] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8882624c8048
>>> <4>[  354.473545] RBP: ffffc900001bbab0 R08: 000000005963f1f1 R09: 0000000000000000
>>> <4>[  354.473556] R10: ffffc900001bba10 R11: ffff8882624c8060 R12: ffff88824fdd7b98
>>> <4>[  354.473567] R13: ffff88824fdd7bb8 R14: 0000000000000001 R15: ffff88824fdd7750
>>> <4>[  354.473578] FS:  00007f44b4b5b980(0000) GS:ffff888277e00000(0000) knlGS:0000000000000000
>>> <4>[  354.473590] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> <4>[  354.473599] CR2: 0000000000000250 CR3: 000000026976e000 CR4: 0000000000340ef0
>>
>> Given the registers above, I think it means this: eax is the
>> global_seqno of the node rq, and rdx is port_request, so NULL and bang.
>> No request in port, but why would there always be one at the point we
>> are scheduling a new request into the runnable queue?
> 
> Correct. The answer, as I chose to interpret it, is because of the
> incomplete submitted+dequeued requests during cancellation which this
> patch attempts to address.
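
To spell out the register reading above, here is a minimal sketch in C. It
is not the driver's code: struct fake_request and the fake_* helpers are
hypothetical stand-ins. The 0x250 offset and the low-bit mask are read from
the disassembly quoted above, and the wrap-safe comparison is what I assume
i915_seqno_passed() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct i915_request. Only the offset of
 * global_seqno (0x250, taken from the disassembly) is modelled.
 */
struct fake_request {
	char pad[0x250];
	uint32_t global_seqno;
};

static_assert(offsetof(struct fake_request, global_seqno) == 0x250,
	      "global_seqno must sit at the faulting offset");

/* Mirrors "and rdx,0xfffffffffffffffc": drop the two low tag bits. */
static inline struct fake_request *fake_port_request(uintptr_t packed)
{
	return (struct fake_request *)(packed & ~(uintptr_t)3);
}

/* Wrap-safe "a is at or after b", matching the cmp + jns pair. */
static inline int fake_seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Address the cmp instruction would read for a given packed port value. */
static inline uintptr_t fake_fault_address(uintptr_t packed)
{
	return (packed & ~(uintptr_t)3) +
	       offsetof(struct fake_request, global_seqno);
}

/*
 * Illustration only: the same check with the empty-port case made
 * explicit. Returns 1 if the port holds a request at or past seqno.
 */
static inline int fake_port_passed(uintptr_t packed, uint32_t seqno)
{
	const struct fake_request *rq = fake_port_request(packed);

	return rq && fake_seqno_passed(rq->global_seqno, seqno);
}
```

With an empty port (packed value 0), fake_fault_address() gives 0x250,
which matches CR2 in the oops. fake_port_passed() only shows where a NULL
check would sit; the patch instead completes the fences before the locks
are dropped.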

I couldn't find any other route to this state myself, so on the basis of 
that, but with a little bit of fear from "Could it have really been so 
much simpler all along?!":

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko