[Intel-gfx] [PATCH 088/190] drm/i915: Move execlists interrupt based submission to a bottom-half

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Feb 19 14:10:44 UTC 2016


On 19/02/16 12:29, Chris Wilson wrote:
> On Fri, Feb 19, 2016 at 12:08:14PM +0000, Tvrtko Ursulin wrote:
>>
>> Hi,
>>
>> On 11/01/16 10:44, Chris Wilson wrote:
>>> [  196.988204] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
>>> [  196.988512] clocksource:                       'refined-jiffies' wd_now: ffff9b48 wd_last: ffff9acb mask: ffffffff
>>> [  196.988559] clocksource:                       'tsc' cs_now: 4fcfa84354 cs_last: 4f95425e98 mask: ffffffffffffffff
>>> [  196.992115] clocksource: Switched to clocksource refined-jiffies
>>>
>>> Followed by a hard lockup.
>>
>> What does the above mean? Does an irq handler taking too long
>> interfere with timekeeping?
>
> That's exactly what it means: we run for too long in interrupt context
> (i.e. with interrupts disabled).

Okay, just please spell it out in the commit message.

>> I like it BTW. Just the commit message needs more work. :)
>>
>> How is performance impact with just this patch in isolation? Worth
>> progressing it on its own?
>
> I only looked for regressions, which I didn't find. It fixes a machine
> freeze/panic, so I wasn't looking for any other reason to justify the
> patch!

Then both of the above also need to be documented in the commit message.

>>> -void intel_lrc_irq_handler(struct intel_engine_cs *ring)
>>> +static int intel_execlists_submit(void *arg)
>>>   {
>>> -	struct drm_i915_private *dev_priv = ring->dev->dev_private;
>>> -	u32 status_pointer;
>>> -	u8 read_pointer;
>>> -	u8 write_pointer;
>>> -	u32 status = 0;
>>> -	u32 status_id;
>>> -	u32 submit_contexts = 0;
>>> +	struct intel_engine_cs *ring = arg;
>>> +	struct drm_i915_private *dev_priv = ring->i915;
>>>
>>> -	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
>>> +	set_rtpriority();
>>>
>>> -	read_pointer = ring->next_context_status_buffer;
>>> -	write_pointer = GEN8_CSB_WRITE_PTR(status_pointer);
>>> -	if (read_pointer > write_pointer)
>>> -		write_pointer += GEN8_CSB_ENTRIES;
>>> +	do {
>>> +		u32 status;
>>> +		u32 status_id;
>>> +		u32 submit_contexts;
>>> +		u8 head, tail;
>>>
>>> -	spin_lock(&ring->execlist_lock);
>>> +		set_current_state(TASK_INTERRUPTIBLE);
>>
>> Hm, what is the effect of setting TASK_INTERRUPTIBLE at this stage
>> rather than just before the call to schedule?
>
> We need to set the task state before doing the checks, so that we do not
> miss a wakeup from the interrupt handler. Otherwise, we may check the
> register, decide there is nothing to do, be interrupted by the irq
> handler which then sets us to TASK_RUNNING, and then proceed with going
> to sleep by setting TASK_INTERRUPTIBLE (and so missing that the
> context-switch had just occurred and sleep forever).

Doh, of course - I haven't been exposed to this for a while.
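
To spell the ordering out for myself, the canonical shape of such a
wait loop is something like the following (a minimal sketch only;
struct engine, csb_has_work() and process_csb() are made-up names, not
the patch code):

#include <linux/kthread.h>
#include <linux/sched.h>

static int submit_thread(void *arg)
{
	struct engine *engine = arg;	/* made-up type */

	do {
		/*
		 * Set the task state before checking for work: a wakeup
		 * from the irq handler between the check and schedule()
		 * flips us back to TASK_RUNNING, so schedule() returns
		 * immediately instead of sleeping forever.
		 */
		set_current_state(TASK_INTERRUPTIBLE);

		if (!csb_has_work(engine)) {	/* made-up check */
			schedule();
			continue;
		}

		__set_current_state(TASK_RUNNING);
		process_csb(engine);		/* made-up helper */
	} while (!kthread_should_stop());

	return 0;
}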

>> And why interruptible?
>
> To hide ourselves from contributing to the system load and appear as
> sleeping (rather than blocked) in the process lists.

Oh yes, definitely.

>>> +		while (head++ < tail) {
>>> +			status = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, head % GEN8_CSB_ENTRIES));
>>> +			status_id = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, head % GEN8_CSB_ENTRIES));
>>
>> One potentially cheap improvement I've been thinking of is to move
>> the CSB reading outside the execlist_lock, to avoid slow MMIO
>> contending the lock.
>
> Yes, that should be a small but noticeable improvement.
>
>> We could fetch all valid entries into a temporary local buffer, then
>> grab the execlist_lock and process completions and submission from
>> there.
>
> If you look at the next patch, you will see that's what I did do :)

I have just about started to understand (or pretend to understand) the 
current code. I don't dare to look at a complete rewrite.

So maybe start with this one and go with small incremental changes?
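
For reference, the shape of the change I was suggesting, as a rough
sketch (the register accessors follow the quoted patch, but
process_csb_entry() is a made-up helper):

	u32 csb[GEN8_CSB_ENTRIES][2];
	unsigned int i, count = 0;

	/* Drain the slow MMIO status buffer into a local copy first... */
	while (head++ < tail) {
		unsigned int idx = head % GEN8_CSB_ENTRIES;

		csb[count][0] = I915_READ(RING_CONTEXT_STATUS_BUF_LO(ring, idx));
		csb[count][1] = I915_READ(RING_CONTEXT_STATUS_BUF_HI(ring, idx));
		count++;
	}

	/* ...then hold execlist_lock only for the list manipulation. */
	spin_lock(&ring->execlist_lock);
	for (i = 0; i < count; i++)
		process_csb_entry(ring, csb[i][0], csb[i][1]); /* made up */
	spin_unlock(&ring->execlist_lock);

That keeps the slow MMIO reads out of the critical section without
changing anything else about the submission logic.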

>>> -	ring->next_context_status_buffer = write_pointer % GEN8_CSB_ENTRIES;
>>> +		spin_unlock(&ring->execlist_lock);
>>
>> Would it be worth trying to grab the mutex and unpin the LRCs at
>> this point? It would slow down the thread a bit but would get rid of
>> the retired work queue. With the vma->iomap thingy it could be quite
>> cheap?
>
> Exactly, we want the iomap/vmap caching thingy first :) But the
> retired work queue disappears as a fallout of your previous-context idea
> anyway plus the fix to avoid the struct_mutex when freeing requests.

I did not get that working yet. I think it needs N previous contexts 
pinned in a request, where N is equal to the CSB size, since in the 
most pessimistic case we could get that many context-complete events 
in a single interrupt. See the sketch below.
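
Roughly what I have in mind, purely as an illustration (the field name
is made up; it would live in the request structure):

	/*
	 * Purely illustrative: a single interrupt can report at most one
	 * context-complete event per CSB slot, so pinning the previous
	 * GEN8_CSB_ENTRIES contexts in each request bounds how many
	 * contexts the hardware may still reference when it retires.
	 */
	struct intel_context *previous_ctx[GEN8_CSB_ENTRIES];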

Regards,

Tvrtko

