[Intel-gfx] [PATCH 24/46] drm/i915: Do a synchronous switch-to-kernel-context on idling

Thu Feb 21 22:53:41 UTC 2019

On 2/21/19 1:42 PM, Chris Wilson wrote:
> Quoting Daniele Ceraolo Spurio (2019-02-21 21:31:45)
>>
>>
>> On 2/21/19 1:17 PM, Chris Wilson wrote:
>>> Quoting Daniele Ceraolo Spurio (2019-02-21 19:48:01)
>>>>
>>>> <snip>
>>>>
>>>>> @@ -4481,19 +4471,7 @@ int i915_gem_suspend(struct drm_i915_private *i915)
>>>>>          * state. Fortunately, the kernel_context is disposable and we do
>>>>>          * not rely on its state.
>>>>>          */
>>>>> -     if (!i915_terminally_wedged(&i915->gpu_error)) {
>>>>> -             ret = i915_gem_switch_to_kernel_context(i915);
>>>>> -             if (ret)
>>>>> -                     goto err_unlock;
>>>>> -
>>>>> -             ret = i915_gem_wait_for_idle(i915,
>>>>> -                                          I915_WAIT_INTERRUPTIBLE |
>>>>> -                                          I915_WAIT_LOCKED |
>>>>> -                                          I915_WAIT_FOR_IDLE_BOOST,
>>>>> -                                          HZ / 5);
>>>>> -             if (ret == -EINTR)
>>>>> -                     goto err_unlock;
>>>>> -
>>>>> +     if (!switch_to_kernel_context_sync(i915)) { >                   /* Forcibly cancel outstanding work and leave the gpu quiet. */
>>>>>                 i915_gem_set_wedged(i915);
>>>>>         }
>>>>
>>>> GuC-related question: what's your expectation here in regards to GuC
>>>> status? The current i915 flow expect either uc_reset_prepare() or
>>>> uc_suspend() to be called to clean up the guc status, but we're calling
>>>> neither of them here if the switch is successful. Do you expect the
>>>> resume code to always blank out the GuC status before a reload?
>>>
>>> (A few patches later on I propose that we always just do a reset+wedge
>>> on suspend in lieu of hangcheck.)
>>>
>>> On resume, we have to bring the HW up from scratch and do another reset
>>> in the process. Some platforms have been known to survive the trips to
>>> PCI_D3 (someone is lying!) and so we _have_ to do a reset to be sure we
>>> clear the HW state. I expect we would need to force a reset on resume
>>> even for the guc, to be sure we cover all cases such as kexec.
>>> -Chris
>>>
>> More than about the HW state, my question here was about the SW
>> tracking. At which point do we go and stop guc communication and mark
>> guc as not loaded/accessible? e.g. we need to disable and re-enable CT
>> buffers before GuC is reset/suspended to make sure the shared memory
>> area is cleaned correctly (we currently avoid memsetting all of it on
>> reload since it is quite big). Also, communication with GuC is going to
>> increase going forward, so we'll need to make sure we accurately track
>> its state and do all the relevant cleanups.
> 
> Across suspend/resume, we issue a couple of resets and scrub/sanitize our
> state tracking. By the time we load the fw again, both the fw and our
> state should be starting from scratch.
> 
> That all seems unavoidable, so I am not understanding the essence of
> your question.
> -Chris
> 

We're not doing the state scrubbing for guc in all paths at the moment. 
There is logic in gem_suspend_late(), but that doesn't seem to be called 
on all paths; e.g. it isn't when we run 
igt at gem_exec_suspend@basic-s4-devices and that's why Suja's patch moved 
the disabling of communication from uc_sanitize to uc_suspend. The guc 
resume code also doesn't currently clean everything as some of the 
structures (including stuff we allocate for guc usage) are carried over. 
We can either add something more in the cleanup path or go and rework 
the resume to blank everything (which would be time consuming since 
there is tens of MBs involved), but before putting down any code one way 
or another I wanted to understand what the expectation is.

Thanks,
Daniele