[Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915: implement internal workqueues (rev3)
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Jun 6 13:33:14 UTC 2023
On 06/06/2023 12:06, Coelho, Luciano wrote:
> On Tue, 2023-06-06 at 11:06 +0100, Tvrtko Ursulin wrote:
>> On 05/06/2023 16:06, Jani Nikula wrote:
>>> On Wed, 31 May 2023, Patchwork <patchwork at emeril.freedesktop.org> wrote:
>>>> #### Possible regressions ####
>>>>
>>>> * igt@gem_close_race@basic-process:
>>>> - fi-blb-e6850: [PASS][1] -> [ABORT][2]
>>>> [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-blb-e6850/igt@gem_close_race@basic-process.html
>>>> [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-blb-e6850/igt@gem_close_race@basic-process.html
>>>> - fi-hsw-4770: [PASS][3] -> [ABORT][4]
>>>> [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-hsw-4770/igt@gem_close_race@basic-process.html
>>>> [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-hsw-4770/igt@gem_close_race@basic-process.html
>>>> - fi-elk-e7500: [PASS][5] -> [ABORT][6]
>>>> [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/fi-elk-e7500/igt@gem_close_race@basic-process.html
>>>> [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/fi-elk-e7500/igt@gem_close_race@basic-process.html
>>>>
>>>> * igt@i915_selftest@live@evict:
>>>> - bat-adlp-9: [PASS][7] -> [ABORT][8]
>>>> [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlp-9/igt@i915_selftest@live@evict.html
>>>> [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlp-9/igt@i915_selftest@live@evict.html
>>>> - bat-rpls-2: [PASS][9] -> [ABORT][10]
>>>> [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-2/igt@i915_selftest@live@evict.html
>>>> [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-2/igt@i915_selftest@live@evict.html
>>>> - bat-adlm-1: [PASS][11] -> [ABORT][12]
>>>> [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-adlm-1/igt@i915_selftest@live@evict.html
>>>> [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-adlm-1/igt@i915_selftest@live@evict.html
>>>> - bat-rpls-1: [PASS][13] -> [ABORT][14]
>>>> [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13203/bat-rpls-1/igt@i915_selftest@live@evict.html
>>>> [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_117618v3/bat-rpls-1/igt@i915_selftest@live@evict.html
>>>
>>> This still fails consistently, I have no clue why, and the above aren't
>>> even remotely related to display.
>>>
>>> What now? Tvrtko?
>>
>> Hmm..
>>
>> <4> [46.782321] Chain exists of:
>>   (wq_completion)i915 --> (work_completion)(&i915->mm.free_work) --> &vm->mutex
>> <4> [46.782329]  Possible unsafe locking scenario:
>> <4> [46.782332]        CPU0                    CPU1
>> <4> [46.782334]        ----                    ----
>> <4> [46.782337]   lock(&vm->mutex);
>> <4> [46.782340]                                lock((work_completion)(&i915->mm.free_work));
>> <4> [46.782344]                                lock(&vm->mutex);
>> <4> [46.782348]   lock((wq_completion)i915);
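>>
>> To spell out what that chain means in practice, here is a minimal
>> illustrative sketch (made-up names, not the actual i915 call paths):
>> two work items on one ordered wq, plus a flush under the mutex, is
>> enough to close the loop:
>>
>> #include <linux/mutex.h>
>> #include <linux/workqueue.h>
>>
>> static DEFINE_MUTEX(vm_mutex);
>>
>> static void free_worker(struct work_struct *w)
>> {
>> 	mutex_lock(&vm_mutex);	/* blocked while the flusher holds it */
>> 	/* ... free objects ... */
>> 	mutex_unlock(&vm_mutex);
>> }
>> static DECLARE_WORK(free_work, free_worker);
>>
>> static void wakeref_worker(struct work_struct *w) { /* park engines */ }
>> static DECLARE_WORK(wakeref_work, wakeref_worker);
>>
>> static void flusher(struct workqueue_struct *ordered_wq)
>> {
>> 	queue_work(ordered_wq, &free_work);
>> 	queue_work(ordered_wq, &wakeref_work);
>>
>> 	mutex_lock(&vm_mutex);
>> 	/*
>> 	 * An ordered wq runs one item at a time in queueing order, so
>> 	 * waiting for wakeref_work also waits for free_work queued
>> 	 * ahead of it, which itself waits for vm_mutex: possible
>> 	 * deadlock. On an unordered wq the two items do not serialize.
>> 	 */
>> 	flush_work(&wakeref_work);
>> 	mutex_unlock(&vm_mutex);
>> }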
>>
>>
>> "(wq_completion)i915"
>>
>> So it's not even about the new wq. Perhaps it is this hunk:
>>
>> --- a/drivers/gpu/drm/i915/intel_wakeref.c
>> +++ b/drivers/gpu/drm/i915/intel_wakeref.c
>> @@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
>>
>> /* Assume we are not in process context and so cannot sleep. */
>> if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
>> - mod_delayed_work(system_wq, &wf->work,
>> + mod_delayed_work(wf->i915->wq, &wf->work,
>>
>> Elsewhere this patch replaces system_wq with the new unordered wq, so I'd try that first.
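>>
>> For reference, the series sets the two queues up roughly like this
>> (a sketch; the names and flags are my assumptions, not checked
>> against rev3):
>>
>> i915->wq = alloc_ordered_workqueue("i915", 0);	/* serializes its items */
>> i915->unordered_wq = alloc_workqueue("i915-unordered", 0, 0); /* does not */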
>
> Indeed, this seems to be exactly the block that is causing the issue.
> I was bisecting through all these changes, and reverting this one
> prevents the lockdep splat from happening...
>
> So there's something that needs to be synced with the system_wq here,
> but what? I need to dig into it.
AFAICT it is saying that i915->mm.free_work and engine->wakeref.work
must not be on the same ordered wq. Otherwise the flush in the execbuf
call path, which runs under vm->mutex, can deadlock against the free
worker trying to grab vm->mutex. If engine->wakeref.work is on a
separate unordered wq it is safe, since its execution is then not
serialized with free_work. So just using the new i915->unordered_wq in
this hunk should work.
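
I.e. something like this on top (untested sketch, assuming the new
queue is exposed as i915->unordered_wq):

--- a/drivers/gpu/drm/i915/intel_wakeref.c
+++ b/drivers/gpu/drm/i915/intel_wakeref.c
@@ -75,7 +75,7 @@ void __intel_wakeref_put_last(struct intel_wakeref *wf, unsigned long flags)
 
 	/* Assume we are not in process context and so cannot sleep. */
 	if (flags & INTEL_WAKEREF_PUT_ASYNC || !mutex_trylock(&wf->mutex)) {
-		mod_delayed_work(wf->i915->wq, &wf->work,
+		mod_delayed_work(wf->i915->unordered_wq, &wf->work,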
Regards,
Tvrtko