[PATCH] drm/i915/guc: Add synchronization on interrupt enable flag
Dong, Zhanjun
zhanjun.dong at intel.com
Thu Aug 21 14:39:14 UTC 2025
On 2025-08-20 6:05 p.m., Andi Shyti wrote:
> Hi Zhanjun,
>
> ...
>
>>> On Tue, Aug 19, 2025 at 12:00:10PM -0400, Zhanjun Dong wrote:
>>>> Boolean flag access from interrupt context might have synchronous issue on
>>>
>>> /synchronous/synchronization/
>>> /issue/issues/
>>>
>>>> multiple processor platform, flage modified by one core might be read as an
>>>
>>> /flage/flags/
>>>
>>>> old value by another core. This issue on interrupt enable flag might causes
>>>> interrupt missing or leakage.
>>>
>>> /missing/misses/
>>>
>>>> Fixed by change the data type as automic to add memory synchronization.
>>>
>>> This can be re-written: "Change the interrupts.enable type to
>>> atomic to ensure memory synchronization."
>> Will follow all above comments
>
> No need to resend for this if the patch is fine.
>
>>>> Fixes: a187f13d51fa0 ("drm/i915/guc: handle interrupts from media GuC")
>>>
>>> What issue are you exactly trying to fix? And have you tested
>> Will add:
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14834
>>
>>> that this patch is this really fixing it? Where is the flag's
>> CI report this issue about every 1 or 2 months and close it because no
>> reproduce.
>> I can reproduce it in about 1000 rounds(about 9 hours) of IGT test but not
>> at 100% rate, so it is hard to say really fixed because of hard to
>> reproduce.
>> My latest test runs 2200 rounds in total has no reproduce.
>>
>>> misalignment happening?
>>>
>>> Please remember that when in interrupt context you are already
>>> running in atomic, so that probably you don't need to have an
>>> additional atomic access to variables.
>> Interrupt context only read it. When the flag was changed and interrupt was
>> triggered in very short time, the flag read at interrupt context might read
>> out old value, if other core(non interrupt context) set the flag and ISR
>> read out 0, ISR will do simple return and misses interrupt handling, making
>> it appear as lost interrupt; if other core clear the flag and ISR read out
>> 1, ISR will continue to run and not stop as expected, making it appear as an
>> interrupt leak. >
>
> Heh... it looks to me very unlikely to happen, but if you say
> that this fixes it, then I'm OK with the patch.
>
> I see still one case missing: the error comes here:
>
> ct_try_receive_message+0x336/0xa10
>
> that is the tasklet that starts after the IRQ. Has the flag
> changed from the irq to the tasklet? :-)
>
> Would it make sense to add something like:
>
> @@ -1338,6 +1338,9 @@ static void ct_receive_tasklet_func(struct tasklet_struct *t)
> {
> struct intel_guc_ct *ct = from_tasklet(ct, t, receive_tasklet);
>
> + if (!atomic_read(&guc->interrupts.enabled))
> + return;
> +
> ct_try_receive_message(ct);
> }
Yes, I did test with this check, result looks the same as not add.
While, consider if the system is under heavy load, tasklet might get
effected as well, so I will add it in next rev.
>
> BTW, have you tried adding a simple memory barrier (even though
> in the i915 community people dislike memory barriers, but with a
> proper explanation it might save us all this churn).
Yes, will add it.
New rev to be posted after another 9 hours test.
Regards,
Zhanjun Dong.
>
> Thanks,
> Andi
More information about the Intel-gfx
mailing list