[PATCH] drm/i915/guc: Add synchronization on interrupt enable flag

Dong, Zhanjun zhanjun.dong at intel.com
Thu Aug 21 14:39:14 UTC 2025



On 2025-08-20 6:05 p.m., Andi Shyti wrote:
> Hi Zhanjun,
> 
> ...
> 
>>> On Tue, Aug 19, 2025 at 12:00:10PM -0400, Zhanjun Dong wrote:
>>>> Boolean flag access from interrupt context might have synchronous issue on
>>>
>>> /synchronous/synchronization/
>>> /issue/issues/
>>>
>>>> multiple processor platform, flage modified by one core might be read as an
>>>
>>> /flage/flags/
>>>
>>>> old value by another core. This issue on interrupt enable flag might causes
>>>> interrupt missing or leakage.
>>>
>>> /missing/misses/
>>>
>>>> Fixed by change the data type as automic to add memory synchronization.
>>>
>>> This can be re-written: "Change the interrupts.enable type to
>>> atomic to ensure memory synchronization."
>> Will follow all above comments
> 
> No need to resend for this if the patch is fine.
> 
>>>> Fixes: a187f13d51fa0 ("drm/i915/guc: handle interrupts from media GuC")
>>>
>>> What issue are you exactly trying to fix? And have you tested
>> Will add:
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14834
>>
>>> that this patch is this really fixing it? Where is the flag's
>> CI report this issue about every 1 or 2 months and close it because no
>> reproduce.
>> I can reproduce it in about 1000 rounds(about 9 hours) of IGT test but not
>> at 100% rate, so it is hard to say really fixed because of hard to
>> reproduce.
>> My latest test runs 2200 rounds in total has no reproduce.
>>
>>> misalignment happening?
>>>
>>> Please remember that when in interrupt context you are already
>>> running in atomic, so that probably you don't need to have an
>>> additional atomic access to variables.
>> Interrupt context only read it. When the flag was changed and interrupt was
>> triggered in very short time, the flag read at interrupt context might read
>> out old value, if other core(non interrupt context) set the flag and ISR
>> read out 0, ISR will do simple return and misses interrupt handling, making
>> it appear as lost interrupt; if other core clear the flag and ISR read out
>> 1, ISR will continue to run and not stop as expected, making it appear as an
>> interrupt leak. >
> 
> Heh... it looks to me very unlikely to happen, but if you say
> that this fixes it, then I'm OK with the patch.
> 
> I see still one case missing: the error comes here:
> 
>    ct_try_receive_message+0x336/0xa10
> 
> that is the tasklet that starts after the IRQ. Has the flag
> changed from the irq to the tasklet? :-)
> 
> Would it make sense to add something like:
> 
> @@ -1338,6 +1338,9 @@ static void ct_receive_tasklet_func(struct tasklet_struct *t)
>   {
>          struct intel_guc_ct *ct = from_tasklet(ct, t, receive_tasklet);
> 
> +       if (!atomic_read(&guc->interrupts.enabled))
> +               return;
> +
>          ct_try_receive_message(ct);
>   }
Yes, I did test with this check, result looks the same as not add.
While, consider if the system is under heavy load, tasklet might get 
effected as well, so I will add it in next rev.
> 
> BTW, have you tried adding a simple memory barrier (even though
> in the i915 community people dislike memory barriers, but with a
> proper explanation it might save us all this churn).
Yes, will add it.

New rev to be posted after another 9 hours test.

Regards,
Zhanjun Dong.
> 
> Thanks,
> Andi



More information about the Intel-gfx mailing list