[PATCH 1/1] drm/xe: serialize store_data and user_interrupt for ufence wait

Yang, Fei fei.yang at intel.com
Fri Aug 15 21:28:32 UTC 2025


> On Fri, Aug 15, 2025 at 03:03:33PM -0600, Yang, Fei wrote:
>>> On Tue, Aug 12, 2025 at 11:28:46AM -0700, fei.yang at intel.com wrote:
>>>> From: Fei Yang <fei.yang at intel.com>
>>>>
>>>> Quote BSpec, MI_STORE_DATA_IMM "simply initiates the write
>>>> operation with command execution proceeding normally. Although the
>>>> write operation is guaranteed to complete eventually, there is no
>>>> mechanism to synchronize command execution with the completion (or
>>>> even
>>>> initiation) of these operations."
>>>
>>> Can we not just update emit_store_imm_ppgtt_posted to use
>>> MI_FLUSH_IMM + a PPGTT address? I think we can get rid of the prior flush added in 3ad7d18c5dad7?
>>>
>>> According to bspec 45725 for MI_FLUSH_IMM:
>>>
>>> 'Usage note: After this command is completed with a Store DWord
>>> enabled, CPU access to graphics memory willbe coherent (assuming the
>>> Render Cache flush is not inhibited).'
>>
>> We already have a MI_FLUSH_DW in between the MI_STORE_DATA_IMM and
>> MI_USER_INTERRUPT (see __emit_job_gen12_simple), but the issue was
>> still reproduced.
>>
>
> That is not what I'm suggesting. I'm suggesting converting emit_store_imm_ppgtt_posted
> to use an MI_FLUSH_DW instruction rather than MI_STORE_DATA_IMM.

Oh, sorry, I misunderstood. Yes, that looks promising. At least the BSpec
says CPU access would be coherent, though I'm not sure what happens
when the write triggers a page fault. I will give it a shot and send a
separate patch if it works.

-Fei

> Matt
>
>> static int emit_flush_imm_ggtt(u32 addr, u32 value, u32 flags, u32 *dw,
>>                                int i)
>> {
>>         dw[i++] = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_IMM_DW |
>>                   flags;
>>         dw[i++] = addr | MI_FLUSH_DW_USE_GTT;
>>         dw[i++] = 0;
>>         dw[i++] = value;
>>
>>         return i;
>> }
>>
>>> Matt
>>>
>>>> The KMD currently emits MI_STORE_DATA_IMM and MI_USER_INTERRUPT
>>>> consecutively to implement the user fence. However, according to the
>>>> BSpec, the data write is not guaranteed to have completed when the
>>>> interrupt is triggered, which can cause xe_wait_user_fence_ioctl to
>>>> wait for the full user-specified timeout before checking the fence
>>>> value again. Significant performance degradation has been observed
>>>> in the IGT xe_exec_fault_mode test cases due to this unnecessary
>>>> wait. In the worst case, if the user sets the timeout to MAX_INT32,
>>>> the wait effectively becomes a hang until some other program happens
>>>> to trigger a user interrupt and wakes it up.
>>>> A semaphore wait is added right after the data write to avoid this
>>>> unexpected wait.
>>>>
>>>> Signed-off-by: Fei Yang <fei.yang at intel.com>
>>>> ---
>>>>  drivers/gpu/drm/xe/instructions/xe_mi_commands.h | 11 +++++++++++
>>>>  drivers/gpu/drm/xe/xe_ring_ops.c                 | 14 ++++++++++++++
>>>>  2 files changed, 25 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
>>>> b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
>>>> index c47b290e0e9f..1c9e7b35c665 100644
>>>> --- a/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
>>>> +++ b/drivers/gpu/drm/xe/instructions/xe_mi_commands.h
>>>> @@ -34,6 +34,17 @@
>>>>  #define MI_FORCE_WAKEUP                     __MI_INSTR(0x1D)
>>>>  #define MI_MATH(n)                  (__MI_INSTR(0x1A) | XE_INSTR_NUM_DW((n) + 1))
>>>>
>>>> +#define MI_SEMAPHORE_WAIT           (__MI_INSTR(0x1c) | XE_INSTR_NUM_DW(5))
>>>> +#define   MI_SEMAPHORE_REGISTER_POLL        REG_BIT(16)
>>>> +#define   MI_SEMAPHORE_POLL         REG_BIT(15)
>>>> +#define   MI_SEMAPHORE_COMP_OP              GENMASK(14, 12)
>>>> +#define   MI_SEMAPHORE_SAD_GT_SDD   REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 0)
>>>> +#define   MI_SEMAPHORE_SAD_GTE_SDD  REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 1)
>>>> +#define   MI_SEMAPHORE_SAD_LT_SDD   REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 2)
>>>> +#define   MI_SEMAPHORE_SAD_LTE_SDD  REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 3)
>>>> +#define   MI_SEMAPHORE_SAD_EQ_SDD   REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 4)
>>>> +#define   MI_SEMAPHORE_SAD_NEQ_SDD  REG_FIELD_PREP(MI_SEMAPHORE_COMP_OP, 5)
>>>> +
>>>>  #define MI_STORE_DATA_IMM           __MI_INSTR(0x20)
>>>>  #define   MI_SDI_GGTT                       REG_BIT(22)
>>>>  #define   MI_SDI_LEN_DW                     GENMASK(9, 0)
>>>> diff --git a/drivers/gpu/drm/xe/xe_ring_ops.c
>>>> b/drivers/gpu/drm/xe/xe_ring_ops.c
>>>> index 5f15360d14bf..189e764e3914 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ring_ops.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ring_ops.c
>>>> @@ -169,6 +169,20 @@ static int emit_store_imm_ppgtt_posted(u64 addr, u64 value,
>>>>      dw[i++] = upper_32_bits(addr);
>>>>      dw[i++] = lower_32_bits(value);
>>>>      dw[i++] = upper_32_bits(value);
>>>> +    dw[i++] = MI_SEMAPHORE_WAIT |
>>>> +              MI_SEMAPHORE_POLL |
>>>> +              MI_SEMAPHORE_SAD_EQ_SDD;
>>>> +    dw[i++] = lower_32_bits(value);
>>>> +    dw[i++] = lower_32_bits(addr);
>>>> +    dw[i++] = upper_32_bits(addr);
>>>> +    dw[i++] = 0;
>>>> +    dw[i++] = MI_SEMAPHORE_WAIT |
>>>> +              MI_SEMAPHORE_POLL |
>>>> +              MI_SEMAPHORE_SAD_EQ_SDD;
>>>> +    dw[i++] = upper_32_bits(value);
>>>> +    dw[i++] = lower_32_bits(addr + 4);
>>>> +    dw[i++] = upper_32_bits(addr);
>>>> +    dw[i++] = 0;
>>>>
>>>>      return i;
>>>>  }
>>>> --
>>>> 2.43.0
>>>>