[PATCH 2/3] drm/xe/xelp: Implement HSDES#16010904313 workarounds

Fri May 23 17:41:35 UTC 2025

On Fri, May 23, 2025 at 11:33:16AM +0100, Tvrtko Ursulin wrote:
> 
> Hi Matt,
> 
> One clarification below:
> 
> On 16/05/2025 20:00, Matt Roper wrote:
> > On Fri, May 16, 2025 at 08:44:11AM +0100, Tvrtko Ursulin wrote:
> > > +static u32 *
> > > +xelp_emit_timestamp_wa(struct xe_lrc *lrc, u32 *cmd)
> > > +{
> > > +	*cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = __xe_lrc_regs_ggtt_addr(lrc) + CTX_TIMESTAMP * sizeof(u32);
> > > +	*cmd++ = 0;
> > > +
> > > +	*cmd++ = MI_LOAD_REGISTER_REG | MI_LRR_SRC_CS_MMIO | MI_LRI_LRM_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = RING_CTX_TIMESTAMP(0).addr;
> > > +
> > > +	*cmd++ = MI_LOAD_REGISTER_REG | MI_LRR_SRC_CS_MMIO | MI_LRI_LRM_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = RING_CTX_TIMESTAMP(0).addr;
> > 
> > This doesn't seem to actually match the documented workaround details.
> > The workaround says we need to do three back-to-back MI_LRM instructions
> > to read CTX_TIMESTAMP into $lrc+0x108C.  The first two are supposed to

I misspoke here; this should have said "*from*" rather than "into."
Although "CTX_TIMESTAMP" is the formal name for the register, our driver
actually calls it RING_CTX_TIMESTAMP for historical reasons.

You can see the workaround description on page 20 of the public TGL PRM
document here:  https://cdrdv2.intel.com/v1/dl/getcontent/705836 (although the
formatting there is a bit painful).  Here's the text from the WA database,
which restores some of the linebreaks and formatting that was lost in the PRM:

  """
  The below workaround must be used to overcome the ctx timestamp issue  
  1\. For BCS/VCS/VECS:  
  \-- In the Per-Context WABB (workaround batch buffer) Software must program 3
  back to back LRM (MI_LOAD_REGISTER_MEM) commands with  
  \- For RCS/CCS  
  \-- In the Indirect Context Pointer, Software must program 3 back to back LRM
  (MI_LOAD_REGISTER_MEM) commands with  
  Dw0[19] = 1, Register Address = CTX_TIMESTAMP and Memory Address = LRCA +
  108Ch.  
  2\. The first two MI_LOAD_REGISTER_MEM commands must have Dw0 bit 21 = 1  
  3\. The third MI_LOAD_REGISTER_MEM command must have Dw0 bit 21 = 0  
  4\. All three commands must have “Add CS MMIO Start Offset” Dw0[19] = 1 to
  enable auto addition of CS MMIO Start Offset.  

  For Example in case of RCS, if LRCA for a given context is DEADh the below
  commands must be programmed in the per-context workaround batch buffer.  

  1\. MI_LOAD_REGISTER_MEM ( dw0[19] = 1, dw0[21]= 1, REGISTER ADDR = 3a8h,
  Memory Address = DEADh + 108Ch  
  2\. MI_LOAD_REGISTER_MEM ( dw0[19] = 1, dw0[21]= 1, REGISTER ADDR = 3a8h,
  Memory Address = DEADh + 108Ch  
  3\. MI_LOAD_REGISTER_MEM ( dw0[19] = 1, dw0[21]= 0, REGISTER ADDR = 3a8h,
  Memory Address = DEADh + 108Ch
  """

As I look at this closer, I see two strange things with this description.
First the beginning of the workaround specifically states that BCS/VCS/VECS use
per-context WABB and RCS/CCS use indirect context pointer...but then the
example at the bottom is supposedly for the RCS engine, but says "per-context
workaround batch buffer."  Second, as you pointed out below, 0x108C into the
LRC isn't the location where CTX_TIMESTAMP is stored.  The first page of the
LRC is the pphwsp, so once we jump past that we're left with an offset of 0x8C
into the state area, which (if I did the napkin math right) would be the
location of CS_GPR7 (and not the context's value of CS_GPR7, which would be at
0x108C+1, but rather the offset of CS_GPR7).  So it seems like we'd end up
loading a value of ($enginebase+0x618) into the timestamp register which
doesn't really seem to make sense.

So I'm not sure if this workaround is missing steps (e.g., are we supposed to
do stash the timestamp into a GPR at some point?), whether it's making
assumptions about driver behavior that only hold true for the Windows driver,
or whether the memory address as incorrect and it was supposed to be the the
location of the CTX_TIMESTAMP value in the LRC (i.e., 0x22+1).

It sounds like the underlying problem here is basically that the LRI in the
context restore races with the regular operation to periodically increment the
timestamp value, causing the LRI's value to occasionally get lost.  It makes
sense to me that that doing the pipelined sequence of or LRM's of the
CTX_TIMESTAMP location within the LRC to forcibly load the state would take
different hardware paths from the simple LRI and might not be succeptible to
the race.  But doing those same LRMs from an unrelated GPR register seems
counterproductive, unless there's some other step that wasn't mentioned.

> > have dw0[21] (i.e., async mode) enabled, and the third should have it
> > disabled.  I'm guessing this got implemented in i915 before the formal
> > workaround description was finalized, or the hardware people updated the
> > description after we'd already implemented the workaround.
> > 
> > That will also prevent us from clobbering the GPR register, which will
> > let us drop the restore_scratch function.
> 
> I only realized this was unclear when I went to work on a respin to address
> it.
> 
> Are you saying the workaround specifies to read from LRC CTX_TIMESTAMP into
> LRC+0x108c? This wouldn't be MI_LRM though.
> 
> Or read from RING_CTX_TIMESTAMP into LRC+0x108c? And this wouldn't be MI_LRM
> either.
> 
> So I don't know what three back-to-back MI_LRM the workaround actually asks
> for.
> 
> Especially when it says first two should have async mode enabled. So maybe
> not to, but _from_ LRC+0x108c? Into RING_CTX_TIMESTAMP three times?
> 
> In which case the question would by why LRC+0x108c. That is where GPR
> registers are saved, right? But it is not GPR0, which one I am not sure
> because I am confused by the fact TGL PRM says that in LRCA there are 64
> dwords for 16 64-bit registers, which should be 32 dwords really, no?

The GPRs are 64-bit registers.  But inside an LRC, restoration of a 64-bit
register is always done as LRI's of the upper and lower dwords separately.  So

   LRI instruction header
   ...
   CS_GPR0 ldw offset
   CS_GPR0 ldw value
   CS_GPR0 udw offset
   CS_GPR0 udw value
   ...

So four dwords in the LRC to restore each GPR register, which results in 64
dwords total.

Note that the CTX_TIMESTAMP register is just 32-bits on these older Xe1
platforms; it was extended to 64-bits in Xe2 and beyond.

Matt

> 
> Even in this case I don't see how writting a random GPR into a
> RING_CTX_TIMESTAMP makes sense.
> 
> Regards,
> 
> Tvrtko
> 
> > 
> > > +
> > > +	return cmd;
> > > +}
> > > +
> > > +static u32 *
> > > +xelp_emit_restore_scratch(struct xe_lrc *lrc, u32 *cmd)
> > > +{
> > > +	*cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = __xe_lrc_regs_ggtt_addr(lrc) + CTX_GPR0 * sizeof(u32);
> > > +	*cmd++ = 0;
> > > +
> > > +	return cmd;
> > > +}
> > > +
> > > +static u32 *
> > > +xelp_emit_cmd_buf_wa(struct xe_lrc *lrc, u32 *cmd)
> > > +{
> > > +	*cmd++ = MI_LOAD_REGISTER_MEM | MI_LRM_USE_GGTT | MI_LRI_LRM_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = __xe_lrc_regs_ggtt_addr(lrc)+ CTX_CMD_BUF_CCTL * sizeof(u32);
> > > +	*cmd++ = 0;
> > > +
> > > +	*cmd++ = MI_LOAD_REGISTER_REG | MI_LRR_SRC_CS_MMIO | MI_LRR_DST_CS_MMIO;
> > > +	*cmd++ = CS_GPR_REG(0, 0).addr;
> > > +	*cmd++ = RING_CMD_BUF_CCTL(0).addr;
> > 
> > I don't see any documented workaround asking us to do this restoration
> > of the CMD_BUF_CCTL register.  I suspect the i915 implementation was
> > either something for pre-production hardware that never should have
> > stuck around in the driver, or it was hacking around some other driver
> > bug.  Are you finding this to actually be necessary for something today?
> > 
> > > +
> > > +	return cmd;
> > > +}
> > > +
> > > +static u32 *
> > > +xelp_setup_indirect_ctx_rcs(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > > +			    u32 *cmd)
> > > +{
> > > +	/* HSDES#16010904313 */
> > > +	cmd = xelp_emit_timestamp_wa(lrc, cmd);
> > > +	cmd = xelp_emit_cmd_buf_wa(lrc, cmd);
> > > +	cmd = xelp_emit_restore_scratch(lrc, cmd);
> > > +
> > > +	return cmd;
> > > +}
> > > +
> > > +static u32 *
> > > +xelp_setup_indirect_ctx_xcs(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > > +			    u32 *cmd)
> > > +{
> > > +	/* HSDES#16010904313 */
> > > +	cmd = xelp_emit_timestamp_wa(lrc, cmd);
> > > +	cmd = xelp_emit_restore_scratch(lrc, cmd);
> > > +
> > > +	return cmd;
> > > +}
> > > +
> > > +static void
> > > +xelp_setup_indirect_ctx(struct xe_lrc *lrc, struct xe_hw_engine *hwe)
> > > +{
> > > +	u32 *start, *cmd, *regs;
> > > +	struct iosys_map map;
> > > +	u32 ggtt, offset;
> > > +
> > > +	if (!(lrc->flags & XE_LRC_FLAG_INDIRECT_CTX))
> > > +		return;
> > > +
> > > +	offset = lrc->size - LRC_INDIRECT_CTX_SIZE;
> > > +	ggtt = xe_bo_ggtt_addr(lrc->bo) + offset;
> > > +
> > > +	map = lrc->bo->vmap;
> > > +	iosys_map_incr(&map, offset);
> > > +	start = cmd = map.vaddr;
> > 
> > Isn't this going to be a problem if the iosys_map is holding iomem?  If
> > we're going to build the indirect context batch directly with CPU
> > pointer accesses as the functions above do, don't we need to do that in
> > a separate buffer and then copy it into the iosys_map?
> > 
> > 
> > > +
> > > +	if (hwe->class == XE_ENGINE_CLASS_RENDER)
> > > +		cmd = xelp_setup_indirect_ctx_rcs(lrc, hwe, cmd);
> > 
> > This also deviates from the documented workaround.  According to the
> > workaround database, INDIRECT_CTX should be used for RCS/CCS, but we're
> > supposed to use BB_PER_CTX_PTR instead for RCS/CCS (both are similar
> > mechanisms, but INDIRECT_CTX happens somewhere in the middle of a
> > context restore, whereas BB_PER_CTX_PTR happens at the very end of the
> > context restore).
> > 
> > 
> > Matt
> > 
> > > +	else
> > > +		cmd = xelp_setup_indirect_ctx_xcs(lrc, hwe, cmd);
> > > +
> > > +	while ((unsigned long)cmd & 0x3f) /* Align to 64B cacheline. */
> > > +		*cmd++ = MI_NOOP;
> > > +
> > > +	map = __xe_lrc_regs_map(lrc);
> > > +	regs = map.vaddr;
> > > +
> > > +	regs[CTX_CS_INDIRECT_CTX] = ggtt | ((cmd - start) * sizeof(u32) / 64); /* Size in CLs. */
> > > +	regs[CTX_CS_INDIRECT_CTX_OFFSET] = XELP_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT << 6;
> > > +}
> > > +
> > >   #define PVC_CTX_ASID		(0x2e + 1)
> > >   #define PVC_CTX_ACC_CTR_THOLD	(0x2a + 1)
> > > @@ -997,6 +1106,10 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > >   	lrc->size = ring_size + lrc_size;
> > >   	lrc->ring.size = ring_size;
> > >   	lrc->ring.tail = 0;
> > > +
> > > +	if (xe_needs_indirect_ctx(xe))
> > > +		lrc->flags |= XE_LRC_FLAG_INDIRECT_CTX;
> > > +
> > >   	if (xe_gt_has_indirect_ring_state(gt))
> > >   		lrc->flags |= XE_LRC_FLAG_INDIRECT_RING_STATE;
> > > @@ -1127,6 +1240,8 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > >   	xe_lrc_setup_utilization(lrc);
> > > +	xelp_setup_indirect_ctx(lrc, hwe);
> > > +
> > >   	return 0;
> > >   err_lrc_finish:
> > > diff --git a/drivers/gpu/drm/xe/xe_lrc_types.h b/drivers/gpu/drm/xe/xe_lrc_types.h
> > > index 559c7c831212..9ce7d02ef210 100644
> > > --- a/drivers/gpu/drm/xe/xe_lrc_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_lrc_types.h
> > > @@ -29,7 +29,8 @@ struct xe_lrc {
> > >   	struct xe_gt *gt;
> > >   	/** @flags: LRC flags */
> > > -#define XE_LRC_FLAG_INDIRECT_RING_STATE		0x1
> > > +#define XE_LRC_FLAG_INDIRECT_CTX		0x1
> > > +#define XE_LRC_FLAG_INDIRECT_RING_STATE		0x2
> > >   	u32 flags;
> > >   	/** @refcount: ref count of this lrc */
> > > -- 
> > > 2.48.0
> > > 
> > 
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation