[PATCH 1/2] drm/xe/lrc: Use a temporary buffer for WA BB
Matt Roper
matthew.d.roper at intel.com
Thu Jun 5 00:02:49 UTC 2025
On Wed, Jun 04, 2025 at 08:03:05AM -0700, Lucas De Marchi wrote:
> In case the BO is in iomem, we can't simply take the vaddr and write to
> it. Instead, prepare a separate buffer that is later copied into io
> memory. Right now it's just a few words that could be using
> xe_map_write32(), but the intention is to grow the WA BB for other
> uses.
>
> Fixes: 82b98cadb01f ("drm/xe: Add WA BB to capture active context utilization")
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
> ---
> drivers/gpu/drm/xe/xe_lrc.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> index 63d74e27f54cf..bf7c3981897de 100644
> --- a/drivers/gpu/drm/xe/xe_lrc.c
> +++ b/drivers/gpu/drm/xe/xe_lrc.c
> @@ -941,11 +941,18 @@ static void xe_lrc_finish(struct xe_lrc *lrc)
> * store it in the PPHSWP.
> */
> #define CONTEXT_ACTIVE 1ULL
> -static void xe_lrc_setup_utilization(struct xe_lrc *lrc)
> +static int xe_lrc_setup_utilization(struct xe_lrc *lrc)
> {
> - u32 *cmd;
> + u32 *cmd, *buf = NULL;
>
> - cmd = lrc->bb_per_ctx_bo->vmap.vaddr;
> + if (lrc->bb_per_ctx_bo->vmap.is_iomem) {
> + buf = kmalloc(lrc->bb_per_ctx_bo->size, GFP_KERNEL);
I was originally worried about not zeroing this buffer (i.e., using
kmalloc instead of kzalloc), but since we only copy the specific words
that we write here into the real LRC, the garbage in the rest of the
buffer shouldn't matter.
The size here is always 4k today; is it okay to use kmalloc for a bunch
of 4k allocations, or should we be using kmem_cache or alloc_page for
these? I'm not sure what the current recommendations are for page-sized
allocations.
I'm also wondering if there's any way to avoid doing this independently
for every context by using CURRENT_LRCA with either WPARID or MI_MATH to
eliminate the ggtt offsets (i.e., the only parts of the batch that
differ from one context to the next). If the actual contents of the
workaround batchbuffers are the same for every single context, then the
allocation here is only needed once, and even the LRC's actual
workaround batchbuffer could be pointed at a single global page rather
than a per-context allocation inside the LRC. That might be beyond the
scope of this series though...
Matt
> + if (!buf)
> + return -ENOMEM;
> + cmd = buf;
> + } else {
> + cmd = lrc->bb_per_ctx_bo->vmap.vaddr;
> + }
>
> *cmd++ = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT | MI_SRM_ADD_CS_OFFSET;
> *cmd++ = ENGINE_ID(0).addr;
> @@ -966,9 +973,16 @@ static void xe_lrc_setup_utilization(struct xe_lrc *lrc)
>
> *cmd++ = MI_BATCH_BUFFER_END;
>
> + if (buf) {
> + xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bb_per_ctx_bo->vmap, 0,
> + buf, (cmd - buf) * sizeof(*cmd));
> + kfree(buf);
> + }
> +
> xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR,
> xe_bo_ggtt_addr(lrc->bb_per_ctx_bo) | 1);
>
> + return 0;
> }
>
> #define PVC_CTX_ASID (0x2e + 1)
> @@ -1125,7 +1139,9 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> map = __xe_lrc_start_seqno_map(lrc);
> xe_map_write32(lrc_to_xe(lrc), &map, lrc->fence_ctx.next_seqno - 1);
>
> - xe_lrc_setup_utilization(lrc);
> + err = xe_lrc_setup_utilization(lrc);
> + if (err)
> + goto err_lrc_finish;
>
> return 0;
>
>
> --
> 2.49.0
>
--
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation