[PATCH 1/2] drm/xe/lrc: Use a temporary buffer for WA BB
Lucas De Marchi
lucas.demarchi at intel.com
Thu Jun 5 05:15:13 UTC 2025
On Wed, Jun 04, 2025 at 05:02:49PM -0700, Matt Roper wrote:
>On Wed, Jun 04, 2025 at 08:03:05AM -0700, Lucas De Marchi wrote:
>> In case the BO is in iomem, we can't simply take the vaddr and write to
>> it. Instead, prepare a separate buffer that is later copied into io
>> memory. Right now it's just a few words that could be using
>> xe_map_write32(), but the intention is to grow the WA BB for other
>> uses.
>>
>> Fixes: 82b98cadb01f ("drm/xe: Add WA BB to capture active context utilization")
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa at intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Signed-off-by: Lucas De Marchi <lucas.demarchi at intel.com>
>> ---
>> drivers/gpu/drm/xe/xe_lrc.c | 24 ++++++++++++++++++++----
>> 1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
>> index 63d74e27f54cf..bf7c3981897de 100644
>> --- a/drivers/gpu/drm/xe/xe_lrc.c
>> +++ b/drivers/gpu/drm/xe/xe_lrc.c
>> @@ -941,11 +941,18 @@ static void xe_lrc_finish(struct xe_lrc *lrc)
>> * store it in the PPHSWP.
>> */
>> #define CONTEXT_ACTIVE 1ULL
>> -static void xe_lrc_setup_utilization(struct xe_lrc *lrc)
>> +static int xe_lrc_setup_utilization(struct xe_lrc *lrc)
>> {
>> - u32 *cmd;
>> + u32 *cmd, *buf = NULL;
>>
>> - cmd = lrc->bb_per_ctx_bo->vmap.vaddr;
>> + if (lrc->bb_per_ctx_bo->vmap.is_iomem) {
>> + buf = kmalloc(lrc->bb_per_ctx_bo->size, GFP_KERNEL);
>
>I was originally worried about not zero'ing this buffer with kzalloc,
>but since we only copy the specific words that we write here into the
>real LRC, the garbage in the rest of the space shouldn't matter.
>
>size here is always 4k today; is it okay to be using kmalloc for a bunch
>of 4k allocations or should we be using kmem_cache or alloc_page for
>these? I'm not sure what the allocation recommendations are these days
>for page-sized allocations.
>
>I'm also wondering if there's any way to avoid doing this independently
>for every context by using CURRENT_LRCA with either WPARID or MI_MATH to
>eliminate the ggtt offsets (i.e., the only parts of the batch that
>differ from one context to the next). If the actual contents of the
>workaround batchbuffers are the same for every single context, then the
>allocation here is only needed once, and even the LRC's actual
>workaround batchbuffer could be pointed at a single global page rather
>than a per-context allocation inside the LRC. That might be beyond the
>scope of this series though...
I'm not sure about eliminating the ggtt offsets. What I thought about was to
eventually move the allocation much earlier, as you suggest, and re-use it
across multiple contexts. However, I'd see that as a future optimization...
right now there are 4 major goals:
1) Fix the bug of potentially using iomem as if it were system memory
2) Prepare the ground for additional WAs (see the patch series from
Tvrtko)
3) Prepare the ground for the additional commands from configfs (see the
other patch series linked, which is buggy, but I have it fixed locally)
4) Re-use similar infra to also expose the indirect ptr for the same
things via configfs
Then later I think we can move this to a single buffer that is shared
across contexts. We will need one per engine class.
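The fix in the patch boils down to a bounce-buffer pattern: emit the batch
into plain system memory, then transfer the finished words to the BO with a
single bulk copy that is safe for iomem. Here is a minimal userspace sketch
of that pattern; the struct, the helper name, and the command words are all
hypothetical stand-ins, not the real xe definitions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the BO: in the driver, the destination is an
 * iosys_map that may point at iomem and must not be dereferenced directly. */
struct fake_bo {
	int is_iomem;
	uint32_t backing[16];
};

/* Emit the batch into a temporary system-memory buffer, then copy the
 * emitted words to the BO in one shot.  In the driver the copy would be
 * xe_map_memcpy_to(), which picks the iomem-safe variant as needed.
 * Returns the number of words emitted, or -1 on allocation failure. */
static int setup_wa_bb(struct fake_bo *bo)
{
	uint32_t *buf, *cmd;
	size_t n;

	buf = cmd = malloc(sizeof(bo->backing));
	if (!buf)
		return -1;

	*cmd++ = 0x11000001; /* placeholder command words, not real MI_* */
	*cmd++ = 0x00002000;
	*cmd++ = 0x05000000; /* placeholder for MI_BATCH_BUFFER_END */

	n = cmd - buf;
	memcpy(bo->backing, buf, n * sizeof(*buf)); /* single bulk copy */
	free(buf);
	return (int)n;
}
```

The same emit-then-copy shape is what would let the temporary buffer later
be built once and shared across contexts, with only the copy (and any
per-context relocation) done per LRC.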
Ack?
thanks
Lucas De Marchi
>
>
>Matt
>
>> + if (!buf)
>> + return -ENOMEM;
>> + cmd = buf;
>> + } else {
>> + cmd = lrc->bb_per_ctx_bo->vmap.vaddr;
>> + }
>>
>> *cmd++ = MI_STORE_REGISTER_MEM | MI_SRM_USE_GGTT | MI_SRM_ADD_CS_OFFSET;
>> *cmd++ = ENGINE_ID(0).addr;
>> @@ -966,9 +973,16 @@ static void xe_lrc_setup_utilization(struct xe_lrc *lrc)
>>
>> *cmd++ = MI_BATCH_BUFFER_END;
>>
>> + if (buf) {
>> + xe_map_memcpy_to(gt_to_xe(lrc->gt), &lrc->bb_per_ctx_bo->vmap, 0,
>> + buf, (cmd - buf) * sizeof(*cmd));
>> + kfree(buf);
>> + }
>> +
>> xe_lrc_write_ctx_reg(lrc, CTX_BB_PER_CTX_PTR,
>> xe_bo_ggtt_addr(lrc->bb_per_ctx_bo) | 1);
>>
>> + return 0;
>> }
>>
>> #define PVC_CTX_ASID (0x2e + 1)
>> @@ -1125,7 +1139,9 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
>> map = __xe_lrc_start_seqno_map(lrc);
>> xe_map_write32(lrc_to_xe(lrc), &map, lrc->fence_ctx.next_seqno - 1);
>>
>> - xe_lrc_setup_utilization(lrc);
>> + err = xe_lrc_setup_utilization(lrc);
>> + if (err)
>> + goto err_lrc_finish;
>>
>> return 0;
>>
>>
>> --
>> 2.49.0
>>
>
>--
>Matt Roper
>Graphics Software Engineer
>Linux GPU Platform Enablement
>Intel Corporation