[PATCH v3 2/2] drm/xe/guc: Scale mmio send/recv timeout for CCS save/restore with smem size

John Harrison <john.c.harrison@intel.com>
Wed Aug 6 19:57:40 UTC 2025


On 8/6/2025 9:28 AM, Matthew Brost wrote:
> On Wed, Aug 06, 2025 at 01:59:10PM +0530, Satyanarayana K V P wrote:
>> After VF migration, GUC restores CCS metadata scaled to system memory size.
>> The default timeout (50ms) is calibrated for 4GB memory capacity per
>> specification. Timeouts for other memory sizes are proportionally derived
>> from this baseline.
>>
>> This ensures adequate restoration time for CCS metadata across
>> different hardware configurations while maintaining spec compliance.
>>
>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc.c | 33 ++++++++++++++++++++++++++++++++-
>>   1 file changed, 32 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>> index 9e34401e4489..d836ded83491 100644
>> --- a/drivers/gpu/drm/xe/xe_guc.c
>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>> @@ -10,6 +10,7 @@
>>   #include <generated/xe_wa_oob.h>
>>   
>>   #include "abi/guc_actions_abi.h"
>> +#include "abi/guc_actions_sriov_abi.h"
>>   #include "abi/guc_errors_abi.h"
>>   #include "regs/xe_gt_regs.h"
>>   #include "regs/xe_gtt_defs.h"
>> @@ -1397,6 +1398,36 @@ int xe_guc_auth_huc(struct xe_guc *guc, u32 rsa_addr)
>>   	return xe_guc_ct_send_block(&guc->ct, action, ARRAY_SIZE(action));
>>   }
>>   
>> +/*
>> + * After VF migration, GUC restores CCS metadata scaled to system memory size.
>> + * Default timeout (50ms) is calibrated for 4GB memory capacity per
>> + * specification. Timeouts for other memory sizes are proportionally derived
>> + * from this baseline.
>> + */
>> +static u32 guc_mmio_send_recv_timeout(struct xe_guc *guc, const u32 *request)
>> +{
>> +	struct xe_device *xe = guc_to_xe(guc);
>> +	u32 timeout = 50000;
> Is this really the upper bound? It seems like it could be significantly
> higher if multiple VFs are trying to do things all at the same time.
That is really a problem with the wait function itself rather than the 
timeout. The timeout is meant to be the maximum expectation for how long 
the operation will take once started. Unfortunately, we currently have 
no checks on whether GuC has actually read the message itself before 
starting that timer.
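
Something along these lines is what I mean - a rough sketch only, not 
actual xe code: guc_mmio_wait_reply() is a hypothetical helper, and it 
assumes the HXG BUSY handshake is enough to tell us GuC has at least 
consumed the request:

static int guc_mmio_wait_reply(struct xe_guc *guc, struct xe_reg reply_reg,
			       u32 scaled_timeout_us, u32 *reply)
{
	struct xe_mmio *mmio = &guc_to_gt(guc)->mmio;
	int ret;

	/* Short, fixed bound for GuC to pick the request up at all */
	ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
			     FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
			     50000, reply, false);
	if (ret)
		return ret;

	/* Only start the long, memory-scaled timer once GuC has answered */
	if (FIELD_GET(GUC_HXG_MSG_0_TYPE, *reply) == GUC_HXG_TYPE_NO_RESPONSE_BUSY)
		ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_TYPE,
				     FIELD_PREP(GUC_HXG_MSG_0_TYPE,
						GUC_HXG_TYPE_RESPONSE_SUCCESS),
				     scaled_timeout_us, reply, false);

	return ret;
}

Error and failure replies would obviously still need handling, but the 
point is that the expensive restore only gets the big timeout once we 
know GuC is actually processing the request.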

There is also the opposite concern - what happens to any other VF (or 
PF) that is trying to get work done while the GPU is tied up migrating 
this VF? With this scaling, a 1TB host works out at a factor of 
(1TB + 4GB) / 4GB = 257, i.e. a wait of up to 257 * 50ms = 12.85s. A 
stall of multiple seconds will cause all sorts of timeouts to trip.

I think the expectation is that migration is a deliberate act and the 
system is not going to be doing anything else at the time. It is not 
something that just randomly occurs in the middle of a heavily loaded 
system. But I may be wrong on that?

>
> Scaling the timeout itself, does make sense though.
>
> Matt
>
>> +	u32 action, factor;
>> +	struct sysinfo si;
>> +	u64 sys_mem_size;
>> +
>> +	action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, request[0]);
>> +	if (action != GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE || IS_DGFX(xe) ||
>> +	    !xe_device_has_flat_ccs(xe))
>> +		return timeout;
>> +
>> +	si_meminfo(&si);
>> +	sys_mem_size = si.totalram * si.mem_unit;
Do we have to worry about Linux supporting >64 bit addressing any time 
soon? I assume the reason for having separate units here is that the 
total might be over 64 bits? Or are there no plans for 6-level page 
tables yet?
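
If overflow ever does become a real concern, something along these 
lines would make the saturation explicit rather than silent (sketch 
only - guc_total_sysmem_bytes() is a hypothetical helper, using 
check_mul_overflow() from linux/overflow.h):

static u64 guc_total_sysmem_bytes(void)
{
	struct sysinfo si;
	u64 bytes;

	si_meminfo(&si);

	/* Saturate rather than silently wrap if this ever exceeds 64 bits */
	if (check_mul_overflow((u64)si.totalram, (u64)si.mem_unit, &bytes))
		return U64_MAX;

	return bytes;
}

Not needed today - totalram is an unsigned long count of mem_unit sized 
chunks, so the product comfortably fits in a u64 on current systems - 
but it documents the assumption.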

John.

>> +
>> +	if (sys_mem_size <= SZ_4G)
>> +		return timeout;
>> +
>> +	factor = (sys_mem_size + SZ_4G) / SZ_4G;
>> +	timeout *= factor;
>> +
>> +	return timeout;
>> +}
>>   int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>   			  u32 len, u32 *response_buf)
>>   {
>> @@ -1439,7 +1470,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>   
>>   	ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
>>   			     FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
>> -			     50000, &reply, false);
>> +			     guc_mmio_send_recv_timeout(guc, request), &reply, false);
>>   	if (ret) {
>>   		/* scratch registers might be cleared during FLR, try once more */
>>   		if (!reply && !lost) {
>> -- 
>> 2.43.0
>>


