[PATCH v3 2/2] drm/xe/guc: Scale mmio send/recv timeout for CCS save/restore with smem size
Matthew Brost
matthew.brost at intel.com
Wed Aug 6 23:01:47 UTC 2025
On Wed, Aug 06, 2025 at 12:57:40PM -0700, John Harrison wrote:
> On 8/6/2025 9:28 AM, Matthew Brost wrote:
> > On Wed, Aug 06, 2025 at 01:59:10PM +0530, Satyanarayana K V P wrote:
> > > After VF migration, GUC restores CCS metadata scaled to system memory size.
> > > The default timeout (50ms) is calibrated for 4GB memory capacity per
> > > specification. Timeouts for other memory sizes are proportionally derived
> > > from this baseline.
> > >
> > > This ensures adequate restoration time for CCS metadata across
> > > different hardware configurations while maintaining spec compliance.
> > >
> > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_guc.c | 33 ++++++++++++++++++++++++++++++++-
> > > 1 file changed, 32 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > > index 9e34401e4489..d836ded83491 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > > @@ -10,6 +10,7 @@
> > > #include <generated/xe_wa_oob.h>
> > > #include "abi/guc_actions_abi.h"
> > > +#include "abi/guc_actions_sriov_abi.h"
> > > #include "abi/guc_errors_abi.h"
> > > #include "regs/xe_gt_regs.h"
> > > #include "regs/xe_gtt_defs.h"
> > > @@ -1397,6 +1398,36 @@ int xe_guc_auth_huc(struct xe_guc *guc, u32 rsa_addr)
> > > return xe_guc_ct_send_block(&guc->ct, action, ARRAY_SIZE(action));
> > > }
> > > +/*
> > > + * After VF migration, GUC restores CCS metadata scaled to system memory size.
> > > + * Default timeout (50ms) is calibrated for 4GB memory capacity per
> > > + * specification. Timeouts for other memory sizes are proportionally derived
> > > + * from this baseline.
> > > + */
> > > +static u32 guc_mmio_send_recv_timeout(struct xe_guc *guc, const u32 *request)
> > > +{
> > > + struct xe_device *xe = guc_to_xe(guc);
> > > + u32 timeout = 50000;
> > Is this really the upper bound? It seems like it could be significantly
> > higher if multiple VFs are trying to do things all at the same time.
> That is really a problem with the wait function itself rather than the
> timeout. The timeout is meant to be the maximum expectation for how long the
> operation will take once started. Unfortunately, we currently have no checks
> on whether GuC has actually read the message itself before starting that
> timer.
>
> There is also the opposite concern - what happens to any other VF (or PF)
> that is trying to get work done while the GPU is tied up migrating this VF?
> A stall of multiple seconds will cause all sorts of timeouts to trip.
>
It should only tie up the migration engine; I think the GuC is free to do
other work while that is running, right? Sure, the migration engine is
used by a lot of things - clears for new memory allocations, binds,
etc. - but I don't think anything would actually time out, it would just
see a pretty heavy delay. Could be wrong on this.
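
To put rough numbers on that delay - purely an illustrative sketch on my
side, reusing the factor computation from the patch above with its
50000 us (50 ms) baseline, not code that would go into the driver, and
with memory sizes picked arbitrarily:

#include <stdio.h>

/* Stand-alone sketch: how the proposed timeout scales with system memory. */
int main(void)
{
        const unsigned long long SZ_4G = 4ULL << 30;
        /* Example system memory sizes, chosen only for illustration. */
        const unsigned long long sizes[] = {
                4ULL << 30, 16ULL << 30, 64ULL << 30, 256ULL << 30, 1ULL << 40,
        };

        for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                unsigned long long mem = sizes[i];
                unsigned int timeout_us = 50000;        /* 50 ms baseline */

                /* Same scaling as guc_mmio_send_recv_timeout() in the patch. */
                if (mem > SZ_4G)
                        timeout_us *= (mem + SZ_4G) / SZ_4G;

                printf("%4llu GB -> %u us (~%u ms)\n",
                       mem >> 30, timeout_us, timeout_us / 1000);
        }

        return 0;
}

So a 256 GB host already puts the single MMIO wait into the multi-second
range mentioned above, before any queuing behind other VFs is even
considered.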
> I think the expectation is that migration is a deliberate act and the system
> is not going to be doing anything else at the time. It is not something that
> just randomly occurs in the middle of a heavily loaded system. But I may be
> wrong on that?
I'm unsure about this part too. If it is not some random event, then my
original concern would be invalid.
Matt
>
> >
> > Scaling the timeout itself does make sense though.
> >
> > Matt
> >
> > > + u32 action, factor;
> > > + struct sysinfo si;
> > > + u64 sys_mem_size;
> > > +
> > > + action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, request[0]);
> > > + if (action != GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE || IS_DGFX(xe) ||
> > > + !xe_device_has_flat_ccs(xe))
> > > + return timeout;
> > > +
> > > + si_meminfo(&si);
> > > + sys_mem_size = si.totalram * si.mem_unit;
> Do we have to worry about Linux supporting >64bit addressing any time soon?
> I assume the reason for having separate units here is that the total might
> be over 64 bits? Or are there no plans for 6-level page tables yet?
>
> John.
>
> > > +
> > > + if (sys_mem_size <= SZ_4G)
> > > + return timeout;
> > > +
> > > + factor = (sys_mem_size + SZ_4G) / SZ_4G;
> > > + timeout *= factor;
> > > +
> > > + return timeout;
> > > +}
> > > int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> > > u32 len, u32 *response_buf)
> > > {
> > > @@ -1439,7 +1470,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> > > ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
> > > FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
> > > - 50000, &reply, false);
> > > + guc_mmio_send_recv_timeout(guc, request), &reply, false);
> > > if (ret) {
> > > /* scratch registers might be cleared during FLR, try once more */
> > > if (!reply && !lost) {
> > > --
> > > 2.43.0
> > >
>