[PATCH v3 2/2] drm/xe/guc: Scale mmio send/recv timeout for CCS save/restore with smem size
Matthew Brost
matthew.brost at intel.com
Wed Aug 6 23:01:47 UTC 2025
On Wed, Aug 06, 2025 at 12:57:40PM -0700, John Harrison wrote:
> On 8/6/2025 9:28 AM, Matthew Brost wrote:
> > On Wed, Aug 06, 2025 at 01:59:10PM +0530, Satyanarayana K V P wrote:
> > > After VF migration, GUC restores CCS metadata scaled to system memory size.
> > > The default timeout (50ms) is calibrated for 4GB memory capacity per
> > > specification. Timeouts for other memory sizes are proportionally derived
> > > from this baseline.
> > >
> > > This ensures adequate restoration time for CCS metadata across
> > > different hardware configurations while maintaining spec compliance.
> > >
> > > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
> > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_guc.c | 33 ++++++++++++++++++++++++++++++++-
> > > 1 file changed, 32 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
> > > index 9e34401e4489..d836ded83491 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc.c
> > > @@ -10,6 +10,7 @@
> > > #include <generated/xe_wa_oob.h>
> > > #include "abi/guc_actions_abi.h"
> > > +#include "abi/guc_actions_sriov_abi.h"
> > > #include "abi/guc_errors_abi.h"
> > > #include "regs/xe_gt_regs.h"
> > > #include "regs/xe_gtt_defs.h"
> > > @@ -1397,6 +1398,36 @@ int xe_guc_auth_huc(struct xe_guc *guc, u32 rsa_addr)
> > > return xe_guc_ct_send_block(&guc->ct, action, ARRAY_SIZE(action));
> > > }
> > > +/*
> > > + * After VF migration, GUC restores CCS metadata scaled to system memory size.
> > > + * Default timeout (50ms) is calibrated for 4GB memory capacity per
> > > + * specification. Timeouts for other memory sizes are proportionally derived
> > > + * from this baseline.
> > > + */
> > > +static u32 guc_mmio_send_recv_timeout(struct xe_guc *guc, const u32 *request)
> > > +{
> > > + struct xe_device *xe = guc_to_xe(guc);
> > > + u32 timeout = 50000;
> > Is this really the upper bound? It seems like it could be significantly
> > higher if multiple VFs are trying to do things all at the same time.
> That is really a problem with the wait function itself rather than the
> timeout. The timeout is meant to be the maximum expectation for how long the
> operation will take once started. Unfortunately, we currently have no checks
> on whether GuC has actually read the message itself before starting that
> timer.
>
> There is also the opposite concern - what happens to any other VF (or PF)
> that is trying to get work done while the GPU is tied up migrating this VF?
> A stall of multiple seconds will cause all sorts of timeouts to trip.
>
It should only tie up the migration engine; I think the GuC is free to do
other work while that is running, right? Sure, the migration engine is
used by a lot of things - clears for new memory allocations, binds,
etc. - but I don't think anything would actually time out, it would just
see a pretty heavy delay. Could be wrong on this.
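
To put rough numbers on that delay - purely an illustrative sketch on my
side, reusing the factor computation from the patch above with its
50000 us (50 ms) baseline, not code that would go into the driver, and
with memory sizes picked arbitrarily:

#include <stdio.h>

/* Stand-alone sketch: how the proposed timeout scales with system memory. */
int main(void)
{
        const unsigned long long SZ_4G = 4ULL << 30;
        /* Example system memory sizes, chosen only for illustration. */
        const unsigned long long sizes[] = {
                4ULL << 30, 16ULL << 30, 64ULL << 30, 256ULL << 30, 1ULL << 40,
        };

        for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                unsigned long long mem = sizes[i];
                unsigned int timeout_us = 50000;        /* 50 ms baseline */

                /* Same scaling as guc_mmio_send_recv_timeout() in the patch. */
                if (mem > SZ_4G)
                        timeout_us *= (mem + SZ_4G) / SZ_4G;

                printf("%4llu GB -> %u us (~%u ms)\n",
                       mem >> 30, timeout_us, timeout_us / 1000);
        }

        return 0;
}

So a 256 GB host already puts the single MMIO wait into the multi-second
range mentioned above, before any queuing behind other VFs is even
considered.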
> I think the expectation is that migration is a deliberate act and the system
> is not going to be doing anything else at the time. It is not something that
> just randomly occurs in the middle of a heavily loaded system. But I may be
> wrong on that?
I'm unsure about this part too. If it is not some random event, then my
original concern would be invalid.
Matt
>
> >
> > Scaling the timeout itself does make sense though.
> >
> > Matt
> >
> > > + u32 action, factor;
> > > + struct sysinfo si;
> > > + u64 sys_mem_size;
> > > +
> > > + action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, request[0]);
> > > + if (action != GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE || IS_DGFX(xe) ||
> > > + !xe_device_has_flat_ccs(xe))
> > > + return timeout;
> > > +
> > > + si_meminfo(&si);
> > > + sys_mem_size = si.totalram * si.mem_unit;
> Do we have to worry about Linux supporting >64bit addressing any time soon?
> I assume the reason for having separate units here is that the total might
> be over 64 bits? Or are there no plans for 6-level page tables yet?
>
> John.
>
> > > +
> > > + if (sys_mem_size <= SZ_4G)
> > > + return timeout;
> > > +
> > > + factor = (sys_mem_size + SZ_4G) / SZ_4G;
> > > + timeout *= factor;
> > > +
> > > + return timeout;
> > > +}
> > > int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> > > u32 len, u32 *response_buf)
> > > {
> > > @@ -1439,7 +1470,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
> > > ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
> > > FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
> > > - 50000, &reply, false);
> > > + guc_mmio_send_recv_timeout(guc, request), &reply, false);
> > > if (ret) {
> > > /* scratch registers might be cleared during FLR, try once more */
> > > if (!reply && !lost) {
> > > --
> > > 2.43.0
> > >
>