[PATCH v3 2/2] drm/xe/guc: Scale mmio send/recv timeout for CCS save/restore with smem size
K V P, Satyanarayana
satyanarayana.k.v.p at intel.com
Thu Aug 7 06:22:57 UTC 2025
On 07-08-2025 01:27, John Harrison wrote:
> On 8/6/2025 9:28 AM, Matthew Brost wrote:
>> On Wed, Aug 06, 2025 at 01:59:10PM +0530, Satyanarayana K V P wrote:
>>> After VF migration, GuC restores CCS metadata scaled to system memory
>>> size. The default timeout (50ms) is calibrated for 4GB memory capacity
>>> per specification. Timeouts for other memory sizes are proportionally
>>> derived from this baseline.
>>>
>>> This ensures adequate restoration time for CCS metadata across
>>> different hardware configurations while maintaining spec compliance.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc.c | 33 ++++++++++++++++++++++++++++++++-
>>> 1 file changed, 32 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>> index 9e34401e4489..d836ded83491 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>> @@ -10,6 +10,7 @@
>>>  #include <generated/xe_wa_oob.h>
>>>  #include "abi/guc_actions_abi.h"
>>> +#include "abi/guc_actions_sriov_abi.h"
>>>  #include "abi/guc_errors_abi.h"
>>>  #include "regs/xe_gt_regs.h"
>>>  #include "regs/xe_gtt_defs.h"
>>> @@ -1397,6 +1398,36 @@ int xe_guc_auth_huc(struct xe_guc *guc, u32 rsa_addr)
>>>  	return xe_guc_ct_send_block(&guc->ct, action, ARRAY_SIZE(action));
>>>  }
>>> +/*
>>> + * After VF migration, GuC restores CCS metadata scaled to system memory
>>> + * size. Default timeout (50ms) is calibrated for 4GB memory capacity per
>>> + * specification. Timeouts for other memory sizes are proportionally
>>> + * derived from this baseline.
>>> + */
>>> +static u32 guc_mmio_send_recv_timeout(struct xe_guc *guc, const u32 *request)
>>> +{
>>> +	struct xe_device *xe = guc_to_xe(guc);
>>> +	u32 timeout = 50000;
>> Is this really the upper bound? It seems like it could be significantly
>> higher if multiple VFs are trying to do things all at the same time.
> That is really a problem with the wait function itself rather than the
> timeout. The timeout is meant to be the maximum expectation for how long
> the operation will take once started. Unfortunately, we currently have
> no checks on whether GuC has actually read the message itself before
> starting that timer.
>
> There is also the opposite concern - what happens to any other VF (or
> PF) that is trying to get work done while the GPU is tied up migrating
> this VF? A stall of multiple seconds will cause all sorts of timeouts to
> trip.
>
> I think the expectation is that migration is a deliberate act and the
> system is not going to be doing anything else at the time. It is not
> something that just randomly occurs in the middle of a heavily loaded
> system. But I may be wrong on that?
>
>>
>> Scaling the timeout itself does make sense, though.
>>
>> Matt
>>
>>> +	u32 action, factor;
>>> +	struct sysinfo si;
>>> +	u64 sys_mem_size;
>>> +
>>> +	action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, request[0]);
>>> +	if (action != GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE || IS_DGFX(xe) ||
>>> +	    !xe_device_has_flat_ccs(xe))
>>> +		return timeout;
>>> +
>>> +	si_meminfo(&si);
>>> +	sys_mem_size = si.totalram * si.mem_unit;
> Do we have to worry about Linux supporting >64-bit addressing any time
> soon? I assume the reason for having separate units here is that the
> total might be over 64 bits? Or are there no plans for 6-level page
> tables yet?
>
> John.
>
I do not think we need to worry about >64-bit for now. As per lwn.net, we
may get an OS supporting 128-bit addressing by 2035, which is a long way
off.
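For rough intuition (back-of-the-envelope figures only, not part of the
patch; the 64 TiB host is hypothetical):

	/*
	 * si.totalram is a page count in units of si.mem_unit bytes
	 * (mem_unit is normally PAGE_SIZE). Even a 64 TiB host gives:
	 *
	 *   totalram            = 2^46 / 2^12 = 2^34 pages
	 *   totalram * mem_unit = 2^34 * 2^12 = 2^46 bytes
	 *
	 * which leaves ~18 bits of headroom below U64_MAX (~16 EiB), so
	 * the u64 product cannot realistically overflow. The explicit
	 * cast below only guards 32-bit builds, where unsigned long is
	 * 32 bits wide.
	 */
	u64 sys_mem_size = (u64)si.totalram * si.mem_unit;
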
-Satya.

>>> +
>>> +	if (sys_mem_size <= SZ_4G)
>>> +		return timeout;
>>> +
>>> +	factor = (sys_mem_size + SZ_4G) / SZ_4G;
>>> +	timeout *= factor;
>>> +
>>> +	return timeout;
>>> +}
>>>  int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>  			  u32 len, u32 *response_buf)
>>>  {
>>> @@ -1439,7 +1470,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>  	ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
>>>  			     FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
>>> -			     50000, &reply, false);
>>> +			     guc_mmio_send_recv_timeout(guc, request), &reply, false);
>>>  	if (ret) {
>>>  		/* scratch registers might be cleared during FLR, try once more */
>>>  		if (!reply && !lost) {
>>> --
>>> 2.43.0
>>>
>
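For concreteness, plugging a few host memory sizes into the proposed
formula (illustrative figures only; factor = (sys_mem_size + SZ_4G) / SZ_4G,
baseline 50ms):

	sys_mem_size	factor	timeout
	<= 4 GiB	-	50 ms (baseline)
	16 GiB		5	250 ms
	64 GiB		17	850 ms
	512 GiB		129	~6.45 s

So on large hosts the scaled wait can indeed reach the multi-second range
that John mentions above.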