[PATCH v3 2/2] drm/xe/guc: Scale mmio send/recv timeout for CCS save/restore with smem size
K V P, Satyanarayana
satyanarayana.k.v.p at intel.com
Thu Aug 7 06:22:57 UTC 2025
On 07-08-2025 01:27, John Harrison wrote:
> On 8/6/2025 9:28 AM, Matthew Brost wrote:
>> On Wed, Aug 06, 2025 at 01:59:10PM +0530, Satyanarayana K V P wrote:
>>> After VF migration, GuC restores CCS metadata scaled to system memory
>>> size. The default timeout (50ms) is calibrated for 4GB memory capacity
>>> per specification. Timeouts for other memory sizes are proportionally
>>> derived from this baseline.
>>>
>>> This ensures adequate restoration time for CCS metadata across
>>> different hardware configurations while maintaining spec compliance.
>>>
>>> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p at intel.com>
>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc.c | 33 ++++++++++++++++++++++++++++++++-
>>> 1 file changed, 32 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c b/drivers/gpu/drm/xe/xe_guc.c
>>> index 9e34401e4489..d836ded83491 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>> @@ -10,6 +10,7 @@
>>>  #include <generated/xe_wa_oob.h>
>>>  #include "abi/guc_actions_abi.h"
>>> +#include "abi/guc_actions_sriov_abi.h"
>>>  #include "abi/guc_errors_abi.h"
>>>  #include "regs/xe_gt_regs.h"
>>>  #include "regs/xe_gtt_defs.h"
>>> @@ -1397,6 +1398,36 @@ int xe_guc_auth_huc(struct xe_guc *guc, u32 rsa_addr)
>>>  	return xe_guc_ct_send_block(&guc->ct, action, ARRAY_SIZE(action));
>>>  }
>>> +/*
>>> + * After VF migration, GuC restores CCS metadata scaled to system memory
>>> + * size. Default timeout (50ms) is calibrated for 4GB memory capacity per
>>> + * specification. Timeouts for other memory sizes are proportionally
>>> + * derived from this baseline.
>>> + */
>>> +static u32 guc_mmio_send_recv_timeout(struct xe_guc *guc, const u32 *request)
>>> +{
>>> +	struct xe_device *xe = guc_to_xe(guc);
>>> +	u32 timeout = 50000;
>> Is this really the upper bound? It seems like it could be significantly
>> higher if multiple VFs are trying to do things all at the same time.
> That is really a problem with the wait function itself rather than the
> timeout. The timeout is meant to be the maximum expectation for how long
> the operation will take once started. Unfortunately, we currently have
> no checks on whether GuC has actually read the message itself before
> starting that timer.
>
> There is also the opposite concern - what happens to any other VF (or
> PF) that is trying to get work done while the GPU is tied up migrating
> this VF? A stall of multiple seconds will cause all sorts of timeouts to
> trip.
>
> I think the expectation is that migration is a deliberate act and the
> system is not going to be doing anything else at the time. It is not
> something that just randomly occurs in the middle of a heavily loaded
> system. But I may be wrong on that?
>
>>
>> Scaling the timeout itself does make sense, though.
>>
>> Matt
>>
>>> +	u32 action, factor;
>>> +	struct sysinfo si;
>>> +	u64 sys_mem_size;
>>> +
>>> +	action = FIELD_GET(GUC_HXG_REQUEST_MSG_0_ACTION, request[0]);
>>> +	if (action != GUC_ACTION_VF2GUC_NOTIFY_RESFIX_DONE || IS_DGFX(xe) ||
>>> +	    !xe_device_has_flat_ccs(xe))
>>> +		return timeout;
>>> +
>>> +	si_meminfo(&si);
>>> +	sys_mem_size = si.totalram * si.mem_unit;
> Do we have to worry about Linux supporting >64-bit addressing any time
> soon? I assume the reason for having separate units here is that the
> total might be over 64 bits? Or are there no plans for 6-level page
> tables yet?
>
> John.
>
I do not think we need to worry about >64-bit for now. As per lwn.net, we
may get an OS supporting 128-bit addressing by 2035, which is a long way
off.
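For rough intuition (back-of-the-envelope figures only, not part of the
patch; the 64 TiB host is hypothetical):

	/*
	 * si.totalram is a page count in units of si.mem_unit bytes
	 * (mem_unit is normally PAGE_SIZE). Even a 64 TiB host gives:
	 *
	 *   totalram            = 2^46 / 2^12 = 2^34 pages
	 *   totalram * mem_unit = 2^34 * 2^12 = 2^46 bytes
	 *
	 * which leaves ~18 bits of headroom below U64_MAX (~16 EiB), so
	 * the u64 product cannot realistically overflow. The explicit
	 * cast below only guards 32-bit builds, where unsigned long is
	 * 32 bits wide.
	 */
	u64 sys_mem_size = (u64)si.totalram * si.mem_unit;
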
-Satya.

>>> +
>>> +	if (sys_mem_size <= SZ_4G)
>>> +		return timeout;
>>> +
>>> +	factor = (sys_mem_size + SZ_4G) / SZ_4G;
>>> +	timeout *= factor;
>>> +
>>> +	return timeout;
>>> +}
>>>  int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>  			  u32 len, u32 *response_buf)
>>>  {
>>> @@ -1439,7 +1470,7 @@ int xe_guc_mmio_send_recv(struct xe_guc *guc, const u32 *request,
>>>  	ret = xe_mmio_wait32(mmio, reply_reg, GUC_HXG_MSG_0_ORIGIN,
>>>  			     FIELD_PREP(GUC_HXG_MSG_0_ORIGIN, GUC_HXG_ORIGIN_GUC),
>>> -			     50000, &reply, false);
>>> +			     guc_mmio_send_recv_timeout(guc, request), &reply, false);
>>>  	if (ret) {
>>>  		/* scratch registers might be cleared during FLR, try once more */
>>>  		if (!reply && !lost) {
>>> --
>>> 2.43.0
>>>
>
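For concreteness, plugging a few host memory sizes into the proposed
formula (illustrative figures only; factor = (sys_mem_size + SZ_4G) / SZ_4G,
baseline 50ms):

	sys_mem_size	factor	timeout
	<= 4 GiB	-	50 ms (baseline)
	16 GiB		5	250 ms
	64 GiB		17	850 ms
	512 GiB		129	~6.45 s

So on large hosts the scaled wait can indeed reach the multi-second range
that John mentions above.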