[Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost

Tue Jun 7 23:15:19 UTC 2022

On 6/7/2022 15:29, Dixit, Ashutosh wrote:
> On Sat, 14 May 2022 23:05:06 -0700, Vinay Belgaumkar wrote:
>> SLPC min/max frequency updates require H2G calls. We are seeing
>> timeouts when GuC channel is backed up and it is unable to respond
>> in a timely fashion causing warnings and affecting CI.
>>
>> This is seen when waitboosting happens during a stress test.
>> this patch updates the waitboost path to use a non-blocking
>> H2G call instead, which returns as soon as the message is
>> successfully transmitted.
> Overall I think this patch is trying to paper over problems in the blocking
> H2G CT interface (specifically the 1 second timeout in
> wait_for_ct_request_update()). So I think we should address that problem in
> the interface directly rather than having each client (SLPC and any future
> client) work around the problem. Following points:
>
> 1. This patch seems to assume that it is 'ok' to ignore the return code
>     from FW for a waitboost request (arguing waitboost is best effort so
>     it's ok to 'fire and forget'). But the return code is still useful
>     e.g. in cases where we see performance issues and want to go back and
>     investigate if FW rejected any waitboost requests.
You still get errors reported in the GuC log. Indeed, some errors (or at 
least error reasons) are only visible in the log not in the return code.

>
> 2. We are already seeing that a 1 second timeout is not sufficient. So why
>     not simply increase that timeout?
>
> 3. In fact if we are saying that the CT interface is a "reliable" interface
>     (implying no message loss), to ensure reliability that timeout should
>     not simply be increased, it should be made "infinite" (in quotes).
>
> 4. Maybe it would have been best to not have a "blocking" H2G interface at
>     all (with the wait in wait_for_ct_request_update()). Just have an
>     asynchronous interface (which mirrors the actual interface between FW
>     and i915) in which clients register callbacks which are invoked when FW
>     responds. If this is too big a change we can probably continue with the
>     current blocking interface after increasing the timeout as mentioned
>     above.
>
> 5. Finally, the waitboost request is just the most likely to get stuck at
>     the back of a full CT queue since it happens during normal
>     operation. Actually any request, say one initiated from sysfs, can also
>     get similarly stuck at the back of a full queue. So any solution should
>     also address that situation (where the return code is needed and
>     similarly for a future client of the "blocking" (REQUEST/RESPONSE)
>     interface).
The blocking interface is only intended for init time operations, not 
runtime. Stuff where the operation is meant to be synchronous and the 
KMD should not proceed until it has an ack back from the GuC that the 
update has taken place. All runtime operations are expected to be 
asynchronous. If a response is required, then it should be sent via an 
async callback. E.g. context de-registration is a 'fire and forget' H2G 
call but gets a 'deregistration complete' G2H notification when it is 
safe for the KMD to free up the associated storage.

There is an 'errors only' H2G mechanism. That will not send an ack back 
in the case of a successful H2G but will send back an error notification 
in the case of a failure. All async H2Gs should really be using that 
mechanism. I think Michal W did post a patch for it and I was meant to 
be reviewing it but it dropped of my radar due to other higher priorities.

John.

>
> Thanks.
> --
> Ashutosh