[Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri May 6 07:18:49 UTC 2022
On 05/05/2022 19:36, John Harrison wrote:
> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>> timeouts when GuC channel is backed up and it is unable to respond
>>>> in a timely fashion causing warnings and affecting CI.
>>>
>>> Is it the "Unable to force min freq" error? Do you have a link to the
>>> GitLab issue to add to commit message?
>> We don't have a specific error for this one, but have seen similar
>> issues with other H2G which are blocking.
>>>
>>>> This is seen when waitboosting happens during a stress test.
>>>> this patch updates the waitboost path to use a non-blocking
>>>> H2G call instead, which returns as soon as the message is
>>>> successfully transmitted.
>>>
>>> AFAIU with this approach, when CT channel is congested, you instead
>>> achieve silent dropping of the waitboost request, right?
>> We are hoping it makes it, but just not waiting for it to complete.
> We are not 'hoping it makes it'. We know for a fact that it will make
> it. We just don't know when. The issue is not about whether the
> waitboost request itself gets dropped/lost it is about the ack that
> comes back. The GuC will process the message and it will send an ack.
> It's just a question of whether the i915 driver has given up waiting for
> it yet. And if it has, then you get the initial 'timed out waiting for
> ack' followed by a later 'got unexpected ack' message.
>
> Whereas, if we make the call asynchronous, there is no ack. i915 doesn't
> bother waiting and it won't get surprised later.
>
> Also, note that this is only an issue when GuC itself is backed up.
> Normally that requires the creation/destruction of large numbers of
> contexts in rapid succession (context management is about the slowest
> thing we do with GuC). Some of the IGTs and selftests do that with
> thousands of contexts all at once. Those are generally where we see this
> kind of problem. It would be highly unlikely (but not impossible) to hit
> it in real world usage.
Goto ->
> The general design philosophy of H2G messages is that asynchronous mode
> should be used for everything if at all possible. It is fire and forget
> and will all get processed in the order sent (same as batch buffer
> execution, really). Synchronous messages should only be used when an
> ack/status is absolutely required. E.g. start of day initialisation or
> things like TLB invalidation where we need to know that a cache has been
> cleared/flushed before updating memory from the CPU.
>
> John.
>
>
>>>
>>> It sounds like a potentially important feedback from the field to
>>> lose so easily. How about you added drm_notice to the worker when it
>>> fails?
>>>
>>> Or simply a "one line patch" to replace i915_probe_error (!?) with
>>> drm_notice and keep the blocking behavior. (I have no idea what is
>>> the typical time to drain the CT buffer, and so to decide whether
>>> waiting or dropping makes more sense for effectiveness of waitboosting.)
>>>
>>> Or since the congestion /should not/ happen in production, then the
>>> argument is why complicate with more code, in which case going with
>>> one line patch is an easy way forward?
Here. Where I did hint I understood the "should not happen in production
angle".
So statement is GuC is congested in processing requests, but the h2g
buffer is not congested so no chance intel_guc_send_nb() will fail with
no space in that buffer? Sounds a bit un-intuitive.
Anyway, it sounds okay to me to use the non-blocking, but I would like
to see some logging if the unexpected does happen. Hence I was
suggesting the option of adding drm_notice logging if the send fails
from the worker. (Because I think other callers would already propagate
the error, like sysfs.)
err = slpc_force_min_freq(slpc, slpc->boost_freq);
if (!err)
slpc->num_boosts++;
else
drm_notice(... "Failed to send waitboost request (%d)", err);
Something like that.
Regards,
Tvrtko
>> Even if we soften the blow here, the actual timeout error occurs in
>> the intel_guc_ct.c code, so we cannot hide that error anyways. Making
>> this call non-blocking will achieve both things.
>>
>> Thanks,
>>
>> Vinay.
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38
>>>> ++++++++++++++++-----
>>>> 1 file changed, 30 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> index 1db833da42df..c852f73cf521 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc
>>>> *slpc)
>>>> return data->header.global_state;
>>>> }
>>>> +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8
>>>> id, u32 value)
>>>> +{
>>>> + u32 request[] = {
>>>> + GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>> + SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>> + id,
>>>> + value,
>>>> + };
>>>> + int ret;
>>>> +
>>>> + ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>> +
>>>> + return ret > 0 ? -EPROTO : ret;
>>>> +}
>>>> +
>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id,
>>>> u32 value)
>>>> +{
>>>> + struct intel_guc *guc = slpc_to_guc(slpc);
>>>> +
>>>> + GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>> +
>>>> + return guc_action_slpc_set_param_nb(guc, id, value);
>>>> +}
>>>> +
>>>> static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id,
>>>> u32 value)
>>>> {
>>>> u32 request[] = {
>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct
>>>> intel_guc_slpc *slpc, u32 freq)
>>>> */
>>>> with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>> - ret = slpc_set_param(slpc,
>>>> - SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>> - freq);
>>>> - if (ret)
>>>> - i915_probe_error(i915, "Unable to force min freq to %u:
>>>> %d",
>>>> - freq, ret);
>>>> + /* Non-blocking request will avoid stalls */
>>>> + ret = slpc_set_param_nb(slpc,
>>>> + SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>> + freq);
>>>> }
>>>> return ret;
>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct
>>>> *work)
>>>> */
>>>> mutex_lock(&slpc->lock);
>>>> if (atomic_read(&slpc->num_waiters)) {
>>>> - slpc_force_min_freq(slpc, slpc->boost_freq);
>>>> - slpc->num_boosts++;
>>>> + if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>> + slpc->num_boosts++;
>>>> }
>>>> mutex_unlock(&slpc->lock);
>>>> }
>
More information about the Intel-gfx
mailing list