[PATCH] drm/i915/pcode: Wait 10 seconds for pcode to settle

Mon Jan 30 08:48:31 UTC 2023

Hi GG,

thanks for the deep analysis!

> Hi Andi,
> In the call flow invoked by intel_pcode_init(), I've added brief comments
> where further clarification is needed in this scenario, and a description of
> the suspicious scenario at the bottom.
> 
> -------------------------------------------------------------------------
> intel_pcode_init()
>  |
>  +-> skl_pcode_request(uncore, DG1_PCODE_STATUS,
>                        DG1_UNCORE_GET_INIT_STATUS,
>                        DG1_UNCORE_INIT_STATUS_COMPLETE,
>                        DG1_UNCORE_INIT_STATUS_COMPLETE, 180000);
>        |
>        +-> skl_pcode_try_request()
>              |
>              +->  *status = __snb_pcode_rw(uncore, mbox, &request, NULL,
>                                            500, 0, true);
> 
> -------------------------------------------------------------------------
> static int __snb_pcode_rw(struct intel_uncore *uncore, u32 mbox,
> 		          u32 *val, u32 *val1,
> 			  int fast_timeout_us, int slow_timeout_ms,
> 			  bool is_read)
> {
> ...
>         /* Before writing a value to the GEN6_PCODE_DATA register,
>            check if the bit in the GEN6_PCODE_MAILBOX register indicates
>            BUSY. */
> 	if (intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) & GEN6_PCODE_READY)
> 		return -EAGAIN;

what if we fail here because the punit is still initializing and
will be ready, say, in 10 seconds?

GG, without going any further, we fail here! The -EAGAIN we
receive from the test comes from this point. We don't fail with
-ETIMEDOUT, but with -EAGAIN and the reason is because the punit
is not ready to perform the very fist communication and we fail
the probing.

It doesn't mean, though, that there is anything wrong, we just
need to wait a bit before "taking drastic decisions"!

This patch is suggesting to wait up to 10s for the punit to be
ready and eventually try to probe again... and, indeed, it works!

Andi

> 
>         /* write value to GEN6_PCODE_DATA register */
> 	intel_uncore_write_fw(uncore, GEN6_PCODE_DATA, *val);
> 
> 	intel_uncore_write_fw(uncore, GEN6_PCODE_DATA1, val1 ? *val1 : 0);
> 
>         /* In this scenario, the value
>            "DG1_PCODE_STATUS | GEN6_PCODE_READY"
>            is written to the GEN6_PCODE_MAILBOX register,
>            so that the Busy status of the GEN6_PCODE_MAILBOX register
>            can be checked later.
>            (When the value of the GEN6_PCODE_READY bit of the
>             GEN6_PCODE_MAILBOX register changes to 0, the operation can
>             be considered completed.) */
> 	intel_uncore_write_fw(uncore,
> 			      GEN6_PCODE_MAILBOX, GEN6_PCODE_READY | mbox);
> 
>         /* In this scenario, verify that the BUSY status bit in the
>            GEN6_PCODE_MAILBOX register turns off for up to 500us. */
> 	if (__intel_wait_for_register_fw(uncore,
> 					 GEN6_PCODE_MAILBOX,
> 					 GEN6_PCODE_READY, 0,
> 					 fast_timeout_us,
> 					 slow_timeout_ms,
> 					 &mbox))
> 		return -ETIMEDOUT;
>         /* If there is a failure here, it may be considered that the
>            "DG1_PCODE_STATUS | GEN6_PCODE_READY" operation was not
>            completed within 500us */
> ...
> }
> 
> int skl_pcode_request(struct intel_uncore *uncore, u32 mbox, u32 request,
> 		      u32 reply_mask, u32 reply, int timeout_base_ms)
> {
> 	u32 status;
> 	int ret;
> 
> 	mutex_lock(&uncore->i915->sb_lock);
> 
> #define COND \
> 	skl_pcode_try_request(uncore, mbox, request, reply_mask, reply, &status)
> 
>         /* the first trial for skl_pcode_try_request() can return
>            -EAGAIN or -ETIMEDOUT. And the code did not check the error
>            code here, so we don't know how far the __snb_pcode_rw()
>            function went. It is not known whether the pcode_mailbox
>            status was busy before writing the value to the
>            GEN6_PCODE_DATA register or after.*/
> 	if (COND) {
> 		ret = 0;
> 		goto out;
> 	}
> 
>         /* In this scenario, skl_pcode_try_request() is invoked every
>            10us for 180 seconds. When skl_pcode_try_request() returns
>            -EAGAIN and -ETIMEDOUT by _wait_for(),
>            -ETIMEDOUT is returned to a variable ret. */
> 
> 	ret = _wait_for(COND, timeout_base_ms * 1000, 10, 10);
> 
> 	if (!ret)
> 		goto out;
> 
> 	/*
> 	 * The above can time out if the number of requests was low (2 in the
> 	 * worst case) _and_ PCODE was busy for some reason even after a
> 	 * (queued) request and @timeout_base_ms delay. As a workaround retry
> 	 * the poll with preemption disabled to maximize the number of
> 	 * requests. Increase the timeout from @timeout_base_ms to 50ms to
> 	 * account for interrupts that could reduce the number of these
> 	 * requests, and for any quirks of the PCODE firmware that delays
> 	 * the request completion.
> 	 */
> 	drm_dbg_kms(&uncore->i915->drm,
> 		    "PCODE timeout, retrying with preemption disabled\n");
> 	drm_WARN_ON_ONCE(&uncore->i915->drm, timeout_base_ms > 3);
> 	preempt_disable();
> 	ret = wait_for_atomic(COND, 50);
> 
> 	preempt_enable();
> 
> out:
> 	mutex_unlock(&uncore->i915->sb_lock);
> 	return status ? status : ret;
> #undef COND
> }
> 
> ---------------------------------------------------------
> 
> If you try skl_pcode_try_request() for 180 seconds in skl_pcode_request(),
> and the first "intel_uncore_read_fw(uncore, GEN6_PCODE_MAILBOX) &
> GEN6_PCODE_READY)" call in __snb_pcode_rw() that skl_pcode_try_request()
> invokes always fails. if then it does not make sense to me why this patch
> fixes it by just waiting 10 seconds.This is because if it was called with
> the flow described above, 180 seconds is longer than 10 seconds, so the
> scenario you mentioned is also covered in the existing code.
> 
> To describe in more detail the second scenario I previously commented on:
> skl_pcode_request() tries skl_pcode_try_request() for 180 seconds
>  1) In skl_pcode_try_request(), the GEN6_PCODE_MAILBOX register bit is not
> BUSY, so the value is written to the GEN6_PCODE_DATA1 register.
>  2) skl_pcode_try_request() returns -ETIMEDOUT value because the operation
> of 1) does not complete within 500us.
>  3) Scenario in which the GEN6_PCODE_MAILBOX register bit is checked as BUSY
> and returns -EAGAIN in the last call of skl_pcode_try_request() invoked by
> skl_pcode_request()
> 
> If the reason why this problem occurred is because of this scenario,
> shouldn't there be an attempt to increase fast_timeout_us used as an
> argument of __snb_pcode_rw() to 500us or more when skl_pcode_try_request()
> returns -ETIMEDOUT?
> 
> Br,
> G.G.