[Intel-gfx] [PATCH] drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS

Thu Jun 29 22:39:15 UTC 2023

On Tue, 2023-06-20 at 09:30 -0500, Balasubrawmanian, Vivaik wrote:
> On 6/1/2023 12:45 PM, Alan Previn wrote:
> > After recent discussions with Mesa folks, it was requested
> > that we optimize i915's GET_PARAM for the PXP_STATUS without
> > changing the UAPI spec.
> > 
> > This patch adds these additional optimizations:
> >     - If any PXP initialization flow failed, then ensure that
> >       we catch it so that we can change the returned PXP_STATUS
> >       from "2" (i.e. 'PXP is supported but not yet ready')
> >       to "-ENODEV". This typically should not happen and if it
> >       does, we have a platform configuration issue.
> >     - If a PXP arbitration session creation event failed
> >       due to incorrect firmware version or blocking SOC fusing
> >       or blocking BIOS configuration (platform reasons that won't
> >       change if we retry), then reflect that blockage by also
> >       returning -ENODEV in the GET_PARAM-PXP_STATUS.
> >     - GET_PARAM:PXP_STATUS should not wait at all if PXP is
> >       supported but non-i915 dependencies (component-driver /
> >       firmware) are still pending completion of their init flows.
> >       In this case, just return "2" immediately (i.e. 'PXP is
> >       supported but not yet ready').
> > 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis at intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c  | 11 +++++++++-
> >   drivers/gpu/drm/i915/i915_getparam.c       |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp.c       | 25 ++++++++++++++++++----
> >   drivers/gpu/drm/i915/pxp/intel_pxp.h       |  2 +-
> >   drivers/gpu/drm/i915/pxp/intel_pxp_gsccs.c |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_tee.c   |  7 +++---
> >   drivers/gpu/drm/i915/pxp/intel_pxp_types.h |  9 ++++++++
> >   7 files changed, 50 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > index fb0984f875f9..4dd744c96a37 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_uc.c
> > @@ -42,8 +42,17 @@ static void gsc_work(struct work_struct *work)
> >   		}
> >   
> >   		ret = intel_gsc_proxy_request_handler(gsc);
> > -		if (ret)
> > +		if (ret) {
> > +			if (actions & GSC_ACTION_FW_LOAD) {
> > +				/*
> > +				 * a proxy request failure that came together with the
> > +				 * firmware load action means the last part of init has
> > +				 * failed so GSC fw won't be usable after this
> > +				 */
> > +				intel_uc_fw_change_status(&gsc->fw, INTEL_UC_FIRMWARE_LOAD_FAIL);
> > +			}
> >   			goto out_put;
> > +		}
> >   
> >   		/* mark the GSC FW init as done the first time we run this */
> >   		if (actions & GSC_ACTION_FW_LOAD) {
> 
> On the huc authentication comment block above this snippet, the last 
> statement: "Note that we can only do the GSC auth if the GuC auth was" 
> is confusing as the code below is only dealing with HuC Authentication.
alan: i believe what he meant was "can only do the GSC-based auth if
the GuC-based auth"... but I can't change that code as part
of this patch - I believe the rule for kernel patches is to ensure each
single patch is modular (not mixing unrelated changes) and focuses just
on what it's described to do. IIRC, we would need to create a separate
patch review for that change.

> 
> This function seems to have a section to deal with FW load action and 
> another to deal with SW Proxy requests, but we seem to be mixing both 
> actions in the SW proxy section. instead, can we move this call to 
> intel_gsc_proxy_request_handler to the FW load section itself instead of 
> handling it as an additional check in the SW_proxy section? In the same 
> vein, we should also move the intel_uc_fw_change_status() call into the 
> above FW Load action section. i think that way the code reads better.
alan: GSC_ACTION_FW_LOAD is used for loading the GSC firmware which is a
one-time thing per i915 load. However, GSC_ACTION_SW_PROXY events can happen
any time the GSC fw needs to communicate with CSE firmware (or vice versa)
due to platform events that may not have been triggered by i915, long after
init. However, the rule is that after the GSC FW is loaded, i915 is required
to do a 1-time proxy-init step to prime both GSC and CSE fws that proxy
comms is avail. Without this step, we can't use the gsc-fw for other ops.

So to recap the rules:
1. we launch the worker to do the one-time GSC firmware load.
2. after the GSC firmware load is successful, we have to do a one-time SW-proxy init.
    -> this is why we add the GSC_ACTION_SW_PROXY flag on successful load completion.
3. If we are doing proxy-handling for the very first time, we ensure
   -> FW status is only set to RUNNING if proxy init was good (since GSC FW can't be
      accessed to do anything (such as hdcp, pxp, etc) without proxy init completion).
   -> print a message to signal if proxy init failed.
This is the only reason why we have the additional "if (actions & GSC_ACTION_FW_LOAD)"
check inside the SW Proxy block - we are not mixing fw loading steps at all, but just
distinguishing between the first-ever-sw-proxy vs the regular runtime sw-proxy where the
latter can fail gracefully without blocking future GSC-fw operation or future sw-proxy handling.
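
To illustrate, the recap above can be modelled as a small standalone C sketch.
This is not the i915 worker itself: struct gsc_model, gsc_work_model(), the
ACTION_* flags and the FW_* states are hypothetical stand-ins, and the load and
proxy results are simulated booleans rather than real firmware calls.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the worker's action flags and fw states. */
enum { ACTION_FW_LOAD = 1 << 0, ACTION_SW_PROXY = 1 << 1 };
enum fw_state { FW_LOADABLE, FW_RUNNING, FW_LOAD_FAIL };

struct gsc_model {
	enum fw_state state;
	bool load_ok;	/* simulated result of the one-time fw load */
	bool proxy_ok;	/* simulated result of the proxy request handler */
};

static void gsc_work_model(struct gsc_model *gsc, unsigned int actions)
{
	if (actions & ACTION_FW_LOAD) {
		if (!gsc->load_ok) {
			/* assumption of this model: a failed load is terminal */
			gsc->state = FW_LOAD_FAIL;
			return;
		}
		/* rule 2: a good load always queues the one-time proxy init */
		actions |= ACTION_SW_PROXY;
	}

	if (actions & ACTION_SW_PROXY) {
		if (!gsc->proxy_ok) {
			/* rule 3: only the first-ever proxy failure is fatal */
			if (actions & ACTION_FW_LOAD)
				gsc->state = FW_LOAD_FAIL;
			return;
		}
		/* mark the fw as usable only after the first proxy init */
		if (actions & ACTION_FW_LOAD)
			gsc->state = FW_RUNNING;
	}
}
```

In this model a runtime ACTION_SW_PROXY failure leaves the state untouched,
while a failure that arrives together with ACTION_FW_LOAD ends in FW_LOAD_FAIL.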

That said, to ensure we can properly honor all 3 steps above, we can either call
intel_gsc_proxy_request_handler twice (once right after fw-load and every other time
runtime proxy events occur) ... or ... we can set GSC_ACTION_SW_PROXY twice...
basically it's the same thing - so it won't make much difference.

More importantly, this patch is not changing how and where we call intel_gsc_proxy_request_handler
but only optimizing how we handle the GET_PARAM:PXP_STATUS. In this specific code
block we are reviewing, the only change being done is to ensure that we treat the
GSC FW status as failed if the first-time-proxy-init step fails. So once again, if
we want to change how we call intel_gsc_proxy_request_handler, that would have to be another
patch, but in light of the above recap of the rules this worker is attempting to honor,
i don't agree that we need to change when we call intel_gsc_proxy_request_handler.

...alan

> > diff --git a/drivers/gpu/drm/i915/i915_getparam.c b/drivers/gpu/drm/i915/i915_getparam.c
> > index 6f11d7eaa91a..1b2ee98a158a 100644
> > --- a/drivers/gpu/drm/i915/i915_getparam.c
> > +++ b/drivers/gpu/drm/i915/i915_getparam.c
> > @@ -105,7 +105,7 @@ int i915_getparam_ioctl(struct drm_device *dev, void *data,
> >   			return value;
> >   		break;
> >   	case I915_PARAM_PXP_STATUS:
> > -		value = intel_pxp_get_readiness_status(i915->pxp);
> > +		value = intel_pxp_get_readiness_status(i915->pxp, 1);
> >   		if (value < 0)
> >   			return value;
> >   		break;
> > diff --git a/drivers/gpu/drm/i915/pxp/intel_pxp.c b/drivers/gpu/drm/i915/pxp/intel_pxp.c
> > index bb2e15329f34..1478bb9b4e26 100644
> > --- a/drivers/gpu/drm/i915/pxp/intel_pxp.c
> > +++ b/drivers/gpu/drm/i915/pxp/intel_pxp.c
> > @@ -359,21 +359,38 @@ void intel_pxp_end(struct intel_pxp *pxp)
> >   	intel_runtime_pm_put(&i915->runtime_pm, wakeref);
> >   }
> >   
> > +static bool pxp_required_fw_failed(struct intel_pxp *pxp)
> > +{
> > +	if (__intel_uc_fw_status(&pxp->ctrl_gt->uc.huc.fw) == INTEL_UC_FIRMWARE_LOAD_FAIL)
> > +		return true;
> > +	if (HAS_ENGINE(pxp->ctrl_gt, GSC0) &&
> > +	    __intel_uc_fw_status(&pxp->ctrl_gt->uc.gsc.fw) == INTEL_UC_FIRMWARE_LOAD_FAIL)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> >   /*
> >    * this helper is used by both intel_pxp_start and by
> >    * the GET_PARAM IOCTL that user space calls. Thus, the
> >    * return values here should match the UAPI spec.
> >    */
> > -int intel_pxp_get_readiness_status(struct intel_pxp *pxp)
> > +int intel_pxp_get_readiness_status(struct intel_pxp *pxp, int timeout)
> >   {
> >   	if (!intel_pxp_is_enabled(pxp))
> >   		return -ENODEV;
> >   
> > +	if (pxp_required_fw_failed(pxp))
> > +		return -ENODEV;
> > +
> > +	if (pxp->platform_cfg_is_bad)
> > +		return -ENODEV;
> > +
> >   	if (HAS_ENGINE(pxp->ctrl_gt, GSC0)) {
> > -		if (wait_for(intel_pxp_gsccs_is_ready_for_sessions(pxp), 250))
> > +		if (wait_for(intel_pxp_gsccs_is_ready_for_sessions(pxp), timeout))
> >   			return 2;
> >   	} else {
> > -		if (wait_for(pxp_component_bound(pxp), 250))
> > +		if (wait_for(pxp_component_bound(pxp), timeout))
> >   			return 2;
> >   	}
> >   	return 1;
> > @@ -387,7 +404,7 @@ int intel_pxp_start(struct intel_pxp *pxp)
> >   {
> >   	int ret = 0;
> >   
> > -	ret = intel_pxp_get_readiness_status(pxp);
> > +	ret = intel_pxp_get_readiness_status(pxp, 250);
> >   	if (ret < 0)
> >   		return ret;
> >   	else if (ret > 1)
> 
> In intel_pxp_start(), shouldn't the 250ms be defined in the struct as a 
> define with a comment that explains why it is 250 vs some other number? 
alan: the value 250 is a carry forward from the previous ADL implementation but
this number is not being used for fw interaction but only to check for
readiness. That said, there is no other location where we use this value for this
purpose. Thus, i can add a #define, but it would only be used in the same function
call and no other. I can add that if u insist.

Side note: When intel_pxp_start is called as part of the GEM_CONTEXT_CREATE call,
the i915 UAPI already specs that I915_CONTEXT_PARAM_PROTECTED_CONTENT can fail with
-ENXIO when dependencies are not ready and user space can retry, so the 250 was
chosen based on historical ADL reviews but won't mean anything since the
user space will have to retry anyway.

> Also in the i915_getparam_ioctl, shouldn't the timeout value be 0 
> instead of 1 as this is a simple status check?
alan: yes, you are right .. I'll fix that -> get_param shouldn't wait
for any timeout.
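
As a rough model of the intended behavior once get_param passes a zero
timeout: the sketch below is a simplified user-space stand-in, not the driver
code. struct pxp_model, readiness_status_model() and the
PXP_READINESS_TIMEOUT_MS name are hypothetical, and waiting is simulated by
comparing against the time at which the dependencies would become ready.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical name; 250 is the value carried over from ADL. */
#define PXP_READINESS_TIMEOUT_MS 250

struct pxp_model {
	bool enabled;
	bool fw_failed;		/* required fw ended up in a load-fail state */
	bool cfg_is_bad;	/* platform config blocks PXP permanently */
	int ready_after_ms;	/* simulated time until dependencies are ready */
};

static int readiness_status_model(const struct pxp_model *pxp, int timeout_ms)
{
	/* permanent failures are reported without any waiting */
	if (!pxp->enabled || pxp->fw_failed || pxp->cfg_is_bad)
		return -ENODEV;

	/* readiness is evaluated at least once, even with a zero timeout */
	if (pxp->ready_after_ms <= timeout_ms)
		return 1;	/* supported and ready */

	return 2;		/* supported but not yet ready */
}
```

With a zero timeout the get_param path gets "2" immediately for a device that
is still initializing, while a caller passing PXP_READINESS_TIMEOUT_MS can
still see "1" if readiness arrives within that window.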

> Also, the return value of 2 if the timeout expires seems 
> counter-intuitive. I think EBUSY will be more appropriate especially 
> since the IOCTL call seems to be a quick status check.

The IOCTL UAPI behavior was spec'd in the past with the UMD folks
and was mirroring the only other GET_PARAM type that has a runtime
status change, where a negative value means no support and positive values
mean support is available. But we use different positive values to represent
different stages of support readiness where 1 is fully ready and 2,3,4...
are for not yet ready for reason b, c, d... (this model scales for future hw/fw/sw
readiness states that don't exist yet without breaking backwards compatibility
of the UAPI spec). Ofc, today we only use '2' for adl/mtl-pxp.

That said, we can't change the UAPI spec now else we'd break backward compatibility
with existing SW since the UMD has already implemented the change to follow this spec's
meaning of negative vs 1 vs 2.
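
For illustration, a user-space caller could interpret the result along these
lines. classify_pxp_status() and enum pxp_support are hypothetical names for
this sketch and are not taken from any UMD or UAPI header. Treating a
successful call that returns a value below 1 as unsupported is an assumption,
since the spec discussed here only gives meaning to negative errors and
positive values.

```c
#include <errno.h>

enum pxp_support { PXP_UNSUPPORTED, PXP_READY, PXP_NOT_READY_YET };

/*
 * err:   0 if the GET_PARAM call succeeded, else a negative errno
 * value: the parameter value returned on success
 */
static enum pxp_support classify_pxp_status(int err, int value)
{
	if (err < 0)
		return PXP_UNSUPPORTED;		/* e.g. -ENODEV */

	if (value == 1)
		return PXP_READY;

	/* 2 today; 3, 4, ... left open for future not-ready reasons */
	if (value > 1)
		return PXP_NOT_READY_YET;

	/* assumption: anything else is treated as no support */
	return PXP_UNSUPPORTED;
}
```

Because every value above 1 maps to "not ready yet", a caller written this way
keeps working if more readiness states are added later.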

alan:snip