[Intel-gfx] [PATCH 00/15] HuC loading for DG2

Wed Jul 6 19:29:14 UTC 2022

On 7/6/2022 10:26 AM, Ye, Tony wrote:
>
> On 7/5/2022 4:30 PM, Ceraolo Spurio, Daniele wrote:
>>
>>
>> On 6/15/2022 7:28 PM, Zhang, Carl wrote:
>>>> On From: Ye, Tony <tony.ye at intel.com>
>>>> Sent: Thursday, June 16, 2022 12:15 AM
>>>>
>>>>
>>>> On 6/15/2022 3:13 AM, Tvrtko Ursulin wrote:
>>>>> On 15/06/2022 00:15, Ye, Tony wrote:
>>>>>> On 6/14/2022 8:30 AM, Ceraolo Spurio, Daniele wrote:
>>>>>>> On 6/14/2022 12:44 AM, Tvrtko Ursulin wrote:
>>>>>>>> On 13/06/2022 19:13, Ceraolo Spurio, Daniele wrote:
>>>>>>>>> On 6/13/2022 10:39 AM, Tvrtko Ursulin wrote:
>>>>>>>>>> On 13/06/2022 18:06, Ceraolo Spurio, Daniele wrote:
>>>>>>>>>>> On 6/13/2022 9:56 AM, Tvrtko Ursulin wrote:
>>>>>>>>>>>> On 13/06/2022 17:41, Ceraolo Spurio, Daniele wrote:
>>>>>>>>>>>>> On 6/13/2022 9:31 AM, Tvrtko Ursulin wrote:
>>>>>>>>>>>>>> On 13/06/2022 16:39, Ceraolo Spurio, Daniele wrote:
>>>>>>>>>>>>>>> On 6/13/2022 1:16 AM, Tvrtko Ursulin wrote:
>>>>>>>>>>>>>>>> On 10/06/2022 00:19, Daniele Ceraolo Spurio wrote:
>>>>>>>>>>>>>>>>> On DG2, HuC loading is performed by the GSC, via a PXP
>>>>>>>>>>>>>>>>> command. The load operation itself is relatively simple
>>>>>>>>>>>>>>>>> (just send a message to the GSC with the physical address
>>>>>>>>>>>>>>>>> of the HuC in LMEM), but there are timing changes that
>>>>>>>>>>>>>>>>> requires special attention. In particular, to send a PXP
>>>>>>>>>>>>>>>>> command we need to first export the GSC driver and then
>>>>>>>>>>>>>>>>> wait for the mei-gsc and mei-pxp modules to start, which
>>>>>>>>>>>>>>>>> means that HuC load will complete after i915 load is
>>>>>>>>>>>>>>>>> complete. This means that there is a small window of time
>>>>>>>>>>>>>>>>> after i915 is registered and before HuC is loaded during
>>>>>>>>>>>>>>>>> which userspace could submit and/or checking the HuC load
>>>>>>>>>>>>>>>>> status, although this is quite unlikely to happen (HuC is
>>>>>>>>>>>>>>>>> usually loaded before kernel init/resume completes).
>>>>>>>>>>>>>>>>> We've consulted with the media team in regards to how to
>>>>>>>>>>>>>>>>> handle this and they've asked us to do the following:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) Report HuC as loaded in the getparam IOCTL even if 
>>>>>>>>>>>>>>>>> load
>>>>>>>>>>>>>>>>> is still in progress. The media driver uses the IOCTL 
>>>>>>>>>>>>>>>>> as a
>>>>>>>>>>>>>>>>> way to check if HuC is enabled and then includes a
>>>>>>>>>>>>>>>>> secondary check in the batches to get the actual status,
>>>>>>>>>>>>>>>>> so doing it this way allows userspace to keep working
>>>>>>>>>>>>>>>>> without changes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2) Stall all userspace VCS submission until HuC is 
>>>>>>>>>>>>>>>>> loaded.
>>>>>>>>>>>>>>>>> Stalls are
>>>>>>>>>>>>>>>>> expected to be very rare (if any), due to the fact that
>>>>>>>>>>>>>>>>> HuC is usually loaded before kernel init/resume is
>>>>>>>>>>>>>>>>> completed.
>>>>>>>>>>>>>>>> Motivation to add these complications into i915 are not
>>>>>>>>>>>>>>>> clear to me here. I mean there is no HuC on DG2 _yet_ is
>>>>>>>>>>>>>>>> the premise of the series, right? So no backwards
>>>>>>>>>>>>>>>> compatibility concerns. In this case why jump through the
>>>>>>>>>>>>>>>> hoops and not let userspace handle all of this by just
>>>>>>>>>>>>>>>> leaving the getparam return the true status?
>>>>>>>>>>>>>>> The main areas impacted by the fact that we can't guarantee
>>>>>>>>>>>>>>> that HuC load is complete when i915 starts accepting
>>>>>>>>>>>>>>> submissions are boot and suspend/resume, with the latter
>>>>>>>>>>>>>>> being the main problem; GT reset is not a concern because
>>>>>>>>>>>>>>> HuC now survives it. A suspend/resume can be transparent to
>>>>>>>>>>>>>>> userspace and therefore the HuC status can temporarily flip
>>>>>>>>>>>>>>> from loaded to not without userspace knowledge, especially
>>>>>>>>>>>>>>> if we start going into deeper suspend states and start
>>>>>>>>>>>>>>> causing HuC resets when we go into runtime suspend. Note
>>>>>>>>>>>>>>> that this is different from what happens during GT reset 
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> older platforms, because in that scenario we guarantee that
>>>>>>>>>>>>>>> HuC reload is complete before we restart the submission
>>>>>>>>>>>>>>> back-end, so userspace doesn't notice that the HuC status
>>>>>>>>>>>>>>> change. We had an internal discussion about this problem
>>>>>>>>>>>>>>> with both media and i915 archs and the conclusion was that
>>>>>>>>>>>>>>> the best option is for i915 to stall media submission while
>>>>>>>>>>>>>>> HuC (re-)load is in progress.
>>>>>>>>>>>>>> Resume is potentialy a good reason - I did not pick up on
>>>>>>>>>>>>>> that from the cover letter. I read the statement about the
>>>>>>>>>>>>>> unlikely and small window where HuC is not loaded during
>>>>>>>>>>>>>> kernel init/resume and I guess did not pick up on the resume
>>>>>>>>>>>>>> part.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Waiting for GSC to load HuC from i915 resume is not an 
>>>>>>>>>>>>>> option?
>>>>>>>>>>>>> GSC is an aux device exported by i915, so AFAIU GSC resume
>>>>>>>>>>>>> can't start until i915 resume completes.
>>>>>>>>>>>> I'll dig into this in the next few days since I want to
>>>>>>>>>>>> understand how exactly it works. Or someone can help explain.
>>>>>>>>>>>>
>>>>>>>>>>>> If in the end conclusion will be that i915 resume indeed 
>>>>>>>>>>>> cannot
>>>>>>>>>>>> wait for GSC, then I think auto-blocking of queued up contexts
>>>>>>>>>>>> on media engines indeed sounds unavoidable. Otherwise, as you
>>>>>>>>>>>> explained, user experience post resume wouldn't be good.
>>>>>>>>>>> Even if we could implement a wait, I'm not sure we should. GSC
>>>>>>>>>>> resume and HuC reload takes ~300ms in most cases, I don't think
>>>>>>>>>>> we want to block within the i915 resume path for that long.
>>>>>>>>>> Yeah maybe not. But entertaining the idea that it is technically
>>>>>>>>>> possible to block - we could perhaps add uapi for userspace to
>>>>>>>>>> mark contexts which want HuC access. Then track if there are any
>>>>>>>>>> such contexts with outstanding submissions and only wait in
>>>>>>>>>> resume if there are. If that would end up significantly less 
>>>>>>>>>> code
>>>>>>>>>> on the i915 side to maintain is an open.
>>>>>>>>>>
>>>>>>>>>> What would be the end result from users point of view in case
>>>>>>>>>> where it suspended during video playback? The proposed solution
>>>>>>>>>> from this series sees the video stuck after resume. Maybe
>>>>>>>>>> compositor blocks as well since I am not sure how well they
>>>>>>>>>> handle one window not providing new data. Probably depends on
>>>> the
>>>>>>>>>> compositor.
>>>>>>>>>>
>>>>>>>>>> And then with a simpler solution definitely the whole resume
>>>>>>>>>> would be delayed by 300ms.
>>>>>>>>>>
>>>>>>>>>> With my ChromeOS hat the stalled media engines does sound like a
>>>>>>>>>> better solution. But with the maintainer hat I'd like all 
>>>>>>>>>> options
>>>>>>>>>> evaluated since there is attractiveness if a good enough 
>>>>>>>>>> solution
>>>>>>>>>> can be achieved with significantly less kernel code.
>>>>>>>>>>
>>>>>>>>>> You say 300ms is typical time for HuC load. How long it is on
>>>>>>>>>> other platforms? If much faster then why is it so slow here?
>>>>>>>>> The GSC itself has to come out of suspend before it can perform
>>>>>>>>> the load, which takes a few tens of ms I believe. AFAIU the 
>>>>>>>>> GSC is
>>>>>>>>> also slower in processing the HuC load and auth compared to the
>>>>>>>>> legacy path. The GSC FW team gave a 250ms limit for the time the
>>>>>>>>> GSC FW needs from start of the resume flow to HuC load complete,
>>>>>>>>> so I bumped that to ~300ms to account for all other SW
>>>>>>>>> interactions, plus a bit of buffer. Note that a bit of the SW
>>>>>>>>> overhead is caused by the fact that we have 2 mei modules in play
>>>>>>>>> here: mei-gsc, which manages the GSC device itself (including
>>>>>>>>> resume), and mei-pxp, which owns the pxp messaging, including HuC
>>>>>>>>> load.
>>>>>>>> And how long on other platforms (not DG2) do you know? Presumably
>>>>>>>> there the wait is on the i915 resume path?
>>>>>>> I don't have "official" expected load times at hand, but looking at
>>>>>>> the BAT boot logs for this series for DG1 I see it takes ~10 ms to
>>>>>>> load both GuC and HuC:
>>>>>>>
>>>>>>> <7>[    8.157838] i915 0000:03:00.0: [drm:intel_huc_init [i915]] 
>>>>>>> GSC
>>>>>>> loads huc=no <6>[    8.158632] i915 0000:03:00.0: [drm] GuC 
>>>>>>> firmware
>>>>>>> i915/dg1_guc_70.1.1.bin version 70.1 <6>[ 8.158634] i915
>>>>>>> 0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 7.9
>>>>>>> <7>[    8.164255] i915 0000:03:00.0: [drm:guc_enable_communication
>>>>>>> [i915]] GuC communication enabled <6>[ 8.166111] i915
>>>>>>> 0000:03:00.0: [drm] HuC authenticated
>>>>>>>
>>>>>>> Note that we increase the GT frequency all the way to the max 
>>>>>>> before
>>>>>>> starting the FW load, which speeds things up.
>>>>>>>
>>>>>>>>>>>> However, do we really need to lie in the getparam? How about
>>>>>>>>>>>> extend or add a new one to separate the loading vs loaded
>>>>>>>>>>>> states? Since userspace does not support DG2 HuC yet this
>>>>>>>>>>>> should be doable.
>>>>>>>>>>> I don't really have a preference here. The media team asked us
>>>>>>>>>>> to do it this way because they wouldn't have a use for the
>>>>>>>>>>> different "in progress" and "done" states. If they're ok with
>>>>>>>>>>> having separate flags that's fine by me.
>>>>>>>>>>> Tony, any feedback here?
>>>>>>>>>> We don't even have any docs in i915_drm.h in terms of what it
>>>> means:
>>>>>>>>>> #define I915_PARAM_HUC_STATUS         42
>>>>>>>>>>
>>>>>>>>>> Seems to be a boolean. Status false vs true? Could you add some
>>>>>>>>>> docs?
>>>>>>>>> There is documentation above intel_huc_check_status(), which is
>>>>>>>>> also updated in this series. I can move that to i915_drm.h.
>>>>>>>> That would be great, thanks.
>>>>>>>>
>>>>>>>> And with so rich return codes already documented and exposed via
>>>>>>>> uapi - would we really need to add anything new for DG2 apart for
>>>>>>>> userspace to know that if zero is returned (not a negative error
>>>>>>>> value) it should retry? I mean is there another negative error
>>>>>>>> missing which would prevent zero transitioning to one?
>>>>>>> I think if the auth fails we currently return 0, because the uc
>>>>>>> state in that case would be "TRANSFERRED", i.e. DMA complete but 
>>>>>>> not
>>>>>>> fully enabled. I don't have anything against changing the FW state
>>>>>>> to "ERROR" in this scenario and leave the 0 to mean "not done yet",
>>>>>>> but I'd prefer the media team to comment on their needs for this
>>>>>>> IOCTL before committing to anything.
>>>>>>
>>>>>> Currently media doesn't differentiate "delayed loading is in
>>>>>> progress" with "HuC is authenticated and running". If the HuC
>>>>>> authentication eventually fails, the user needs to check the debugfs
>>>>>> to know the reason. IMHO, it's not a big problem as this is what we
>>>>>> do even when the IOCTL returns non-zero values. + Carl to comment.
>>>>> (Side note - debugfs can be assumed to not exist so it is not
>>>>> interesting to users.)
>>>>>
>>>>> There isn't currently a "delayed loading is in progress" state, 
>>>>> that's
>>>>> the discussion in this thread, if and how to add it.
>>>>>
>>>>> Getparam it currently documents these states:
>>>>>
>>>>>   -ENODEV if HuC is not present on this platform,
>>>>>   -EOPNOTSUPP if HuC firmware is disabled,
>>>>>   -ENOPKG if HuC firmware was not installed,
>>>>>   -ENOEXEC if HuC firmware is invalid or mismatched,
>>>>>   0 if HuC firmware is not running,
>>>>>   1 if HuC firmware is authenticated and running.
>>>>>
>>>>> This patch proposed to change this to:
>>>>>
>>>>>   1 if HuC firmware is authenticated and running or if delayed 
>>>>> load is
>>>>> in progress,
>>>>>   0 if HuC firmware is not running and delayed load is not in 
>>>>> progress
>>>>>
>>>>> Alternative idea is for DG2 (well in general) to add some more fine
>>>>> grained states, so that i915 does not have to use 1 for both running
>>>>> and loading. This may be adding a new error code for auth fails as
>>>>> Daniele mentioned. Then UMD can know that if 0 is returned and
>>>>> platform is DG2 it needs to query it again since it will 
>>>>> transition to
>>>>> either 1 or error eventually. This would mean the non error states
>>>>> would be:
>>>>>
>>>>>   0 not running (aka loading)
>>>>>   1 running (and authenticated)
>>>>>
>>>>> @Daniele - one more thing - can you make sure in the series (if you
>>>>> haven't already) that if HuC status was in any error before suspend
>>>>> reload is not re-tried on resume? My thinking is that the error is
>>>>> likely to persist and we don't want to impose long delay on every
>>>>> resume afterwards. Makes sense to you?
>>>>>
>>>>> @Tony - one more question for the UMD. Or two.
>>>>>
>>>>> How prevalent is usage of HuC on DG2 depending on what codecs need 
>>>>> it?
>>>>> Do you know in advance, before creating a GEM context, that HuC
>>>>> commands will be sent to the engine or this changes at runtime?
>>>> HuC is needed for all codecs while HW bit rate control (CBR, VBR) 
>>>> is in use.
>>>> It's also used by content protection. And UMD doesn't know if it 
>>>> will be used
>>>> later at context creation time.
>>>>
>>> from UMD perspective, We don’t care much on the normal 
>>> initialization process
>>> because, I could not image that a system is boot up, and user select 
>>> a crypted content
>>> to playback, and huc is still not ready.
>>> of course, We are  also ok to query the huc status twice, and wait 
>>> if the status is "0 not running"
>>> to avoid potential issue.
>>>
>>> I suppose the main possible issue will happen in the 
>>> hibernation/awake process, it is transparent to UMD.
>>> UMD will not call ioctrl  to query huc status in this process, and 
>>> will continue to send command buffer to KMD.
>>
>> I think there is an agreement that it is ok to return 0 to mark the 
>> load still in progress and 1 for load & auth complete. However, 
>> double checking the code it turns out that we currently return 0 on 
>> load failure, even if that's not particularly clear from the comment. 
>> I can easily change that to be an error code, but not sure if that's 
>> considered an API breakage considering it's not a well documented 
>> behavior. I believe that on pre-DG2 userspace considers 1 as ok and 
>> everything else as failure, so changing the ioctl to return an error 
>> code on failure and 0 for load pending (with the latter being a 
>> DG2-esclusive code for now) should be safe, but I'd like confirmation 
>> that I'm not breaking API before sending the relevant code.
>
> The UMD code is like this:
>
>     struct drm_i915_getparam gp;
>     int32_t value;
>     gp.param = I915_PARAM_HUC_STATUS;
>     gp.value = &value;
>     ret = ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp);
>     if (ret != 0)
>         hasHuC = 0
>     else
>         if (value == 0)
>             hasHuC = 0;
>         else
>             hasHuC = 1;
>
> Currently the behavior of i915 is:
>
>     if there is an error, ioctl returns -1, and set errno as 
> ENODEV/EOPNOTSUPP/ENOPKG/ENOEXEC;

Not exactly. The current map of FW states to ioctl behavior is:

FIRMWARE_NOT_SUPPORTED -> return -ENODEV;
FIRMWARE_DISABLED -> return -EOPNOTSUPP;
FIRMWARE_MISSING ->  return -ENOPKG;
FIRMWARE_ERROR -> return -ENOEXEC
FIRMWARE_INIT_FAIL -> return 0, value 0
FIRMWARE_LOAD_FAIL -> return 0, value 0
FIRMWARE_RUNNING -> return 0, value 1

So we do have 2 error state in which the ioctl succeeds.

>
>     otherwise, set *(gp.value) as 0 if HuC is not running, or 1 if HuC 
> is authenticated.
>
> Hi Daniele, which value are you going to change - the "ret" or the 
> "value"?

The idea is to change the 2 FAIL states above to return an error 
(probably -EIO) instead of setting value to 0. This would be compatible 
with your code snippet, because it'll hit the ret != 0 condition. Value 
0 can then be re-purposed for DG2 to indicate "load pending", which 
would not be compatible with your current code, but being a new addition 
for a new platform does not necessarily need to be.

This said, I'm not sure if changing the return behavior of INIT_FAIL and 
LOAD_FAIL is API breakage or not, given that it won't impact userspace 
expectations. Tvrtko, any feedback here?

Thanks,
Daniele

>
>
> Thanks,
>
> Tony
>
>>
>> Thanks,
>> Daniele
>>
>>>
>>>> Thanks,
>>>>
>>>> Tony
>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>