[RFC] drm/xe/guc: Don't support GuC older GuC 70.x releases

Fri Feb 9 00:29:55 UTC 2024

On 2/7/2024 12:40 PM, Lucas De Marchi wrote:
> On Wed, Feb 07, 2024 at 10:34:07AM -0800, Daniele Ceraolo Spurio wrote:
>>
>>
>> On 2/7/2024 8:42 AM, Lucas De Marchi wrote:
>>> +Gustavo who is dealing with DMC firmware lately
>>>
>>> On Wed, Feb 07, 2024 at 03:30:59AM +0000, Matthew Brost wrote:
>>>> On Tue, Feb 06, 2024 at 05:18:50PM -0800, John Harrison wrote:
>>>>> On 2/6/2024 15:41, Daniele Ceraolo Spurio wrote:
>>>>>> Supporting older GuC versions comes with baggage, both on the coding
>>>>>> side (due to interfaces only being available from a certain version
>>>>>> onwards) and on the testing side (due to having to make sure 
>>>>> the driver
>>>>>> works as expected with older GuCs).
>>>>>> Since all of our Xe platform are still under force probe, we haven't
>>>>>> committed to support any specific GuC version and we therefore don't
>>>>>> need to support the older once, which means that we can force 
>>>>> a bottom
>>>>>> limit to what GuC we accept. This allows us to remove any 
>>>>>> conditional
>>>>>> statements based on older GuC versions and also to approach newer
>>>>>> additions knowing that we'll never attempt to load something older
>>>>>> than our minimum requirement.
>>>>>>
>>>>>> RFC: this patch sets the minimum to the current GuC version (70.19),
>>>>>> but that can be moved one way or the other. The main aim here is
>>>>> Ideally, this would be bumped every time we update Xe to a newer 
>>>>> firmware
>>>>> version right up to the point when force probe is lifted. At that 
>>>>> point it
>>>>> becomes fixed and we have to add the version check support back in 
>>>>> for
>>>>> future w/a's and features.
>>>>>
>>>>> Get's my vote :).
>>>
>>> Yeah, but see my other reply... I think we will have to wait the
>>> firmware being available in linux-firmware for that.
>>>
>>> Also, let's kickstart a discussion on our process with some
>>> possible changes so we can get it documented. I think we have a good
>>> opportunity here to start adopting the
>>> https://gitlab.freedesktop.org/drm/firmware repo.
>>>
>>> Rough idea:
>>>
>>> 1) use intel-staging branch with tags for pull requests to
>>>    linux-firmware, just like documented in their readme.
>>>    IMO the naming is rather unfortunate since it would be
>>>    good to use it for (2) below.... but since it's already used
>>>    we can use something else.
>>>
>>>    this would mainly replace the use we have today for
>>>    https://cgit.freedesktop.org/drm/drm-firmware/ ,    which could 
>>> be retired. From  upstream linux-firmware pov the only
>>>    change would be the remote location and that we start using tags
>>>    for the pull requests, coming from a single branch regardless of
>>>    the firmware (guc, huc, dmc, gsc): intel-staging. Once accepted in
>>>    linux-firmware, the branch is fast-forwarded.
>>
>> I think this needs a bit more fleshing out, because before we do a 
>> pull request, we do want to run CI on the blobs. Also, in several 
>> occasions we went through a couple of versions before we closed on 
>> what to push to linux-firmware (e.g. in the latest push we started 
>> with 70.19.1 but then pushed 70.19.2), so we can't go to 
>> intel-staging until we're actually ready to push. I think the process 
>> you have below for mmp blobs should work for this early testing flow 
>> as well, but we might end up with a lot of noise in the 
>> staging-intel-for-CI branch.
>
> that would be a throw away branch where we push stuff to be able to test
> on CI. I don't think the commit history matters much there. The fact

That depends on how CI does things. With the current handling of 
throwaway branches we have on drm-firmware, a CI request can 
accidentally roll back another one. e.g., if we push a throwaway branch 
with a GuC update and then another with a DMC update, the second push 
will roll-back the GuC to what's on the new branch (likely the 
linux-firmware version). That's why there was a suggestion ti use a 
unified branch for CI as well.

> that the firmware is available to match what is in the kernel and that
> there's a documented process for using it in my view trumps the
> this downside.
>
> what I expect would be, considering the LNL case as example:
>
> 1) Start testing with the mmp version:
>
>     a) Add firmware to  drm/firmware intel-staging-for-CI
>     b) Add commit in topic/xe-for-CI on the kernel side to make
>        use of that firmware
>
> 2) Ooops, that has bugs
>
>     a) add a second mmp firmware to drm/firmware intel-staging-for-CI
>     b) replace commit in topic/xe-for-CI on the kernel side
>
> 3) we think we are good, let's try for real
>
>     a) Add lnl_guc_70.bin to drm/firmware intel-staging-for-CI
>     b) replace commit in topic/xe-for-CI on the kernel side
>
> 4) yay, it worked
>
>     a) Add that lnl_gu_70.bin firmware to intel-staging branch and
>        prepare pull request to linux-firmware
>     b) move patch from topic/xe-for-CI to drm-xe-next: i.e., rebase
>        topic/xe-for-CI on top of drm-xe-next leaving that commit as
>        first one. git push topic/xe-for-CI, dim push drm-xe-next (or
>        implement the logic in dim to push 2 branches)
>
>     We may need some time between (a) and (b) depending on where we
>     are on the kernel release cycle: we don't want to submit a
>     kernel pull request before the firmware is available @
>     linux-firmware repo.
>
> Note that the fact we are using mmp makes it more complex, although
> explicit.  Going direct with lnl_gu_70.bin would also work and avoid
> updating the commits on the kernel side.

This works for a completely new release. For updating an existing 
release, we'll have to push, potentially multiple times, all the 
*_guc_70.bin binaries to intel-staging-for-CI. Just to be clear, I have 
nothing against this, just noting that it would generate a lot of noise 
in that branch and potentially use a lot of space on disk.

>
>>
>> We also need some rules to handle the case where there is already a 
>> PR in flight and we need to push some more blobs. This might be as 
>> easy as the committer seeing that there are commits on top of master, 
>> replying to the previous PR to deprecate it, and then generating a 
>> new PR with all the blobs.
>
> the pull requests to linux-firmware would come from tags, not a branch.
> So you have (tip of the branch is on top):
>
>     o <intel-staging> intel: Add lnl_guc_70.bin
>     o <refs/tags/intel-2024-01-30> intel: Update dg2_guc_70.bin  <-- 
> last in flight pull request
>     o intel: Add lnl_dmc.bin
>     o <origin/main> ....  <--  where linux-firmware is at
>
> Looking at amd-staging, it seems to match what they are doing:
> https://gitlab.freedesktop.org/drm/firmware/-/commits/amd-staging?ref_type=heads 
>
>
> see the amd-$DATE tags

Sorry I wasn't very clear in my comment, what I wanted to point out was 
that if we are on a unified branch and we have the PR against a specific 
tag (intel-2024-01-30 in your example) already in flight, how do we 
generate a new PR for the newer commit that comes after the tag (and 
which will have its own new tag)? Does git do some tag magic and handle 
it for us, or do we need to generate a new PR that supersedes the one in 
flight?

>
>>
>>
>>>
>>> 2) mmp firmware versions are only ever pushed to a separate 
>>> staging-intel-for-CI
>>>    branch. There is no pull request in the mailing for this. We can 
>>> either
>>>    push directly to the branch or create MRs in gitlab. CI would start
>>>    using this branch for the extra firmware for platforms instead of
>>>    whatever it's using today to process the pull requests from the
>>>    mailing list.  Or whatever it's using, because I don't know and 
>>> don't
>>>    see it documented anywhere.
>>
>> As long as the CI team is ok with this, I'm all for it.
>>
>>>
>>>    The patch on the kernel side to use the mmp firmware is only ever
>>>    pushed to the topic/xe-for-CI branch since a) the firmware is 
>>> coming from
>>>    a non-official location and b) end users and distro packaging
>>>    shouldn't see a warning when building the kernel due to a possibly
>>>    missing firmware
>>> 3) Raising firmware version requirement for past platforms used as
>>>    SDV can be done **unless** it raises the major version. That's 
>>> because
>>>    end users would start seeing the warning that we avoided in (2).
>>
>> Who are the end users here? If we're talking about older 
>> non-officially supported platforms, the only users should be 
>> developers and they should be able to handle having to update the 
>> firmwares to a newer major versions.
>
> distros and any developer outside Intel. The kernel build system is
> unaware of xe.force_probe. So if you have, after the several macros:
>
> MODULE_FIRMWARE("xe/tgl_guc_71.bin")
>
> It will show up in `modinfo -f firmware xe`. And it will show as a
> warning when installing/packaging a kernel.
>
> It doesn't matter for minor/patch updates because the file name is
> major-only and **running** with that module is protected by the
> force_probe. The major may be updated when it's available in
> linux-firmware, which means i915 started using it (for i915 that would
> be "as an option, with fallback to the previous major release" of
> course).

Ok I get the concern. My assumption here was that we'd only update the 
minimum required version if that version was in linux-firmware even for 
minor updates, hence why I didn't see why a major update would be 
different. I guess we could go with a more relaxed approach where we 
allow the required minor to be updated for force-probe platforms as long 
as the firmware is available on a public/CI branch even if it is not in 
linux-firmware.

Getting back on track with the original purpose of this patch, are you 
ok with setting the minimum to 70.19 if I first push the matching PVC 
70.19 binary (via the old method for now), while we continue sorting out 
how to manage the new repo?

Daniele

>
> Lucas De Marchi
>
>>
>> Daniele
>>
>>>
>>> thoughts?
>>>
>>> Lucas De Marchi
>>>
>>>>>
>>>>
>>>> Mine too.
>>>>
>>>> With that:
>>>> Acked-by: Matthew Brost <matthew.brost at intel.com>
>>>>
>>>>> John.
>>>>>
>>>>>> agreeing to stop supporting very old GuC releases on the newer 
>>>>> driver.
>>>>>>
>>>>>> Signed-off-by: Daniele Ceraolo Spurio 
>>>>> <daniele.ceraolospurio at intel.com>
>>>>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>>>>> Cc: Lucas De Marchi <lucas.demarchi at intel.com>
>>>>>> Cc: Matt Roper <matthew.d.roper at intel.com>
>>>>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>>>>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/xe/xe_guc.c   | 14 ++------------
>>>>>>    drivers/gpu/drm/xe/xe_uc_fw.c | 36 
>>>>> ++++++++++++++---------------------
>>>>>>    2 files changed, 16 insertions(+), 34 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c 
>>>>> b/drivers/gpu/drm/xe/xe_guc.c
>>>>>> index 868208a39829..5e6b27aac495 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>>>>> @@ -132,15 +132,10 @@ static u32 guc_ctl_ads_flags(struct 
>>>>> xe_guc *guc)
>>>>>>        return flags;
>>>>>>    }
>>>>>> -#define GUC_VER(maj, min, pat)    (((maj) << 16) | ((min) << 
>>>>> 8) | (pat))
>>>>>> -
>>>>>>    static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>>>>>>    {
>>>>>>        struct xe_device *xe = guc_to_xe(guc);
>>>>>>        struct xe_gt *gt = guc_to_gt(guc);
>>>>>> -    struct xe_uc_fw *uc_fw = &guc->fw;
>>>>>> -    struct xe_uc_fw_version *version = 
>>>>> &uc_fw->versions.found[XE_UC_FW_VER_RELEASE];
>>>>>> -
>>>>>>        u32 flags = 0;
>>>>>>        if (XE_WA(gt, 22012773006))
>>>>>> @@ -170,13 +165,8 @@ static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>>>>>>        if (XE_WA(gt, 1509372804))
>>>>>>            flags |= GUC_WA_RENDER_RST_RC6_EXIT;
>>>>>> -    if (XE_WA(gt, 14018913170)) {
>>>>>> -        if (GUC_VER(version->major, version->minor, 
>>>>> version->patch) >= GUC_VER(70, 7, 0))
>>>>>> -            flags |= GUC_WA_ENABLE_TSC_CHECK_ON_RC6;
>>>>>> -        else
>>>>>> -            drm_dbg(&xe->drm, "Skip WA 14018913170: GUC 
>>>>> version expected >= 70.7.0, found %u.%u.%u\n",
>>>>>> - version->major, version->minor, version->patch);
>>>>>> -    }
>>>>>> +    if (XE_WA(gt, 14018913170))
>>>>>> +        flags |= GUC_WA_ENABLE_TSC_CHECK_ON_RC6;
>>>>>>        return flags;
>>>>>>    }
>>>>>> diff --git a/drivers/gpu/drm/xe/xe_uc_fw.c 
>>>>> b/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>> index 4714f2c8d2ba..e5bf59616f3d 100644
>>>>>> --- a/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>> +++ b/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>> @@ -296,36 +296,28 @@ static void uc_fw_fini(struct drm_device 
>>>>> *drm, void *arg)
>>>>>> xe_uc_fw_change_status(uc_fw, XE_UC_FIRMWARE_SELECTED);
>>>>>>    }
>>>>>> -static void guc_read_css_info(struct xe_uc_fw *uc_fw, struct 
>>>>> uc_css_header *css)
>>>>>> +static int guc_read_css_info(struct xe_uc_fw *uc_fw, struct 
>>>>> uc_css_header *css)
>>>>>>    {
>>>>>>        struct xe_gt *gt = uc_fw_to_gt(uc_fw);
>>>>>>        struct xe_uc_fw_version *release = 
>>>>> &uc_fw->versions.found[XE_UC_FW_VER_RELEASE];
>>>>>>        struct xe_uc_fw_version *compatibility = 
>>>>> &uc_fw->versions.found[XE_UC_FW_VER_COMPATIBILITY];
>>>>>>        xe_gt_assert(gt, uc_fw->type == XE_UC_FW_TYPE_GUC);
>>>>>> -    xe_gt_assert(gt, release->major >= 70);
>>>>>> -
>>>>>> -    if (release->major > 70 || release->minor >= 6) {
>>>>>> -        /* v70.6.0 adds CSS header support */
>>>>>> -        compatibility->major = FIELD_GET(CSS_SW_VERSION_UC_MAJOR,
>>>>>> -                         css->submission_version);
>>>>>> -        compatibility->minor = FIELD_GET(CSS_SW_VERSION_UC_MINOR,
>>>>>> -                         css->submission_version);
>>>>>> -        compatibility->patch = FIELD_GET(CSS_SW_VERSION_UC_PATCH,
>>>>>> -                         css->submission_version);
>>>>>> -    } else if (release->minor >= 3) {
>>>>>> -        /* v70.3.0 introduced v1.1.0 */
>>>>>> -        compatibility->major = 1;
>>>>>> -        compatibility->minor = 1;
>>>>>> -        compatibility->patch = 0;
>>>>>> -    } else {
>>>>>> -        /* v70.0.0 introduced v1.0.0 */
>>>>>> -        compatibility->major = 1;
>>>>>> -        compatibility->minor = 0;
>>>>>> -        compatibility->patch = 0;
>>>>>> +
>>>>>> +    /* We don't support GuC releases older than 70.19 */
>>>>>> +    if (release->major < 70 || (release->major == 70 && 
>>>>> release->minor < 19)) {
>>>>>> +        xe_gt_err(gt, "Unsupported GuC v%u.%u! v70.19 or 
>>>>> newer is required\n",
>>>>>> +              release->major, release->minor);
>>>>>> +        return -EINVAL;
>>>>>>        }
>>>>>> +    compatibility->major = FIELD_GET(CSS_SW_VERSION_UC_MAJOR, 
>>>>> css->submission_version);
>>>>>> +    compatibility->minor = FIELD_GET(CSS_SW_VERSION_UC_MINOR, 
>>>>> css->submission_version);
>>>>>> +    compatibility->patch = FIELD_GET(CSS_SW_VERSION_UC_PATCH, 
>>>>> css->submission_version);
>>>>>> +
>>>>>>        uc_fw->private_data_size = css->private_data_size;
>>>>>> +
>>>>>> +    return 0;
>>>>>>    }
>>>>>>    int xe_uc_fw_check_version_requirements(struct xe_uc_fw *uc_fw)
>>>>>> @@ -424,7 +416,7 @@ static int parse_css_header(struct 
>>>>> xe_uc_fw *uc_fw, const void *fw_data, size_t
>>>>>>        release->patch = FIELD_GET(CSS_SW_VERSION_UC_PATCH, 
>>>>> css->sw_version);
>>>>>>        if (uc_fw->type == XE_UC_FW_TYPE_GUC)
>>>>>> -        guc_read_css_info(uc_fw, css);
>>>>>> +        return guc_read_css_info(uc_fw, css);
>>>>>>        return 0;
>>>>>>    }
>>>>>
>>