[RFC] drm/xe/guc: Don't support GuC older GuC 70.x releases

Wed Feb 14 21:33:39 UTC 2024

On 2/14/2024 11:22 AM, Gustavo Sousa wrote:
> Quoting Lucas De Marchi (2024-02-09 03:01:26-03:00)
>> On Thu, Feb 08, 2024 at 04:29:55PM -0800, Daniele Ceraolo Spurio wrote:
>>>
>>> On 2/7/2024 12:40 PM, Lucas De Marchi wrote:
>>>> On Wed, Feb 07, 2024 at 10:34:07AM -0800, Daniele Ceraolo Spurio wrote:
>>>>>
>>>>> On 2/7/2024 8:42 AM, Lucas De Marchi wrote:
>>>>>> +Gustavo who is dealing with DMC firmware lately
> Hey, guys. Sorry for being so late for the party...
>
>>>>>> On Wed, Feb 07, 2024 at 03:30:59AM +0000, Matthew Brost wrote:
>>>>>>> On Tue, Feb 06, 2024 at 05:18:50PM -0800, John Harrison wrote:
>>>>>>>> On 2/6/2024 15:41, Daniele Ceraolo Spurio wrote:
>>>>>>>>> Supporting older GuC versions comes with baggage, both on the coding
>>>>>>>>> side (due to interfaces only being available from a certain version
>>>>>>>>> onwards) and on the testing side (due to having to make
>>>>>>>>> sure
>>>>>>>> the driver
>>>>>>>>> works as expected with older GuCs).
>>>>>>>>> Since all of our Xe platform are still under force probe, we haven't
>>>>>>>>> committed to support any specific GuC version and we therefore don't
>>>>>>>>> need to support the older once, which means that we can
>>>>>>>>> force
>>>>>>>> a bottom
>>>>>>>>> limit to what GuC we accept. This allows us to remove any
>>>>>>>>> conditional
>>>>>>>>> statements based on older GuC versions and also to approach newer
>>>>>>>>> additions knowing that we'll never attempt to load something older
>>>>>>>>> than our minimum requirement.
>>>>>>>>>
>>>>>>>>> RFC: this patch sets the minimum to the current GuC version (70.19),
>>>>>>>>> but that can be moved one way or the other. The main aim here is
>>>>>>>> Ideally, this would be bumped every time we update Xe to a
>>>>>>>> newer firmware
>>>>>>>> version right up to the point when force probe is lifted. At
>>>>>>>> that point it
>>>>>>>> becomes fixed and we have to add the version check support
>>>>>>>> back in for
>>>>>>>> future w/a's and features.
>>>>>>>>
>>>>>>>> Get's my vote :).
>>>>>> Yeah, but see my other reply... I think we will have to wait the
>>>>>> firmware being available in linux-firmware for that.
>>>>>>
>>>>>> Also, let's kickstart a discussion on our process with some
>>>>>> possible changes so we can get it documented. I think we have a good
>>>>>> opportunity here to start adopting the
>>>>>> https://gitlab.freedesktop.org/drm/firmware  repo.
>>>>>>
>>>>>> Rough idea:
>>>>>>
>>>>>> 1) use intel-staging branch with tags for pull requests to
>>>>>>     linux-firmware, just like documented in their readme.
>>>>>>     IMO the naming is rather unfortunate since it would be
>>>>>>     good to use it for (2) below.... but since it's already used
> Agreed. The <vendor>-staging branches, as proposed in the README, look
> more like some sort of <vendor>-next type of branch.
>
> I think it would be good if vendors could have their own `<vendor>/*`
> (or `<vendor>-*`) namespace for branch names so that, while having some
> common conventions, they could also adapt parts of it to their needs.
> For example, we could have branches like intel/next for (1) and intel/ci
> for (2). Not sure how easy it is to do this now, though.
>
>>>>>>     we can use something else.
>>>>>>
>>>>>>     this would mainly replace the use we have today for
>>>>>>     https://cgit.freedesktop.org/drm/drm-firmware/  ,    which
>>>>>> could be retired. From  upstream linux-firmware pov the only
>>>>>>     change would be the remote location and that we start using tags
>>>>>>     for the pull requests, coming from a single branch regardless of
>>>>>>     the firmware (guc, huc, dmc, gsc): intel-staging. Once accepted in
>>>>>>     linux-firmware, the branch is fast-forwarded.
>>>>> I think this needs a bit more fleshing out, because before we do a
>>>>> pull request, we do want to run CI on the blobs. Also, in several
>>>>> occasions we went through a couple of versions before we closed on
>>>>> what to push to linux-firmware (e.g. in the latest push we started
>>>>> with 70.19.1 but then pushed 70.19.2), so we can't go to
>>>>> intel-staging until we're actually ready to push. I think the
>>>>> process you have below for mmp blobs should work for this early
>>>>> testing flow as well, but we might end up with a lot of noise in
>>>>> the staging-intel-for-CI branch.
>>>> that would be a throw away branch where we push stuff to be able to test
>>>> on CI. I don't think the commit history matters much there. The fact
>>> That depends on how CI does things. With the current handling of
>>> throwaway branches we have on drm-firmware, a CI request can
>>> accidentally roll back another one. e.g., if we push a throwaway
>>> branch with a GuC update and then another with a DMC update, the
>>> second push will roll-back the GuC to what's on the new branch (likely
>>> the linux-firmware version). That's why there was a suggestion ti use
>>> a unified branch for CI as well.
>> not sure we are talking about the same thing. It is a unified branch for
>> CI: staging-intel-for-CI is where the mmp +
>> about-to-be-upstreamed-for-the-first-time firmware blobs are added,
>> regardless if it's guc, dmc, huc, etc. IMO it's much simpler since CI
>> basically has to take the additional firmware from this 1 branch. No
>> risk of rolling back another firmware because of the new one.
> Maybe we are having some confusion here because of the term "throwaway"
> for intel-staging-for-ci?
>
> We use throwaway branches for the current process, but I guess
> intel-staging-for-ci would not really be a "throwaway" branch per se and
> a unified one (as already mentioned above).
>
> I think using intel-staging-for-ci will be okay if teams take the care
> of only adding/updating/removing blobs they are responsible for.

Just to clarify my POV, I see 3 use cases:

1) mmp + about-to-be-upstreamed-for-the-first-time blobs 2) CI on 
updates to existing blobs 3) Official push to linux-firmware for #1 we 
agreed to use intel-staging-for-ci while for #3 we have intel-staging. I 
was just saying that we should use intel-staging-for-ci for #2 as well, 
which however might create a lot of churn on that branch.

Daniele

>
>>>
>>>> that the firmware is available to match what is in the kernel and that
>>>> there's a documented process for using it in my view trumps the
>>>> this downside.
>>>>
>>>> what I expect would be, considering the LNL case as example:
>>>>
>>>> 1) Start testing with the mmp version:
>>>>
>>>>      a) Add firmware to  drm/firmware intel-staging-for-CI
>>>>      b) Add commit in topic/xe-for-CI on the kernel side to make
>>>>         use of that firmware
> How would we check for CI after (b)?
>
> For DMC, I have been doing something similar. Differences are:
>
>    * for (a), I am using an intel-ci branch on drm/drm-firmware and send
>      a pull request to intel-gfx so that CI makes the mmp blobs
>      available;
>
>    * for (b), I send a "[CI]"-tagged patch to intel-gfx making the kernel
>      explicitly use that fully versioned blob path. One advantage here is
>      that I keep a broken DMC release from causing CI noise on existing
>      unrelated patch series.
>
>>>> 2) Ooops, that has bugs
>>>>
>>>>      a) add a second mmp firmware to drm/firmware intel-staging-for-CI
>>>>      b) replace commit in topic/xe-for-CI on the kernel side
>>>>
>>>> 3) we think we are good, let's try for real
>>>>
>>>>      a) Add lnl_guc_70.bin to drm/firmware intel-staging-for-CI
>>>>      b) replace commit in topic/xe-for-CI on the kernel side
>>>>
>>>> 4) yay, it worked
>>>>
>>>>      a) Add that lnl_gu_70.bin firmware to intel-staging branch and
>>>>         prepare pull request to linux-firmware
>>>>      b) move patch from topic/xe-for-CI to drm-xe-next: i.e., rebase
>>>>         topic/xe-for-CI on top of drm-xe-next leaving that commit as
>>>>         first one. git push topic/xe-for-CI, dim push drm-xe-next (or
>>>>         implement the logic in dim to push 2 branches)
>>>>
>>>>      We may need some time between (a) and (b) depending on where we
>>>>      are on the kernel release cycle: we don't want to submit a
>>>>      kernel pull request before the firmware is available @
>>>>      linux-firmware repo.
>>>>
>>>> Note that the fact we are using mmp makes it more complex, although
>>>> explicit.  Going direct with lnl_gu_70.bin would also work and avoid
>>>> updating the commits on the kernel side.
>>> This works for a completely new release. For updating an existing
>>> release, we'll have to push, potentially multiple times, all the
>>> *_guc_70.bin binaries to intel-staging-for-CI. Just to be clear, I
>>> have nothing against this, just noting that it would generate a lot of
>>> noise in that branch and potentially use a lot of space on disk.
>>>
>>>>> We also need some rules to handle the case where there is already
>>>>> a PR in flight and we need to push some more blobs. This might be
>>>>> as easy as the committer seeing that there are commits on top of
>>>>> master, replying to the previous PR to deprecate it, and then
>>>>> generating a new PR with all the blobs.
>>>> the pull requests to linux-firmware would come from tags, not a branch.
>>>> So you have (tip of the branch is on top):
>>>>
>>>>      o <intel-staging> intel: Add lnl_guc_70.bin
>>>>      o <refs/tags/intel-2024-01-30> intel: Update dg2_guc_70.bin  <--
>>>> last in flight pull request
>>>>      o intel: Add lnl_dmc.bin
>>>>      o <origin/main> ....  <--  where linux-firmware is at
>>>>
>>>> Looking at amd-staging, it seems to match what they are doing:
>>>> https://gitlab.freedesktop.org/drm/firmware/-/commits/amd-staging?ref_type=heads
>>>>
>>>>
>>>> see the amd-$DATE tags
>>> Sorry I wasn't very clear in my comment, what I wanted to point out
>>> was that if we are on a unified branch and we have the PR against a
>>> specific tag (intel-2024-01-30 in your example) already in flight, how
>>> do we generate a new PR for the newer commit that comes after the tag
>>> (and which will have its own new tag)? Does git do some tag magic and
>>> handle it for us, or do we need to generate a new PR that supersedes
>>> the one in flight?
>> humn... there is no magic, the old tag is an ascendent path of the new
>> one. But as I said, just coordinating with the few people updating
>> firmware who/when will do the pull request should be sufficient for
>> avoiding a pull request when there's already another one in flight.
> While I see the benefit of having a unified intel-staging-for-ci, I
> think I'm failing to see much benefit of having a unified intel-staging
> here.
>
> Wouldn't it be better if pull request were independent of each other? If
> we had an intel/* (or intel-*) branching namespace, we could keep using
> throwaway branches for the pull requests and have the discipline of
> removing them when not needed anymore.
>
> --
> Gustavo Sousa
>
>>
>>>
>>>>>
>>>>>> 2) mmp firmware versions are only ever pushed to a separate
>>>>>> staging-intel-for-CI
>>>>>>     branch. There is no pull request in the mailing for this. We
>>>>>> can either
>>>>>>     push directly to the branch or create MRs in gitlab. CI would start
>>>>>>     using this branch for the extra firmware for platforms instead of
>>>>>>     whatever it's using today to process the pull requests from the
>>>>>>     mailing list.  Or whatever it's using, because I don't know
>>>>>> and don't
>>>>>>     see it documented anywhere.
>>>>> As long as the CI team is ok with this, I'm all for it.
>>>>>
>>>>>>     The patch on the kernel side to use the mmp firmware is only ever
>>>>>>     pushed to the topic/xe-for-CI branch since a) the firmware is
>>>>>> coming from
>>>>>>     a non-official location and b) end users and distro packaging
>>>>>>     shouldn't see a warning when building the kernel due to a possibly
>>>>>>     missing firmware
>>>>>> 3) Raising firmware version requirement for past platforms used as
>>>>>>     SDV can be done **unless** it raises the major version.
>>>>>> That's because
>>>>>>     end users would start seeing the warning that we avoided in (2).
>>>>> Who are the end users here? If we're talking about older
>>>>> non-officially supported platforms, the only users should be
>>>>> developers and they should be able to handle having to update the
>>>>> firmwares to a newer major versions.
>>>> distros and any developer outside Intel. The kernel build system is
>>>> unaware of xe.force_probe. So if you have, after the several macros:
>>>>
>>>> MODULE_FIRMWARE("xe/tgl_guc_71.bin")
>>>>
>>>> It will show up in `modinfo -f firmware xe`. And it will show as a
>>>> warning when installing/packaging a kernel.
>>>>
>>>> It doesn't matter for minor/patch updates because the file name is
>>>> major-only and **running** with that module is protected by the
>>>> force_probe. The major may be updated when it's available in
>>>> linux-firmware, which means i915 started using it (for i915 that would
>>>> be "as an option, with fallback to the previous major release" of
>>>> course).
>>> Ok I get the concern. My assumption here was that we'd only update the
>>> minimum required version if that version was in linux-firmware even
>>> for minor updates, hence why I didn't see why a major update would be
>>> different. I guess we could go with a more relaxed approach where we
>>> allow the required minor to be updated for force-probe platforms as
>>> long as the firmware is available on a public/CI branch even if it is
>>> not in linux-firmware.
>> yep, I don't see it causing issues to end users.
>>
>>> Getting back on track with the original purpose of this patch, are you
>>> ok with setting the minimum to 70.19 if I first push the matching PVC
>>> 70.19 binary (via the old method for now), while we continue sorting
>>> out how to manage the new repo?
>> yes.
>>
>> Lucas De Marchi
>>
>>> Daniele
>>>
>>>> Lucas De Marchi
>>>>
>>>>> Daniele
>>>>>
>>>>>> thoughts?
>>>>>>
>>>>>> Lucas De Marchi
>>>>>>
>>>>>>> Mine too.
>>>>>>>
>>>>>>> With that:
>>>>>>> Acked-by: Matthew Brost<matthew.brost at intel.com>
>>>>>>>
>>>>>>>> John.
>>>>>>>>
>>>>>>>>> agreeing to stop supporting very old GuC releases on the
>>>>>>>>> newer
>>>>>>>> driver.
>>>>>>>>> Signed-off-by: Daniele Ceraolo Spurio
>>>>>>>> <daniele.ceraolospurio at intel.com>
>>>>>>>>> Cc: John Harrison<John.C.Harrison at Intel.com>
>>>>>>>>> Cc: Lucas De Marchi<lucas.demarchi at intel.com>
>>>>>>>>> Cc: Matt Roper<matthew.d.roper at intel.com>
>>>>>>>>> Cc: Matthew Brost<matthew.brost at intel.com>
>>>>>>>>> Cc: Rodrigo Vivi<rodrigo.vivi at intel.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/gpu/drm/xe/xe_guc.c   | 14 ++------------
>>>>>>>>>     drivers/gpu/drm/xe/xe_uc_fw.c | 36
>>>>>>>> ++++++++++++++---------------------
>>>>>>>>>     2 files changed, 16 insertions(+), 34 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_guc.c
>>>>>>>>> index 868208a39829..5e6b27aac495 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_guc.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_guc.c
>>>>>>>>> @@ -132,15 +132,10 @@ static u32 guc_ctl_ads_flags(struct
>>>>>>>> xe_guc *guc)
>>>>>>>>>         return flags;
>>>>>>>>>     }
>>>>>>>>> -#define GUC_VER(maj, min, pat)    (((maj) << 16) | ((min)
>>>>>>>>> <<
>>>>>>>> 8) | (pat))
>>>>>>>>> -
>>>>>>>>>     static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>>>>>>>>>     {
>>>>>>>>>         struct xe_device *xe = guc_to_xe(guc);
>>>>>>>>>         struct xe_gt *gt = guc_to_gt(guc);
>>>>>>>>> -    struct xe_uc_fw *uc_fw = &guc->fw;
>>>>>>>>> -    struct xe_uc_fw_version *version =
>>>>>>>> &uc_fw->versions.found[XE_UC_FW_VER_RELEASE];
>>>>>>>>> -
>>>>>>>>>         u32 flags = 0;
>>>>>>>>>         if (XE_WA(gt, 22012773006))
>>>>>>>>> @@ -170,13 +165,8 @@ static u32 guc_ctl_wa_flags(struct xe_guc *guc)
>>>>>>>>>         if (XE_WA(gt, 1509372804))
>>>>>>>>>             flags |= GUC_WA_RENDER_RST_RC6_EXIT;
>>>>>>>>> -    if (XE_WA(gt, 14018913170)) {
>>>>>>>>> -        if (GUC_VER(version->major, version->minor,
>>>>>>>> version->patch) >= GUC_VER(70, 7, 0))
>>>>>>>>> -            flags |= GUC_WA_ENABLE_TSC_CHECK_ON_RC6;
>>>>>>>>> -        else
>>>>>>>>> -            drm_dbg(&xe->drm, "Skip WA 14018913170: GUC
>>>>>>>> version expected >= 70.7.0, found %u.%u.%u\n",
>>>>>>>>> - version->major, version->minor, version->patch);
>>>>>>>>> -    }
>>>>>>>>> +    if (XE_WA(gt, 14018913170))
>>>>>>>>> +        flags |= GUC_WA_ENABLE_TSC_CHECK_ON_RC6;
>>>>>>>>>         return flags;
>>>>>>>>>     }
>>>>>>>>> diff --git a/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>>>> b/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>>>>> index 4714f2c8d2ba..e5bf59616f3d 100644
>>>>>>>>> --- a/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>>>>> +++ b/drivers/gpu/drm/xe/xe_uc_fw.c
>>>>>>>>> @@ -296,36 +296,28 @@ static void uc_fw_fini(struct
>>>>>>>>> drm_device
>>>>>>>> *drm, void *arg)
>>>>>>>>> xe_uc_fw_change_status(uc_fw, XE_UC_FIRMWARE_SELECTED);
>>>>>>>>>     }
>>>>>>>>> -static void guc_read_css_info(struct xe_uc_fw *uc_fw,
>>>>>>>>> struct
>>>>>>>> uc_css_header *css)
>>>>>>>>> +static int guc_read_css_info(struct xe_uc_fw *uc_fw,
>>>>>>>>> struct
>>>>>>>> uc_css_header *css)
>>>>>>>>>     {
>>>>>>>>>         struct xe_gt *gt = uc_fw_to_gt(uc_fw);
>>>>>>>>>         struct xe_uc_fw_version *release =
>>>>>>>> &uc_fw->versions.found[XE_UC_FW_VER_RELEASE];
>>>>>>>>>         struct xe_uc_fw_version *compatibility =
>>>>>>>> &uc_fw->versions.found[XE_UC_FW_VER_COMPATIBILITY];
>>>>>>>>>         xe_gt_assert(gt, uc_fw->type == XE_UC_FW_TYPE_GUC);
>>>>>>>>> -    xe_gt_assert(gt, release->major >= 70);
>>>>>>>>> -
>>>>>>>>> -    if (release->major > 70 || release->minor >= 6) {
>>>>>>>>> -        /* v70.6.0 adds CSS header support */
>>>>>>>>> -        compatibility->major = FIELD_GET(CSS_SW_VERSION_UC_MAJOR,
>>>>>>>>> -                         css->submission_version);
>>>>>>>>> -        compatibility->minor = FIELD_GET(CSS_SW_VERSION_UC_MINOR,
>>>>>>>>> -                         css->submission_version);
>>>>>>>>> -        compatibility->patch = FIELD_GET(CSS_SW_VERSION_UC_PATCH,
>>>>>>>>> -                         css->submission_version);
>>>>>>>>> -    } else if (release->minor >= 3) {
>>>>>>>>> -        /* v70.3.0 introduced v1.1.0 */
>>>>>>>>> -        compatibility->major = 1;
>>>>>>>>> -        compatibility->minor = 1;
>>>>>>>>> -        compatibility->patch = 0;
>>>>>>>>> -    } else {
>>>>>>>>> -        /* v70.0.0 introduced v1.0.0 */
>>>>>>>>> -        compatibility->major = 1;
>>>>>>>>> -        compatibility->minor = 0;
>>>>>>>>> -        compatibility->patch = 0;
>>>>>>>>> +
>>>>>>>>> +    /* We don't support GuC releases older than 70.19 */
>>>>>>>>> +    if (release->major < 70 || (release->major == 70 &&
>>>>>>>> release->minor < 19)) {
>>>>>>>>> +        xe_gt_err(gt, "Unsupported GuC v%u.%u! v70.19 or
>>>>>>>> newer is required\n",
>>>>>>>>> +              release->major, release->minor);
>>>>>>>>> +        return -EINVAL;
>>>>>>>>>         }
>>>>>>>>> +    compatibility->major =
>>>>>>>>> FIELD_GET(CSS_SW_VERSION_UC_MAJOR,
>>>>>>>> css->submission_version);
>>>>>>>>> +    compatibility->minor =
>>>>>>>>> FIELD_GET(CSS_SW_VERSION_UC_MINOR,
>>>>>>>> css->submission_version);
>>>>>>>>> +    compatibility->patch =
>>>>>>>>> FIELD_GET(CSS_SW_VERSION_UC_PATCH,
>>>>>>>> css->submission_version);
>>>>>>>>> +
>>>>>>>>>         uc_fw->private_data_size = css->private_data_size;
>>>>>>>>> +
>>>>>>>>> +    return 0;
>>>>>>>>>     }
>>>>>>>>>     int xe_uc_fw_check_version_requirements(struct xe_uc_fw *uc_fw)
>>>>>>>>> @@ -424,7 +416,7 @@ static int parse_css_header(struct
>>>>>>>> xe_uc_fw *uc_fw, const void *fw_data, size_t
>>>>>>>>>         release->patch = FIELD_GET(CSS_SW_VERSION_UC_PATCH,
>>>>>>>> css->sw_version);
>>>>>>>>>         if (uc_fw->type == XE_UC_FW_TYPE_GUC)
>>>>>>>>> -        guc_read_css_info(uc_fw, css);
>>>>>>>>> +        return guc_read_css_info(uc_fw, css);
>>>>>>>>>         return 0;
>>>>>>>>>     }
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-xe/attachments/20240214/c6e59f04/attachment-0001.htm>