[Intel-gfx] [PATCH v2 1/5] drm/i915/guc: add enable_guc_loading parameter

Fri May 13 14:36:23 UTC 2016

On 10/05/16 15:37, Tvrtko Ursulin wrote:
>
> On 06/05/16 17:39, Dave Gordon wrote:
>> On 29/04/16 16:03, Tvrtko Ursulin wrote:
>
> [snip]
>
>>>> +        goto fail;
>>>> +    if (fw_path == NULL)
>>>> +        goto fail;
>>>> +    if (*fw_path == '\0') {
>>>> +        DRM_ERROR("No GuC firmware known for this platform\n");
>>>
>>> It is not an error unless i915.enable_guc_loading == 2, no? And if best
>>> effort then it is probably debug or informational.
>>
>> No, it's still an ERROR. You're running the driver on a platform for
>> which we don't know what firmware is required. That probably means an
>> old driver on new hardware, so it might not work at all. You can
>> suppress the error by setting i915.enable_guc_loading=0 if you want to
>> try this version of the driver anyway. Also note the difference between
>> path == NULL (no GuC, or no firmware required => not an error) vs. path
>> == "" (has GuC, presumably needs firmware, but we don't know where to
>> look => ERROR).
>
> I think that if i915.enable_guc_loading == 1 then no error should be
> logged. Documentation says that value meand "try to load/use the GuC,
> fallback if not available" and to me that means it is not an error and
> an informational message only should be logged.

OK, this message is now DRM_INFO

>>> Also, don't the checks against fw_path (together with the error or debug
>>> message) belong in the fw fetch function? If they are invalid fw fetch
>>> would have failed and this function would be able to inspect the high
>>> level status of that step here, no?
>>
>> The checks are done in intel_guc_ucode_init(), before fw_fetch() is even
>> called; but that function is void, so can't return failure. (Also, we
>> originally supported asynchronous loading, which also can't return
>> failure). So this function will get called even when we already know
>> that we haven't got any firmware to load, and these tests are indeed
>> checking the high-level status from _init().
>
> Anyhow the special meanings conveyed in fw_path == NULL and *fw_path ==
> 0 are imho just too hard to follow. I see it is not your code, but
> reworking this looked like an opportunity to clean that up. Never mind.

For fw_path:
== NULL  means no firmware needed, not an error.
== ""    means the driver actually doesn't know what to load in this
          situation - the hardware is not recognised. IMHO this should
          be an ERROR, but I've made it just informational now.
== path  the file we want.

Those are the only three possibilities, it's set just once according the 
hardware type and is constant thereafter. It doesn't depend on the 
actual load status. That really isn't very complicated.

>>>> +        goto fail;
>>>> +    }
>>>>
>>>> -    if (guc_fw->guc_fw_fetch_status == GUC_FIRMWARE_NONE)
>>>> -        return 0;
>>>> +    /* Fetch failed, or already fetched but failed to load? */
>>>> +    if (guc_fw->guc_fw_fetch_status != GUC_FIRMWARE_SUCCESS)
>>>> +        goto fail;
>>>> +    if (guc_fw->guc_fw_load_status == GUC_FIRMWARE_FAIL)
>>>> +        goto fail;
>>>
>>> Leads back to the question of load status in this function. So it is
>>> expected we always enter here with load status of none? Is it possible
>>> to get here with the firmware already loaded already?
>>
>> Not *actually* loaded, because it's been erased by poweroff. But the
>> status tracking variables are persistent, so they reflect the last
>> attempt. So on resume, we actually expect "SUCCESS" at this point, and
>> therefore change it back to PENDING below.
>
> Shouldn't the code then update the status variables on suspend/whatever?
> Same as above, I find it very hard to follow the logic here.

It proceeds by identifying all the ways in which the loading process can 
fail; what's left must be The Path To Success (tm) :)

1. Loading forbidden => fail
2. No firmware wanted => fail
3. No firmware known => fail
4. Firmware not fetched => fail
5. Previous load failed => fail

At this point we commit to trying to load the firmware into the h/w.
Then we try three times to reset-and-load the GuC.

>>>> -    if (guc_fw->guc_fw_fetch_status == GUC_FIRMWARE_SUCCESS &&
>>>> -        guc_fw->guc_fw_load_status == GUC_FIRMWARE_FAIL)
>>>> -        return -ENOEXEC;
>>>> +    direct_interrupts_to_host(dev_priv);
>>>>
>>>>       guc_fw->guc_fw_load_status = GUC_FIRMWARE_PENDING;
>>>>
>>>> -    DRM_DEBUG_DRIVER("GuC fw fetch status %s\n",
>>>> -        intel_guc_fw_status_repr(guc_fw->guc_fw_fetch_status));
>>>> -
>>>> -    switch (guc_fw->guc_fw_fetch_status) {
>>>> -    case GUC_FIRMWARE_FAIL:
>>>> -        /* something went wrong :( */
>>>> -        err = -EIO;
>>>> -        goto fail;
>>>> -
>>>> -    case GUC_FIRMWARE_NONE:
>>>> -    case GUC_FIRMWARE_PENDING:
>>>> -    default:
>>>> -        /* "can't happen" */
>>>> -        WARN_ONCE(1, "GuC fw %s invalid guc_fw_fetch_status %s
>>>> [%d]\n",
>>>> -            guc_fw->guc_fw_path,
>>>> -            intel_guc_fw_status_repr(guc_fw->guc_fw_fetch_status),
>>>> -            guc_fw->guc_fw_fetch_status);
>>>> -        err = -ENXIO;
>>>> -        goto fail;
>>>> -
>>>> -    case GUC_FIRMWARE_SUCCESS:
>>>> -        break;
>>>> -    }

Look how many lines of complicated logic this patch removes :)

>>>> +    DRM_DEBUG_DRIVER("GuC fw status: fetch %s, load %s\n",
>>>> +        intel_guc_fw_status_repr(guc_fw->guc_fw_fetch_status),
>>>> +        intel_guc_fw_status_repr(guc_fw->guc_fw_load_status));
>>>>
>>>>       err = i915_guc_submission_init(dev);
>>>>       if (err)
>>>> @@ -483,6 +471,7 @@ int intel_guc_ucode_load(struct drm_device *dev)
>>>>
>>>>   fail:
>>>>       DRM_ERROR("GuC firmware load failed, err %d\n", err);
>>>
>>> Same as above I think error must be dependent on the requested mode.
>>> Some customers are very sensitive to errors which are not really errors
>>> so it is bad to log them when they are not.
>>
>> No, it's still an ERROR. enable_guc_loading must be nonzero, so we've
>> been asked to *try* to load the GuC. If the load fails, that means
>> broken hardware or a corrupt firmware blob, or some other form of system
>> misconfiguration. Even if we're going to fall back to execlist mode,
>> that needs to be reported; for example, it may mean that SLPC is
>> disabled. The user can avoid the error by booting with GuC loading
>> disabled, but they should probably fix the problem instead.
>
> Yes it needs to be reported, but if we are in best effort mode it should
> be informational I think. Depends how you view the errors in the kernel
> log - it may be a firmware loading error, but from the point of view of
> the system log, it is not an error. Everything still works as intended,
> more so, the user has asked us to try and carry on with the alternative.
> So the system (log) experienced no error.

OK, I've moved it later, and it's now DRM_ERROR if we're going to return 
-EIO (GPU wedged) and DRM_INFO otherwise.

>>>> +
>>>>       if (guc_fw->guc_fw_load_status == GUC_FIRMWARE_PENDING)
>>>>           guc_fw->guc_fw_load_status = GUC_FIRMWARE_FAIL;
>>>>
>>>> @@ -490,6 +479,29 @@ int intel_guc_ucode_load(struct drm_device *dev)
>>>>       i915_guc_submission_disable(dev);
>>>>       i915_guc_submission_fini(dev);
>>>>
>>>> +    /*
>>>> +     * We've failed to load the firmware :(
>>>> +     *
>>>> +     * Decide whether to disable GuC submission and fall back to
>>>> +     * execlist mode, and whether to hide the error by returning
>>>> +     * zero or to return -EIO, which the caller will treat as a
>>>> +     * nonfatal error (i.e. it doesn't prevent driver load, but
>>>> +     * marks the GPU as wedged until reset).
>>>> +     */
>>>> +    if (i915.enable_guc_loading > 1) {
>>>> +        err = -EIO;
>>>> +    } else if (HAS_GUC_SCHED(dev) && !HAS_GUC_UCODE(dev)) {
>>>> +        return 0;
>>>
>>> i915_gem_init_hw already guards the call to intel_guc_ucode_load with
>>> HAS_GUC_UCODE so at the moment at least this is a dead branch.
>>>
>>> I don't even understand what is this branch supposed to do? How can
>>> there be a platform with no guc fw but guc scheduling?
>>
>> Imagine a GuC with firmware in ROM :) Or at least flash ...
>>
>> (it already has a BootROM)
>
> Ok but it is still a dead branch, no? Should the HAS_GUC_UCODE guard in
> i915_gem_init_hw be removed and let be handled by this code only?

Yep, I've changed the call condition :)

>>>> +    } else if (i915.enable_guc_submission > 1) {
>>>> +        err = -EIO;
>>>> +    } else {
>>>> +        err = 0;
>>>> +    }
>>>> +
>>>> +    i915.enable_guc_submission = 0;
>>>> +
>>>> +    DRM_DEBUG_DRIVER("falling back to execlist mode, err %d\n", err);
>>>
>>> This would log when i915.enable_guc_loading is set to 0 which would be
>>> confusing. I think in this case the function should bail out much
>>> earlier.
>>
>> That was tested right back at the top of the function! It bailed out
>> very early, so you can't get here with GuC loading disabled.
>
> AFAICS the patch removes the early bailout?!
>
>> Also, that's a DEBUG message, so users won't see it by default.
>
> So it is OK to confuse fellow developers? :D

With the new reporting as described above, this message is promoted to 
DRM_INFO, but only if enable_guc_submission is not already zero. If GuC 
submission is already disabled, we won't report that we're falling back 
to execlist mode.

You can still confuse yourself by disabling GuC loading *without* also 
disabling GuC submission, but in that case the message should help you 
realise you asked for something rather strange :)

>>>>       if (!HAS_GUC_UCODE(dev)) {
>>>>           fw_path = NULL;
>>>> @@ -641,26 +656,21 @@ void intel_guc_ucode_init(struct drm_device *dev)
>>>>           guc_fw->guc_fw_major_wanted = 6;
>>>>           guc_fw->guc_fw_minor_wanted = 1;
>>>>       } else {
>>>> -        i915.enable_guc_submission = false;
>>>>           fw_path = "";    /* unknown device */
>>>>       }
>>>
>>> Confusing block, HAS_GUC_UCODE is defined as (IS_GEN9(dev) &&
>>> !IS_KABYLAKE(dev)) but then here we only support SKL here. Why the
>>> former is then not just IS_SKYLAKE?
>>>
>>> When BXT support is added this still needs to be modified and would only
>>> save touching HAS_GUC_UCODE in the header. But it must be a better
>>> reason?
>>
>> I don't see anything confusing. The logic does not depend on how
>> somebody has defined HAS_GUC_UCODE(), and we shouldn't assume any
>> relation between it and the platform macros. What it says here is:
>>
>> 1.  if this platform *doesn't* have GuC firmware, fw_path is NULL
>> 2a. else if this is SKYLAKE, look for f/w version 6.1
>> 2b. (repeat 2a for each supported platform)
>> 3.  (else) unknown device, path is "" which triggers ERROR later.
>>
>> Imagine a SKL version with GuC uCode in ROM - the HAS_GUC_UCODE() test
>> must take precedence.
>
> My question was, why isn't HAS_GUC_UCODE == IS_SKYLAKE to start with ?
> Then when BXT support is added, which will need code changes anyway, we
> change HAS_GUC_UCODE only.

Probably because the original definition of HAS_GUC_UCODE() was simply 
IS_GEN9() -- POR was that *ALL* Gen9 platforms should have a GuC and 
firmware for it. Then someone else added KBL as an exception.

Anyway, we now have BXT support, so the if-else-if condition ladder has 
a more obvious structure now :)

.Dave.