[PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

Russell, Kent Kent.Russell at amd.com
Thu Oct 21 14:04:27 UTC 2021


[AMD Official Use Only]

My editor won't let me reply in-line without making it look like garbage.

Thanks for the insight, Luben! They're all useful points, especially the consolidation and relying on the threshold_validation which has already occurred before we get to this point (which I should've checked).

And I overdid the transitive multiplication explanation, so I wouldn't have to answer questions about it later. But your concise comment below pretty much covers things and shouldn't cause any unnecessary inquiries.

Kent

From: Tuikov, Luben <Luben.Tuikov at amd.com>
Sent: Wednesday, October 20, 2021 5:48 PM
To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Joshi, Mukul <Mukul.Joshi at amd.com>
Subject: Re: [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold

On 2021-10-20 12:35, Kent Russell wrote:

Currently dmesg doesn't warn when the number of bad pages approaches the

"Currently" is redundant in this sentence as it is already in present simple tense.

Fair point



threshold for page retirement. WARN when the number of bad pages

is at 90% or greater for easier checks and planning, instead of waiting

until the GPU is full of bad pages

Missing full-stop (period) above.







Cc: Luben Tuikov <luben.tuikov at amd.com><mailto:luben.tuikov at amd.com>

Cc: Mukul Joshi <Mukul.Joshi at amd.com><mailto:Mukul.Joshi at amd.com>

Signed-off-by: Kent Russell <kent.russell at amd.com><mailto:kent.russell at amd.com>

---

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++

 1 file changed, 17 insertions(+)



diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

index f4c05ff4b26c..1ede0f0d6f55 100644

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c

@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,

        control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);



        if (hdr->header == RAS_TABLE_HDR_VAL) {

+               int threshold = 0;

                DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",

                                control->ras_num_recs);

                res = __verify_ras_table_checksum(control);

                if (res)

                        DRM_ERROR("RAS table incorrect checksum or error:%d\n",

                                 res);

+

+               /* threshold = 0 means that page retirement is disabled, while

+                * threshold = -1 means default behaviour

+                */

+               if (amdgpu_bad_page_threshold == -1)

+                       threshold = ras->bad_page_cnt_threshold;

+               else if (amdgpu_bad_page_threshold > 0)

+                       threshold = amdgpu_bad_page_threshold;

I believe we don't need this calculation here as it's already been done for us in amdgpu_ras_validate_threshold(), in amdgpu_ras.c.


I believe you want to use "ras->bad_page_cnt_threshold" here instead. For instance of this, see a bit further down in this very function this check including the comment, in italics:

    } else if (hdr->header == RAS_TABLE_HDR_BAD &&
           amdgpu_bad_page_threshold != 0) {
        res = __verify_ras_table_checksum(control);
        if (res)
            DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
                  res);
        if (ras->bad_page_cnt_threshold > control->ras_num_recs) {
            /* This means that, the threshold was increased since
             * the last time the system was booted, and now,
             * ras->bad_page_cnt_threshold - control->num_recs > 0,
             * so that at least one more record can be saved,
             * before the page count threshold is reached.
             */

And on the "else", a bit further down, again in italics:

        } else {
            *exceed_err_limit = true;
            dev_err(adev->dev,
                "RAS records:%d exceed threshold:%d, "
                "maybe retire this GPU?",
                control->ras_num_recs, ras->bad_page_cnt_threshold);
        }


See how it says "records exceed threshold"--well, with this patch you want to say "records exceed 90% of threshold." :-) So these are the quantities we gauge each other to.

Clarification on this below.





+

+               /* Since multiplcation is transitive, a = 9b/10 is the same

+                * as 10a = 9b. Use this for our 90% limit to avoid rounding

+                */

I really like the format of the comment. But I feel that the comment itself isn't necessary... at least the way it is written ("9b" may mean "9 bits" or "9 binary". I'd avoid getting into arithmetic theory, and remove the comment completely. Anything else (explaining the math) really distracts from the real purpose of what we're doing. (After all, this is C, not a class on arithmetic--they who can, will figure it out.)

Perhaps something like:

/* Warn if we get past 90% of the threshold.
 */



+               if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))

Right, so here perhaps we want to do this instead:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 98732518543e53..2aab62fa488eba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
                if (res)
                        DRM_ERROR("RAS table incorrect checksum or error:%d\n",
                                  res);
+               /* Warn if we get past 90% of the threshold.
+                */
+               if (10 * control->ras_num_recs >= 9 * ras->bad_page_cnt_threshold)
+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+                                control->ras_num_recs,
+                                ras->bad_page_cnt_threshold);
        } else if (hdr->header == RAS_TABLE_HDR_BAD &&
                   amdgpu_bad_page_threshold != 0) {
                res = __verify_ras_table_checksum(control);

Also note that up until this point of the boot process, we don't qualify the boot by amdgpu_bad_page_threshold and I feel that this check in this embedded if-conditional shouldn't do that as well.

Regards,
Luben





+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",

+                               control->ras_num_recs,

+                               threshold);

        } else if (hdr->header == RAS_TABLE_HDR_BAD &&

                   amdgpu_bad_page_threshold != 0) {

                res = __verify_ras_table_checksum(control);

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20211021/91be9e28/attachment-0001.htm>


More information about the amd-gfx mailing list