<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div class="moz-cite-prefix">On 2021-10-20 12:35, Kent Russell
wrote:<br>
</div>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">Currently dmesg doesn't warn when the number of bad pages approaches the</pre>
</blockquote>
<br>
"Currently" is redundant in this sentence as it is already in
present simple tense.<br>
<br>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages</pre>
</blockquote>
<br>
Missing full-stop (period) above.<br>
<br>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">
Cc: Luben Tuikov <a class="moz-txt-link-rfc2396E" href="mailto:luben.tuikov@amd.com"><luben.tuikov@amd.com></a>
Cc: Mukul Joshi <a class="moz-txt-link-rfc2396E" href="mailto:Mukul.Joshi@amd.com"><Mukul.Joshi@amd.com></a>
Signed-off-by: Kent Russell <a class="moz-txt-link-rfc2396E" href="mailto:kent.russell@amd.com"><kent.russell@amd.com></a>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..1ede0f0d6f55 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
if (hdr->header == RAS_TABLE_HDR_VAL) {
+ int threshold = 0;
DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
control->ras_num_recs);
res = __verify_ras_table_checksum(control);
if (res)
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
res);
+
+ /* threshold = 0 means that page retirement is disabled, while
+ * threshold = -1 means default behaviour
+ */
+ if (amdgpu_bad_page_threshold == -1)
+ threshold = ras->bad_page_cnt_threshold;
+ else if (amdgpu_bad_page_threshold > 0)
+ threshold = amdgpu_bad_page_threshold;</pre>
</blockquote>
<br>
I believe we don't need this calculation here as it's already been
done for us in <i>amdgpu_ras_validate_threshold()</i>, in
amdgpu_ras.c.<br>
<br>
I believe you want to use "ras->bad_page_cnt_threshold" here
instead. For instance of this, see a bit further down in this very
function this check including the comment, in italics:<br>
<br>
<font face="monospace"> } else if (hdr->header ==
RAS_TABLE_HDR_BAD &&<br>
amdgpu_bad_page_threshold != 0) {<br>
res = __verify_ras_table_checksum(control);<br>
if (res)<br>
DRM_ERROR("RAS Table incorrect checksum or
error:%d\n",<br>
res);<br>
</font><i><font face="monospace">if
(ras->bad_page_cnt_threshold > control->ras_num_recs) {<br>
/* This means that, the threshold was increased
since<br>
* the last time the system was booted, and now,<br>
* ras->bad_page_cnt_threshold -
control->num_recs > 0,<br>
* so that at least one more record can be saved,<br>
* before the page count threshold is reached.<br>
*/</font></i><br>
<br>
And on the "else", a bit further down, again in italics:<br>
<br>
<font face="monospace"> } else {<br>
*exceed_err_limit = true;<br>
<i>dev_err(adev->dev,</i><i><br>
</i><i> "RAS records:%d exceed threshold:%d, "</i><i><br>
</i><i> "maybe retire this GPU?",</i><i><br>
</i><i> control->ras_num_recs,
ras->bad_page_cnt_threshold);</i><br>
}</font><br>
<br>
<br>
See how it says "<i>records exceed threshold</i>"--well, with this
patch you want to say "<i>records exceed </i>90%<i> of threshold.</i>"
:-) So these are the quantities we gauge each other to.<br>
<br>
Clarification on this below.<br>
<br>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">
+
+ /* Since multiplcation is transitive, a = 9b/10 is the same
+ * as 10a = 9b. Use this for our 90% limit to avoid rounding
+ */
</pre>
</blockquote>
<br>
I really like the format of the comment. But I feel that the comment
itself isn't necessary... at least the way it is written ("9b" may
mean "9 bits" or "9 binary". I'd avoid getting into arithmetic
theory, and remove the comment completely. Anything else (explaining
the math) really distracts from the real purpose of what we're
doing. (After all, this is C, not a class on arithmetic--they who
can, will figure it out.)<br>
<br>
Perhaps something like:<br>
<br>
<font face="monospace">/* Warn if we get past 90% of the threshold.<br>
*/<br>
</font><br>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">+ if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))</pre>
</blockquote>
<br>
Right, so here perhaps we want to do this instead:<br>
<br>
<font face="monospace">diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c<br>
index 98732518543e53..2aab62fa488eba 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c<br>
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct
amdgpu_ras_eeprom_control *control,<br>
if (res)<br>
DRM_ERROR("RAS table incorrect checksum or
error:%d\n",<br>
res);<br>
+ /* Warn if we get past 90% of the threshold.<br>
+ */<br>
+ if (10 * control->ras_num_recs >= 9 *
ras->bad_page_cnt_threshold)<br>
+ DRM_WARN("RAS records:%u exceeds 90%% of
threshold:%d",<br>
+ control->ras_num_recs,<br>
+ ras->bad_page_cnt_threshold);<br>
} else if (hdr->header == RAS_TABLE_HDR_BAD &&<br>
amdgpu_bad_page_threshold != 0) {<br>
res = __verify_ras_table_checksum(control);<br>
</font><br>
Also note that up until this point of the boot process, we don't
qualify the boot by amdgpu_bad_page_threshold and I feel that this
check in this embedded if-conditional shouldn't do that as well.<br>
<br>
Regards,<br>
Luben<br>
<br>
<blockquote type="cite" cite="mid:20211020163520.1167214-1-kent.russell@amd.com">
<pre class="moz-quote-pre" wrap="">
+ DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+ control->ras_num_recs,
+ threshold);
} else if (hdr->header == RAS_TABLE_HDR_BAD &&
amdgpu_bad_page_threshold != 0) {
res = __verify_ras_table_checksum(control);
</pre>
</blockquote>
<br>
</body>
</html>