<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 7/26/2024 7:53 PM, Khatri, Sunil
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:f967ce91-dd88-4542-8340-1e61813eb780@amd.com">
      <br>
      On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
      <br>
      <blockquote type="cite">
        <br>
        On 7/26/2024 6:42 PM, Alex Deucher wrote:
        <br>
        <blockquote type="cite">On Fri, Jul 26, 2024 at 8:48 AM Sunil
          Khatri <a class="moz-txt-link-rfc2396E" href="mailto:sunil.khatri@amd.com"><sunil.khatri@amd.com></a> wrote:
          <br>
          <blockquote type="cite">Problem:
            <br>
            IP dump right now is done post suspend of
            <br>
            all IP's which for some IP's could change power
            <br>
            state and software state too which we do not want
            <br>
            to reflect in the dump as it might not be same at
            <br>
            the time of hang.
            <br>
            <br>
            Solution:
            <br>
            IP should be dumped as close to the HW state when
            <br>
            the GPU was in hung state without trying to reinitialize
            <br>
            any resource.
            <br>
            <br>
            Signed-off-by: Sunil Khatri <a class="moz-txt-link-rfc2396E" href="mailto:sunil.khatri@amd.com"><sunil.khatri@amd.com></a>
            <br>
          </blockquote>
          Acked-by: Alex Deucher <a class="moz-txt-link-rfc2396E" href="mailto:alexander.deucher@amd.com"><alexander.deucher@amd.com></a>
          <br>
          <br>
          <blockquote type="cite">---
            <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
            +++++++++++-----------
            <br>
              1 file changed, 30 insertions(+), 30 deletions(-)
            <br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            <br>
            index 730dae77570c..74f6f15e73b5 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            <br>
            @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
            amdgpu_device *adev)
            <br>
                     return ret;
            <br>
              }
            <br>
            <br>
            +static int amdgpu_reset_reg_dumps(struct amdgpu_device
            *adev)
            <br>
            +{
            <br>
            +       int i;
            <br>
            +
            <br>
            +      
            lockdep_assert_held(&adev->reset_domain->sem);
            <br>
            +
            <br>
            +       for (i = 0; i < adev->reset_info.num_regs;
            i++) {
            <br>
            +               adev->reset_info.reset_dump_reg_value[i]
            =
            <br>
            +                      
            RREG32(adev->reset_info.reset_dump_reg_list[i]);
            <br>
          </blockquote>
        </blockquote>
        A suspend also involves power/clock ungate. When reg dump is
        moved
        <br>
        earlier, I'm not sure if this read works for all. If it's left
        to
        <br>
        individual IP call backs, they could just do the same or better
        to move
        <br>
        these up before a dump.
        <br>
      </blockquote>
      Suspend also put the status.hw = false and each IP in their
      respective suspend state which i feel does change the state of the
      HW.
      <br>
      To get the correct snapshot of the GPU register we should not be
      fiddling with the HW IP at least till we capture the dump and that
      is the intention behind the change.
      <br>
      <br>
      Do you think there is a problem in this approach?
      <br>
      <blockquote type="cite">
        <br>
                 amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
        <br>
                 amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
        <br>
      </blockquote>
      Adding this does sounds better to enable just before we dump the
      registers but i am not very sure if ungating would be clean here
      or not. i Could try quickly adding these two calls just before
      dump.
      <br>
      <br>
      All i am worried if it does change some register reflecting the
      original state of registers at dump.
      <br>
      <br>
      What u suggest ?
      <br>
      Regards
      <br>
      Sunil
      <br>
    </blockquote>
    I quickly validated on Navi22 by adding below call just before we
    dump registers <br>
    <div style="color: #cccccc;background-color: #1f1f1f;font-family: Consolas, 'Courier New', monospace;font-weight: normal;font-size: 14px;line-height: 19px;white-space: pre;"><div><span style="color: #c586c0;">if</span><span style="color: #cccccc;"> (</span><span style="color: #d4d4d4;">!</span><span style="color: #dcdcaa;">test_bit</span><span style="color: #cccccc;">(</span><span style="color: #4fc1ff;">AMDGPU_SKIP_COREDUMP</span><span style="color: #cccccc;">, </span><span style="color: #d4d4d4;">&</span><span style="color: #9cdcfe;">reset_context</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">flags</span><span style="color: #cccccc;">)) {</span></div><div><span style="color: #cccccc;">                       </span></div><div><span style="color: #cccccc;">    </span><span style="color: #dcdcaa;">amdgpu_device_set_pg_state</span><span style="color: #cccccc;">(</span><span style="color: #9cdcfe;">adev</span><span style="color: #cccccc;">, </span><span style="color: #4fc1ff;">AMD_PG_STATE_UNGATE</span><span style="color: #cccccc;">);</span></div><div><span style="color: #cccccc;">    </span><span style="color: #dcdcaa;">amdgpu_device_set_cg_state</span><span style="color: #cccccc;">(</span><span style="color: #9cdcfe;">adev</span><span style="color: #cccccc;">, </span><span style="color: #4fc1ff;">AMD_CG_STATE_UNGATE</span><span style="color: #cccccc;">);</span></div><div><span style="color: #cccccc;">                       </span></div><div><span style="color: #cccccc;">    </span><span style="color: #dcdcaa;">amdgpu_reset_reg_dumps</span><span style="color: #cccccc;">(</span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">);</span></div>
<div><span style="color: #cccccc;">    </span><span style="color: #dcdcaa;">dev_info</span><span style="color: #cccccc;">(</span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">dev</span><span style="color: #cccccc;">, </span><span style="color: #ce9178;">"Dumping IP State</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span><span style="color: #cccccc;">);</span></div><div><span style="color: #6a9955;">    /* Trigger ip dump before we reset the asic */</span></div><div><span style="color: #cccccc;">    </span><span style="color: #c586c0;">for</span><span style="color: #cccccc;"> (</span><span style="color: #9cdcfe;">i</span><span style="color: #cccccc;"> </span><span style="color: #d4d4d4;">=</span><span style="color: #cccccc;"> </span><span style="color: #b5cea8;">0</span><span style="color: #cccccc;">; </span><span style="color: #9cdcfe;">i</span><span style="color: #cccccc;"> </span><span style="color: #d4d4d4;"><</span><span style="color: #cccccc;"> </span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">num_ip_blocks</span><span style="color: #cccccc;">; </span><span style="color: #9cdcfe;">i</span><span style="color: #d4d4d4;">++</span><span style="color: #cccccc;">)</span></div><div><span style="color: #cccccc;">         </span><span style="color: #c586c0;">if</span><span style="color: #cccccc;"> (</span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">ip_blocks</span><span style="color: #cccccc;">[</span><span style="color: #9cdcfe;">i</span><span style="color: #cccccc;">].</span><span style="color: #9cdcfe;">version</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">funcs</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">dump_ip_state</span><span style="color: #cccccc;">)</span></div><div><span style="color: #cccccc;">             </span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">ip_blocks</span><span style="color: #cccccc;">[</span><span style="color: #9cdcfe;">i</span><span style="color: #cccccc;">].</span><span style="color: #9cdcfe;">version</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">funcs</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">dump_ip_state</span><span style="color: #cccccc;">(</span></div><div><span style="color: #cccccc;">                                    (</span><span style="color: #569cd6;">void</span><span style="color: #cccccc;"> </span><span style="color: #d4d4d4;">*</span><span style="color: #cccccc;">)</span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">);</span></div><div><span style="color: #cccccc;">    </span><span style="color: #dcdcaa;">dev_info</span><span style="color: #cccccc;">(</span><span style="color: #9cdcfe;">tmp_adev</span><span style="color: #cccccc;">-></span><span style="color: #9cdcfe;">dev</span><span style="color: #cccccc;">, </span><span style="color: #ce9178;">"Dumping IP State Completed</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span><span style="color: #cccccc;">);</span></div><div><span style="color: #cccccc;">}</span></div><div><span style="color: #cccccc;">
</span></div><div><span style="color: #cccccc;">If this sounds fine with you i am update that.

Regards
Sunil Khatri
</span></div></div>
    <blockquote type="cite" cite="mid:f967ce91-dd88-4542-8340-1e61813eb780@amd.com">
      <br>
      <blockquote type="cite">
        <br>
        Thanks,
        <br>
        Lijo
        <br>
        <br>
        <blockquote type="cite">
          <blockquote type="cite">+
            <br>
            +              
            trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
            <br>
            +                                           
            adev->reset_info.reset_dump_reg_value[i]);
            <br>
            +       }
            <br>
            +
            <br>
            +       return 0;
            <br>
            +}
            <br>
            +
            <br>
              int amdgpu_device_pre_asic_reset(struct amdgpu_device
            *adev,
            <br>
                                              struct
            amdgpu_reset_context *reset_context)
            <br>
              {
            <br>
                     int i, r = 0;
            <br>
                     struct amdgpu_job *job = NULL;
            <br>
            +       struct amdgpu_device *tmp_adev =
            reset_context->reset_req_dev;
            <br>
                     bool need_full_reset =
            <br>
                             test_bit(AMDGPU_NEED_FULL_RESET,
            &reset_context->flags);
            <br>
            <br>
            @@ -5340,6 +5358,18 @@ int
            amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
            <br>
                                     }
            <br>
                             }
            <br>
            <br>
            +               if (!test_bit(AMDGPU_SKIP_COREDUMP,
            &reset_context->flags)) {
            <br>
            +                       amdgpu_reset_reg_dumps(tmp_adev);
            <br>
            +
            <br>
            +                       dev_info(tmp_adev->dev, "Dumping
            IP State\n");
            <br>
            +                       /* Trigger ip dump before we reset
            the asic */
            <br>
            +                       for (i = 0; i <
            tmp_adev->num_ip_blocks; i++)
            <br>
            +                               if
            (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
            <br>
            +                                      
            tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
            <br>
            +                                                      
            (void *)tmp_adev);
            <br>
            +                       dev_info(tmp_adev->dev, "Dumping
            IP State Completed\n");
            <br>
            +               }
            <br>
            +
            <br>
                             if (need_full_reset)
            <br>
                                     r = amdgpu_device_ip_suspend(adev);
            <br>
                             if (need_full_reset)
            <br>
            @@ -5352,47 +5382,17 @@ int
            amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
            <br>
                     return r;
            <br>
              }
            <br>
            <br>
            -static int amdgpu_reset_reg_dumps(struct amdgpu_device
            *adev)
            <br>
            -{
            <br>
            -       int i;
            <br>
            -
            <br>
            -      
            lockdep_assert_held(&adev->reset_domain->sem);
            <br>
            -
            <br>
            -       for (i = 0; i < adev->reset_info.num_regs;
            i++) {
            <br>
            -               adev->reset_info.reset_dump_reg_value[i]
            =
            <br>
            -                      
            RREG32(adev->reset_info.reset_dump_reg_list[i]);
            <br>
            -
            <br>
            -              
            trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
            <br>
            -                                           
            adev->reset_info.reset_dump_reg_value[i]);
            <br>
            -       }
            <br>
            -
            <br>
            -       return 0;
            <br>
            -}
            <br>
            -
            <br>
              int amdgpu_do_asic_reset(struct list_head
            *device_list_handle,
            <br>
                                      struct amdgpu_reset_context
            *reset_context)
            <br>
              {
            <br>
                     struct amdgpu_device *tmp_adev = NULL;
            <br>
                     bool need_full_reset, skip_hw_reset, vram_lost =
            false;
            <br>
                     int r = 0;
            <br>
            -       uint32_t i;
            <br>
            <br>
                     /* Try reset handler method first */
            <br>
                     tmp_adev = list_first_entry(device_list_handle,
            struct amdgpu_device,
            <br>
                                                 reset_list);
            <br>
            <br>
            -       if (!test_bit(AMDGPU_SKIP_COREDUMP,
            &reset_context->flags)) {
            <br>
            -               amdgpu_reset_reg_dumps(tmp_adev);
            <br>
            -
            <br>
            -               dev_info(tmp_adev->dev, "Dumping IP
            State\n");
            <br>
            -               /* Trigger ip dump before we reset the asic
            */
            <br>
            -               for (i = 0; i <
            tmp_adev->num_ip_blocks; i++)
            <br>
            -                       if
            (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
            <br>
            -                              
            tmp_adev->ip_blocks[i].version->funcs
            <br>
            -                               ->dump_ip_state((void
            *)tmp_adev);
            <br>
            -               dev_info(tmp_adev->dev, "Dumping IP State
            Completed\n");
            <br>
            -       }
            <br>
            -
            <br>
                     reset_context->reset_device_list =
            device_list_handle;
            <br>
                     r = amdgpu_reset_perform_reset(tmp_adev,
            reset_context);
            <br>
                     /* If reset handler not implemented, continue;
            otherwise return */
            <br>
            --
            <br>
            2.34.1
            <br>
            <br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
  </body>
</html>