<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2022-03-08 12:20, Somalapuram,
      Amaranath wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:abf6d329-6a3f-26f0-1d5b-75b3ff55acfd@amd.com">
      
      <p><br>
      </p>
      <div class="moz-cite-prefix">On 3/8/2022 10:00 PM, Sharma,
        Shashank wrote:<br>
      </div>
      <blockquote type="cite" cite="mid:bc293ab7-db45-2b16-aeb8-291cffef8ba4@amd.com">Hello
        Andrey <br>
        <br>
        On 3/8/2022 5:26 PM, Andrey Grodzovsky wrote: <br>
        <blockquote type="cite"> <br>
          On 2022-03-07 11:26, Shashank Sharma wrote: <br>
          <blockquote type="cite">From: Shashank Sharma <a class="moz-txt-link-rfc2396E" href="mailto:shashank.sharma@amd.com" moz-do-not-send="true"><shashank.sharma@amd.com></a>
            <br>
            <br>
            This patch adds a work function that gets scheduled <br>
            in the event of a GPU reset and sends a uevent to userspace with
            <br>
            some reset context information, like a PID and some flags. <br>
          </blockquote>
          <br>
          <br>
          Where is the work function actually scheduled?
          Shouldn't <br>
          there be a patch for that too? <br>
          <br>
        </blockquote>
        <br>
        Yes, Amar is working on that patch, on top of these patches.
        It should be out soon. I thought it was a good idea to get
        quick feedback on the basic patches before we build something on
        top of them. <br>
        <br>
      </blockquote>
      <p>schedule_work() will be called in the function
        amdgpu_do_asic_reset() <br>
      </p>
    </blockquote>
    <p><br>
    </p>
    <p>I haven't followed the requirements closely, so I may be missing
      something, but what about<br>
      a job timeout that was handled by soft recovery: do you need to
      cover that case too? Or<br>
      is there no need to restart the user application in that case, so
      you don't care?</p>
    <p>Andrey</p>
    <p><br>
    </p>
    <blockquote type="cite" cite="mid:abf6d329-6a3f-26f0-1d5b-75b3ff55acfd@amd.com">
      <p> </p>
      <p>after getting the vram_lost info:<br>
      </p>
      <p>vram_lost = amdgpu_device_check_vram_lost(tmp_adev);</p>
      <p>Then update amdgpu_reset_event_ctx from the following and call schedule_work():</p>
      <ul>
        <li>vram_lost</li>
        <li>reset_context->job->vm->task_info.process_name</li>
        <li>reset_context->job->vm->task_info.pid</li>
      </ul>
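      <p>A minimal userspace sketch of the flow listed above (record the
        reset context, then schedule the work), not the actual driver
        change. Everything except struct amdgpu_reset_event_ctx is an
        assumption made for illustration: struct fake_adev stands in for
        struct amdgpu_device, fake_schedule_reset_event() stands in for
        schedule_work(), and RESET_EVENT_FLAG_VRAM_LOST is a hypothetical
        flag bit, since the patch as posted defines no flag values. The
        context struct as posted has no field for process_name, so it is
        left out here.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct added by the patch. */
struct amdgpu_reset_event_ctx {
    uint64_t pid;
    uint32_t flags;
};

/* Hypothetical flag bit; the patch as posted does not define flag values. */
#define RESET_EVENT_FLAG_VRAM_LOST (1u << 0)

/* Stand-in for struct amdgpu_device, reduced to what this sketch needs. */
struct fake_adev {
    struct amdgpu_reset_event_ctx reset_event_ctx;
    int events_scheduled; /* counts calls to the schedule_work() stand-in */
};

/* Stand-in for schedule_work(&adev->gpu_reset_event_work). */
static void fake_schedule_reset_event(struct fake_adev *adev)
{
    adev->events_scheduled++;
}

/*
 * Record the reset context first, then schedule the work, so the work
 * function sees a fully populated context when it runs.
 */
static void fill_and_schedule_reset_event(struct fake_adev *adev,
                                          uint64_t pid, bool vram_lost)
{
    assert(adev != NULL);

    adev->reset_event_ctx.pid = pid;
    adev->reset_event_ctx.flags = vram_lost ? RESET_EVENT_FLAG_VRAM_LOST : 0;

    fake_schedule_reset_event(adev);
}
```

      <p>In this sketch, calling fill_and_schedule_reset_event(&adev,
        1234, true) stores 1234 in pid and sets the hypothetical
        VRAM-lost bit in flags before the work is scheduled.</p>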
      Regards,<br>
      S.Amarnath<br>
      <blockquote type="cite" cite="mid:bc293ab7-db45-2b16-aeb8-291cffef8ba4@amd.com">-
        Shashank <br>
        <br>
        <blockquote type="cite">Andrey <br>
          <br>
          <br>
          <blockquote type="cite"> <br>
            The userspace can do some recovery and post-processing work
            <br>
            based on this event. <br>
            <br>
            V2: <br>
            - Changed the name of the work to gpu_reset_event_work <br>
               (Christian) <br>
            - Added a structure to accommodate some additional
            information <br>
               (like a PID and some flags) <br>
            <br>
            Cc: Alexander Deucher <a class="moz-txt-link-rfc2396E" href="mailto:alexander.deucher@amd.com" moz-do-not-send="true"><alexander.deucher@amd.com></a>
            <br>
            Cc: Christian Koenig <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com" moz-do-not-send="true"><christian.koenig@amd.com></a>
            <br>
            Signed-off-by: Shashank Sharma <a class="moz-txt-link-rfc2396E" href="mailto:shashank.sharma@amd.com" moz-do-not-send="true"><shashank.sharma@amd.com></a>
            <br>
            --- <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  7 +++++++ <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19
            +++++++++++++++++++ <br>
              2 files changed, 26 insertions(+) <br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
            b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
            index d8b854fcbffa..7df219fe363f 100644 <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h <br>
            @@ -813,6 +813,11 @@ struct amd_powerplay { <br>
              #define AMDGPU_RESET_MAGIC_NUM 64 <br>
              #define AMDGPU_MAX_DF_PERFMONS 4 <br>
              #define AMDGPU_PRODUCT_NAME_LEN 64 <br>
            +struct amdgpu_reset_event_ctx { <br>
            +    uint64_t pid; <br>
            +    uint32_t flags; <br>
            +}; <br>
            + <br>
              struct amdgpu_device { <br>
                  struct device            *dev; <br>
                  struct pci_dev            *pdev; <br>
            @@ -1063,6 +1068,7 @@ struct amdgpu_device { <br>
                  int asic_reset_res; <br>
                  struct work_struct        xgmi_reset_work; <br>
            +    struct work_struct        gpu_reset_event_work; <br>
                  struct list_head        reset_list; <br>
                  long                gfx_timeout; <br>
            @@ -1097,6 +1103,7 @@ struct amdgpu_device { <br>
                  pci_channel_state_t        pci_channel_state; <br>
                  struct amdgpu_reset_control     *reset_cntl; <br>
            +    struct amdgpu_reset_event_ctx   reset_event_ctx; <br>
                  uint32_t                       
            ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE]; <br>
                  bool                ram_is_direct_mapped; <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
            index ed077de426d9..c43d099da06d 100644 <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c <br>
            @@ -73,6 +73,7 @@ <br>
              #include <linux/pm_runtime.h> <br>
              #include <drm/drm_drv.h> <br>
            +#include <drm/drm_sysfs.h> <br>
              MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin"); <br>
              MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin"); <br>
            @@ -3277,6 +3278,23 @@ bool
            amdgpu_device_has_dc_support(struct amdgpu_device *adev) <br>
                  return
            amdgpu_device_asic_has_dc_support(adev->asic_type); <br>
              } <br>
            +static void amdgpu_device_reset_event_func(struct
            work_struct *__work) <br>
            +{ <br>
            +    struct amdgpu_device *adev = container_of(__work,
            struct amdgpu_device, <br>
            +                          gpu_reset_event_work); <br>
            +    struct amdgpu_reset_event_ctx *event_ctx =
            &adev->reset_event_ctx; <br>
            + <br>
            +    /* <br>
            +     * A GPU reset has happened; notify userspace and
            pass the <br>
            +     * following information: <br>
            +     *    - pid of the process involved, <br>
            +     *    - whether the VRAM is valid or not, <br>
            +     *    - a hint that userspace may want to collect the
            ftrace event <br>
            +     *      data from the trace event. <br>
            +     */ <br>
            +    drm_sysfs_reset_event(&adev->ddev,
            event_ctx->pid, event_ctx->flags); <br>
            +} <br>
            + <br>
              static void amdgpu_device_xgmi_reset_func(struct
            work_struct *__work) <br>
              { <br>
                  struct amdgpu_device *adev = <br>
            @@ -3525,6 +3543,7 @@ int amdgpu_device_init(struct
            amdgpu_device *adev, <br>
                            amdgpu_device_delay_enable_gfx_off); <br>
                  INIT_WORK(&adev->xgmi_reset_work,
            amdgpu_device_xgmi_reset_func); <br>
            +    INIT_WORK(&adev->gpu_reset_event_work,
            amdgpu_device_reset_event_func); <br>
                  adev->gfx.gfx_off_req_count = 1; <br>
                  adev->pm.ac_power =
            power_supply_is_system_supplied() > 0; <br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
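    <p>For reference, a small userspace sketch of the pattern the quoted
      work function relies on: the work item is embedded in the device
      struct, and the handler recovers the device from the work pointer
      with container_of(). This is a sketch under stated assumptions,
      not kernel code: struct fake_adev, the reduced struct work_struct,
      fake_init_work() and fake_run_work() are stand-ins for struct
      amdgpu_device, the kernel's work_struct, INIT_WORK() and the
      workqueue, and the last_pid/last_flags fields stand in for the
      drm_sysfs_reset_event() call.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's container_of(), built on offsetof(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

/* Reduced stand-in for the kernel's struct work_struct. */
struct work_struct {
    work_func_t func;
};

/* Mirrors the struct added by the patch. */
struct amdgpu_reset_event_ctx {
    uint64_t pid;
    uint32_t flags;
};

/* Stand-in for struct amdgpu_device. */
struct fake_adev {
    int asic_reset_res;
    struct work_struct gpu_reset_event_work;       /* embedded work item */
    struct amdgpu_reset_event_ctx reset_event_ctx; /* filled before scheduling */
    uint64_t last_pid;   /* what the handler "sent", in place of a uevent */
    uint32_t last_flags;
};

/* Same shape as amdgpu_device_reset_event_func() in the patch. */
static void fake_reset_event_func(struct work_struct *work)
{
    struct fake_adev *adev =
        container_of(work, struct fake_adev, gpu_reset_event_work);
    struct amdgpu_reset_event_ctx *ctx = &adev->reset_event_ctx;

    /* The real function calls drm_sysfs_reset_event() here. */
    adev->last_pid = ctx->pid;
    adev->last_flags = ctx->flags;
}

/* Stand-in for INIT_WORK(): bind the handler to the work item. */
static void fake_init_work(struct work_struct *work, work_func_t func)
{
    work->func = func;
}

/* Stand-in for the workqueue: run the handler synchronously. */
static void fake_run_work(struct work_struct *work)
{
    assert(work != NULL && work->func != NULL);
    work->func(work);
}
```

    <p>Because the handler only receives the work pointer, anything it
      needs to report has to live in the enclosing device struct, which
      is why the patch adds reset_event_ctx next to gpu_reset_event_work.</p>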
  </body>
</html>