[RESEND PATCH 4/5] Subject: drm/amdgpu: Redo XGMI reset synchronization.
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Thu Dec 12 15:20:05 UTC 2019
On 12/11/19 11:05 PM, Ma, Le wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> -----Original Message-----
> From: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> Sent: Thursday, December 12, 2019 4:39 AM
> To: dri-devel at lists.freedesktop.org; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Ma, Le
> <Le.Ma at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Quan, Evan
> <Evan.Quan at amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Subject: [RESEND PATCH 4/5] Subject: drm/amdgpu: Redo XGMI reset
> synchronization.
>
> Use task barrier in XGMI hive to synchronize ASIC resets across
> devices in XGMI hive.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com
> <mailto:andrey.grodzovsky at amd.com>>
>
> ---
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42
> +++++++++++++++++++++++++-----
>
> 1 file changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
> index 1d19edfa..e4089a0 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>
> @@ -67,6 +67,7 @@
>
> #include "amdgpu_tmz.h"
>
> #include <linux/suspend.h>
>
> +#include <drm/task_barrier.h>
>
> MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
>
> MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
>
> @@ -2663,14 +2664,43 @@ static void
> amdgpu_device_xgmi_reset_func(struct work_struct *__work) {
>
> struct amdgpu_device *adev =
>
> container_of(__work, struct amdgpu_device, xgmi_reset_work);
>
> + struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev, 0);
>
> - if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO)
>
> - adev->asic_reset_res = (adev->in_baco == false) ?
>
> - amdgpu_device_baco_enter(adev->ddev) :
>
> - qamdgpu_device_baco_exit(adev->ddev);
>
> - else
>
> - adev->asic_reset_res = amdgpu_asic_reset(adev);
>
> + /*
>
> + * Use task barrier to synchronize all xgmi reset works
> across the
>
> + * hive.
>
> + * task_barrier_enter and task_barrier_exit will block
> untill all the
>
> + * threads running the xgmi reset works reach those points.
> I assume
>
> + * guarantee of progress here for all the threads as the
> workqueue code
>
> + * creates new worker threads as needed by amount of work
> items in queue
>
> + * (see worker_thread) and also each thread sleeps in the
> barrir and by
>
> + * this yielding the CPU for other work threads to make
> progress.
>
> + */
>
> [Le]: This comments can be adjusted since we switch to
> system_unbound_wq in patch #5.
>
> + if (amdgpu_asic_reset_method(adev) == AMD_RESET_METHOD_BACO) {
>
> +
>
> + if (hive)
>
> + task_barrier_enter(&hive->tb);
>
> [Le]: The multiple hive condition can be checked only once and moved
> to the location right after the assignment.
>
Not sure what you meant here but in fact let's note that while in
amdgpu_device_xgmi_reset_func it's a bug for amdgpu_get_xgmi_hive to
return NULL so I think better instead to add WARN_ON(!hive,"...") and
return right at the beginning of the function if indeed hive == NULL
Andrey
> +
>
> + adev->asic_reset_res = amdgpu_device_baco_enter(adev->ddev);
>
> +
>
> + if (adev->asic_reset_res)
>
> + goto fail;
>
> +
>
> + if (hive)
>
> + task_barrier_exit(&hive->tb);
>
> [Le]: Same as above.
>
> +
>
> + adev->asic_reset_res = amdgpu_device_baco_exit(adev->ddev);
>
> +
>
> + if (adev->asic_reset_res)
>
> + goto fail;
>
> + } else {
>
> + if (hive)
>
> + task_barrier_full(&hive->tb);
>
> [Le]: Same as above.
>
> With above addressed, Reviewed-by: Le Ma <Le.Ma at amd.com
> <mailto:Le.Ma at amd.com>>
>
> Regards,
>
> Ma Le
>
> +
>
> + adev->asic_reset_res = amdgpu_asic_reset(adev);
>
> + }
>
> +fail:
>
> if (adev->asic_reset_res)
>
> DRM_WARN("ASIC reset failed with error, %d for
> drm dev, %s",
>
> adev->asic_reset_res, adev->ddev->unique);
>
> --
>
> 2.7.4
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20191212/d7d83df0/attachment-0001.htm>
More information about the amd-gfx
mailing list