[PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

Thu Dec 9 09:00:51 UTC 2021

Am 09.12.21 um 09:49 schrieb Lang Yu:
> It is useful to maintain error context when debugging
> SW/FW issues. We introduce amdgpu_device_halt() for this
> purpose. It will bring hardware to a kind of halt state,
> so that no one can touch it any more.
>
> Compare to a simple hang, the system will keep stable
> at least for SSH access. Then it should be trivial to
> inspect the hardware state and see what's going on.
>
> Suggested-by: Christian Koenig <christian.koenig at amd.com>
> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> Signed-off-by: Lang Yu <lang.yu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
>   2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c5cfe2926ca1..3f5f8f62aa5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
>   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>   		struct amdgpu_ring *ring);
>   
> +void amdgpu_device_halt(struct amdgpu_device *adev);
> +
>   /* atpx handler */
>   #if defined(CONFIG_VGA_SWITCHEROO)
>   void amdgpu_register_atpx_handler(void);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a1c14466f23d..62216627cc83 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>   
>   	amdgpu_asic_invalidate_hdp(adev, ring);
>   }
> +
> +/**
> + * amdgpu_device_halt() - bring hardware to some kind of halt state
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Bring hardware to some kind of halt state so that no one can touch it
> + * any more. It will help to maintain error context when error occurred.
> + * Compare to a simple hang, the system will keep stable at least for SSH
> + * access. Then it should be trivial to inspect the hardware state and
> + * see what's going on. Implemented as following:
> + *
> + * 1. drm_dev_unplug() makes device inaccessible to user space(IOCTLs, etc),
> + *    clears all CPU mappings to device, disallows remappings through page faults
> + * 2. amdgpu_irq_disable_all() disables all interrupts
> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
> + *    flush any in flight DMA operations
> + * 6. set adev->no_hw_access to true
> + */
> +void amdgpu_device_halt(struct amdgpu_device *adev)
> +{
> +	struct pci_dev *pdev = adev->pdev;
> +	struct drm_device *ddev = &adev->ddev;
> +
> +	drm_dev_unplug(ddev);
> +
> +	amdgpu_irq_disable_all(adev);
> +
> +	amdgpu_fence_driver_hw_fini(adev);
> +
> +	amdgpu_device_unmap_mmio(adev);
> +
> +	pci_disable_device(pdev);
> +	pci_wait_for_pending_transaction(pdev);
> +
> +	adev->no_hw_access = true;

I think we need to reorder this, e.g. set adev->no_hw_access much 
earlier for example. Andrey what do you think?

Apart from that sounds like the right idea to me.

Regards,
Christian.

> +}