[PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

Thu Dec 9 15:38:03 UTC 2021

On 2021-12-09 4:00 a.m., Christian König wrote:
>
>
> On 09.12.21 at 09:49, Lang Yu wrote:
>> It is useful to maintain error context when debugging
>> SW/FW issues. We introduce amdgpu_device_halt() for this
>> purpose. It will bring hardware to a kind of halt state,
>> so that no one can touch it any more.
>>
>> Compared to a simple hang, the system will stay stable
>> at least for SSH access. Then it should be trivial to
>> inspect the hardware state and see what's going on.
>>
>> Suggested-by: Christian Koenig <christian.koenig at amd.com>
>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index c5cfe2926ca1..3f5f8f62aa5c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
>>   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>           struct amdgpu_ring *ring);
>> +void amdgpu_device_halt(struct amdgpu_device *adev);
>> +
>>   /* atpx handler */
>>   #if defined(CONFIG_VGA_SWITCHEROO)
>>   void amdgpu_register_atpx_handler(void);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a1c14466f23d..62216627cc83 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>         amdgpu_asic_invalidate_hdp(adev, ring);
>>   }
>> +
>> +/**
>> + * amdgpu_device_halt() - bring hardware to some kind of halt state
>> + *
>> + * @adev: amdgpu_device pointer
>> + *
>> + * Bring hardware to some kind of halt state so that no one can touch it
>> + * any more. It will help to maintain error context when an error occurs.
>> + * Compared to a simple hang, the system will stay stable at least for SSH
>> + * access. Then it should be trivial to inspect the hardware state and
>> + * see what's going on. Implemented as follows:
>> + *
>> + * 1. drm_dev_unplug() makes device inaccessible to user space (IOCTLs, etc.),
>> + *    clears all CPU mappings to device, disallows remappings through page faults
>> + * 2. amdgpu_irq_disable_all() disables all interrupts
>> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
>> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
>> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
>> + *    flush any in flight DMA operations
>> + * 6. set adev->no_hw_access to true
>> + */
>> +void amdgpu_device_halt(struct amdgpu_device *adev)
>> +{
>> +    struct pci_dev *pdev = adev->pdev;
>> +    struct drm_device *ddev = &adev->ddev;
>> +
>> +    drm_dev_unplug(ddev);
>> +
>> +    amdgpu_irq_disable_all(adev);
>> +
>> +    amdgpu_fence_driver_hw_fini(adev);
>> +
>> +    amdgpu_device_unmap_mmio(adev);

Note that this one will cause a page fault on any subsequent MMIO access
(through registers or by direct VRAM access).
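
To illustrate the hazard, here is a userspace mock (not driver code;
mock_dev, mock_unmap_mmio and mock_rreg32 are made-up names, and clearing
the pointer on unmap is an assumption of the sketch). Once the mapping is
gone, the accessor has to bail out on a flag such as no_hw_access instead
of dereferencing the stale mapping:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the device; not the real struct amdgpu_device. */
struct mock_dev {
	volatile uint32_t *rmmio;	/* register mapping, NULL once unmapped */
	bool no_hw_access;		/* mirrors adev->no_hw_access */
};

/* Mock of the unmap step: the mapping is dropped and the pointer cleared. */
static void mock_unmap_mmio(struct mock_dev *dev)
{
	dev->rmmio = NULL;
}

/*
 * Guarded register read: returns false without touching the mapping
 * when access is blocked or the mapping is gone.
 */
static bool mock_rreg32(const struct mock_dev *dev, size_t idx, uint32_t *val)
{
	assert(val != NULL);

	if (dev->no_hw_access || dev->rmmio == NULL)
		return false;

	*val = dev->rmmio[idx];
	return true;
}
```

With the flag set before the unmap, late callers take the early return
rather than faulting.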

>>
>> +
>> +    pci_disable_device(pdev);
>> +    pci_wait_for_pending_transaction(pdev);
>> +
>> +    adev->no_hw_access = true;
>
> I think we need to reorder this, e.g. set adev->no_hw_access much
> earlier. Andrey, what do you think?

Earlier can be OK, but it should at least come after the last HW
configuration we actually want to do, such as disabling IRQs.
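
A minimal userspace sketch of that ordering (a mock, not driver code;
mock_adev, mock_wreg, mock_irq_disable_all and mock_device_halt are
hypothetical names, and it assumes register writes are skipped once
no_hw_access is set). IRQs are disabled while register access still
works, and only then is the flag raised:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct amdgpu_device; only what the sketch needs. */
struct mock_adev {
	bool no_hw_access;	/* mirrors adev->no_hw_access */
	bool irqs_enabled;
	bool mmio_mapped;
	int reg_writes;		/* register writes that reached the "hardware" */
};

/* Register write that is silently dropped once no_hw_access is set. */
static bool mock_wreg(struct mock_adev *adev)
{
	if (adev->no_hw_access)
		return false;

	adev->reg_writes++;
	return true;
}

/* Disabling IRQs programs the hardware, so it needs working register access. */
static bool mock_irq_disable_all(struct mock_adev *adev)
{
	if (!mock_wreg(adev))
		return false;

	adev->irqs_enabled = false;
	return true;
}

static void mock_device_halt(struct mock_adev *adev)
{
	/* Last HW configuration first, while register access is still allowed. */
	bool ok = mock_irq_disable_all(adev);

	assert(ok);
	(void)ok;

	/* From here on every register access is rejected ... */
	adev->no_hw_access = true;

	/* ... so dropping the mapping afterwards cannot be raced by a late write. */
	adev->mmio_mapped = false;
}
```

If the flag were set before mock_irq_disable_all(), the write would be
dropped and the interrupts would stay enabled.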

Andrey

>
> Apart from that, this sounds like the right idea to me.
>
> Regards,
> Christian.
>
>> +}
>