[PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device

Fri Dec 10 15:45:50 UTC 2021

On 2021-12-09 10:47 p.m., Lang Yu wrote:
> On 12/09/ , Christian KKKnig wrote:
>> Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
>>> On 2021-12-09 4:00 a.m., Christian König wrote:
>>>>
>>>> Am 09.12.21 um 09:49 schrieb Lang Yu:
>>>>> It is useful to maintain error context when debugging
>>>>> SW/FW issues. We introduce amdgpu_device_halt() for this
>>>>> purpose. It will bring hardware to a kind of halt state,
>>>>> so that no one can touch it any more.
>>>>>
>>>>> Compare to a simple hang, the system will keep stable
>>>>> at least for SSH access. Then it should be trivial to
>>>>> inspect the hardware state and see what's going on.
>>>>>
>>>>> Suggested-by: Christian Koenig <christian.koenig at amd.com>
>>>>> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39
>>>>> ++++++++++++++++++++++
>>>>>    2 files changed, 41 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index c5cfe2926ca1..3f5f8f62aa5c 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct
>>>>> amdgpu_device *adev,
>>>>>    void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>>>>>            struct amdgpu_ring *ring);
>>>>>    +void amdgpu_device_halt(struct amdgpu_device *adev);
>>>>> +
>>>>>    /* atpx handler */
>>>>>    #if defined(CONFIG_VGA_SWITCHEROO)
>>>>>    void amdgpu_register_atpx_handler(void);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index a1c14466f23d..62216627cc83 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct
>>>>> amdgpu_device *adev,
>>>>>          amdgpu_asic_invalidate_hdp(adev, ring);
>>>>>    }
>>>>> +
>>>>> +/**
>>>>> + * amdgpu_device_halt() - bring hardware to some kind of halt state
>>>>> + *
>>>>> + * @adev: amdgpu_device pointer
>>>>> + *
>>>>> + * Bring hardware to some kind of halt state so that no one can
>>>>> touch it
>>>>> + * any more. It will help to maintain error context when error
>>>>> occurred.
>>>>> + * Compare to a simple hang, the system will keep stable at
>>>>> least for SSH
>>>>> + * access. Then it should be trivial to inspect the hardware state and
>>>>> + * see what's going on. Implemented as following:
>>>>> + *
>>>>> + * 1. drm_dev_unplug() makes device inaccessible to user
>>>>> space(IOCTLs, etc),
>>>>> + *    clears all CPU mappings to device, disallows remappings through
>>>>> page faults
>>>>> + * 2. amdgpu_irq_disable_all() disables all interrupts
>>>>> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fencesamdgpu_device_unmap_mmio
>>>>>
>>>>> + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
>>>>> + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
>>>>> + *    flush any in flight DMA operations
>>>>> + * 6. set adev->no_hw_access to true
>>>>> + */
>>>>> +void amdgpu_device_halt(struct amdgpu_device *adev)
>>>>> +{
>>>>> +    struct pci_dev *pdev = adev->pdev;
>>>>> +    struct drm_device *ddev = &adev->ddev;
>>>>> +
>>>>> +    drm_dev_unplug(ddev);
>>>>> +
>>>>> +    amdgpu_irq_disable_all(adev);
>>>>> +
>>>>> +    amdgpu_fence_driver_hw_fini(adev);
>>>>> +
>>>>> +    amdgpu_device_unmap_mmio(adev);
>>>
>>> Note that this one will cause page fault on any subsequent MMIO access
>>> (trough registers or by direct VRAM access)
>>>
>>>
>>>>> +
>>>>> +    pci_disable_device(pdev);
>>>>> +    pci_wait_for_pending_transaction(pdev);
>>>>> +
>>>>> +    adev->no_hw_access = true;
>>>> I think we need to reorder this, e.g. set adev->no_hw_access much
>>>> earlier for example. Andrey what do you think?
>>>
>>> Earlier can be ok but at least after the last HW configuration we
>>> actaully want to do like disabling IRQs.
>> My thinking was to at least do this before we unmap the MMIO to avoid the
>> crash.
>>
>> Additionally to that we maybe don't even want to do this for this case.
> So we just do "adev->no_hw_access = true;" before
> "amdgpu_device_unmap_mmio(adev);".
>
> That can avoid potential registers access page faults.
> But direct VRAM access will still trigger page faults.
>
> For example,
> "cat /sys/class/drm/card0/device/pp_od_clk_voltage"
> will call smu_cmn_update_table and can still trigger
> a page fault.
>
> smu_cmn_update_table()
> {
>   ...
> 	if (drv2smu) {
> 		memcpy(table->cpu_addr, table_data, table_size);
>   ...
> }
>
> Regards,
> Lang

What Christian meant is to just drop amdgpu_device_unmap_mmio, unless 
you are worried
that we might somehow alter the SMU state from the driver side by direct 
VRAM access.

Andrey

>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>> Apart from that sounds like the right idea to me.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> +}