[PATCH 3/3] drm/amdgpu: add AMDGPURESET uevent on AMD GPU reset
Sharma, Shashank
shashank.sharma at amd.com
Wed Jan 12 13:00:46 UTC 2022
On 1/11/2022 12:26 PM, Christian König wrote:
> Am 11.01.22 um 08:12 schrieb Somalapuram Amaranath:
>> AMDGPURESET uevent added to notify userspace,
>> collect dump_stack and amdgpu_reset_reg_dumps
>>
>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/nv.c | 31 +++++++++++++++++++++++++++++++
>> 1 file changed, 31 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 2ec1ffb36b1f..41a2c37e825f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -529,10 +529,41 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>> }
>> }
>> +/**
>> + * drm_sysfs_reset_event - generate a DRM uevent
>> + * @dev: DRM device
>> + *
>> + * Send a uevent for the DRM device specified by @dev. Currently we only
>> + * set AMDGPURESET=1 in the uevent environment, but this could be
>> + * expanded to deal with other types of events.
>> + *
>> + * Any new uapi should be using the drm_sysfs_connector_status_event()
>> + * for uevents on connector status change.
>> + */
>> +void drm_sysfs_reset_event(struct drm_device *dev)
>> +{
>> +	char *event_string = "AMDGPURESET=1";
>> +	char *envp[2] = { event_string, NULL };
>> +
>> +	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>
> That won't work like this.
>
> kobject_uevent_env() needs to allocate memory to send the event to
> userspace and that is not allowed while we do a reset. The Intel guys
> fell into the same trap already.
>
> What we could maybe do is to teach kobject_uevent_env() to take gfp
> flags and make all allocations from the atomic pool.
>
> Regards,
> Christian.
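(Side note on Christian's point above: kobject_uevent_env() allocates its
environment buffer internally with GFP_KERNEL, so a gfp-aware entry point
would have to be added first. Nothing like the helper below exists upstream
today; its name, signature and body are made up purely to sketch the idea.)

int kobject_uevent_env_gfp(struct kobject *kobj, enum kobject_action action,
			   char *envp[], gfp_t gfp)
{
	/*
	 * Same logic as kobject_uevent_env(), except that every allocation
	 * on the path would use @gfp instead of GFP_KERNEL, so the GPU
	 * reset code could pass GFP_ATOMIC and draw from the atomic pool.
	 */
	return -ENOSYS;	/* placeholder only, not an implementation */
}

/* The reset path would then send the event roughly like this: */
kobject_uevent_env_gfp(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp, GFP_ATOMIC);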
Hi Amar,
I see another problem here:
We are sending the event at GPU reset time, but we only collect the
register values when the corresponding userspace agent calls read() on
the respective sysfs entry.
There is a very real possibility that the registers have already been
reset by the HW by then, and we end up reading re-programmed values. At
the very least there is a race condition.
I think we should change the design along these lines (a rough sketch
follows the list):
1. Get into gpu_reset().
2. Collect the register values and save this context into a separate
file/node. Sending a trace_event here is probably the easiest way.
3. Send the DRM event to the userspace client.
4. The client reads from the trace file and gets the data.
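A very rough sketch of how amdgpu_reset_dumps() from this patch could be
reordered to follow that sequence. Step 1 is the existing call from
nv_asic_reset(); the register-dump call uses a guessed signature for the
amdgpu_reset_reg_dumps helper named in the commit message, and the trace
event name is entirely hypothetical:

static void amdgpu_reset_dumps(struct amdgpu_device *adev)
{
	/*
	 * 2. Collect the register values while they still hold the
	 *    pre-reset state and persist them, e.g. through a trace
	 *    event, instead of waiting for a later sysfs read().
	 */
	amdgpu_reset_reg_dumps(adev);		/* dump helper from this series, guessed signature */
	trace_amdgpu_reset_reg_dump(adev);	/* hypothetical trace event */

	/* 3. Only now tell userspace that a reset is happening. */
	drm_sysfs_reset_event(adev_to_drm(adev));

	/* 4. The client wakes up on the event and reads the trace file. */
}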
- Shashank
>
>> +}
>> +
>> +void amdgpu_reset_dumps(struct amdgpu_device *adev)
>> +{
>> +	struct drm_device *ddev = adev_to_drm(adev);
>> +	/* original raven doesn't have full asic reset */
>> +	if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>> +	    !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>> +		return;
>> +	drm_sysfs_reset_event(ddev);
>> +	dump_stack();
>> +}
>> +
>> static int nv_asic_reset(struct amdgpu_device *adev)
>> {
>>  	int ret = 0;
>> +	amdgpu_reset_dumps(adev);
>>  	switch (nv_asic_reset_method(adev)) {
>>  	case AMD_RESET_METHOD_PCI:
>>  		dev_info(adev->dev, "PCI reset\n");
>