[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4

Wed May 6 03:45:13 UTC 2020

[AMD Official Use Only - Internal Distribution Only]

Hi Christian,

> Hi Jiange, well that looks correct to me, but seems to be a bit to complicated. What exactly was wrong with version 3?
(1) If you open amdgpu_autodump, use it and close it, then you can't open it again, because wait_for_completion_interruptible_timeout() would decrement amdgpu_autodump.dumping.done to 0, then .open() would always return failure except the first time.
(2) reset lock is not optimal in this case. Because usermode app would take any operation at any time and there are so many race conditions to solve. A dedicated lock would be simpler and clearer.

> Please completely drop this extra check. Waking up the queue and waiting for completion should always work when done right.
This check is very necessary, because if there is no usermode app listening, the following wait_for_completion_interruptible_timeout() would wait until timeout anyway, which is 10 minutes for nothing. This is not what we wanted.

Jiange

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com> 
Sent: Wednesday, April 29, 2020 10:09 PM
To: Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer at amd.com>; Zhao, Jiange <Jiange.Zhao at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Kuehling, Felix <Felix.Kuehling at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4

Am 29.04.20 um 16:04 schrieb Pierre-Eric Pelloux-Prayer:
> Hi Jiange,
>
> This version seems to work fine.
>
> Tested-by: Pierre-Eric Pelloux-Prayer 
> <pierre-eric.pelloux-prayer at amd.com>
>
>
> On 29/04/2020 07:08, Zhao, Jiange wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>>
>> Hi all,
>>
>> I worked out the race condition and here is version 5. Please have a look.
>>
>> Jiange
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ---------------------------------------------------------------------
>> ------------------------
>> *From:* Zhao, Jiange <Jiange.Zhao at amd.com>
>> *Sent:* Wednesday, April 29, 2020 1:06 PM
>> *To:* amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *Cc:* Koenig, Christian <Christian.Koenig at amd.com>; Kuehling, Felix 
>> <Felix.Kuehling at amd.com>; Pelloux-prayer, Pierre-eric 
>> <Pierre-eric.Pelloux-prayer at amd.com>; Deucher, Alexander 
>> <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; 
>> Liu, Monk <Monk.Liu at amd.com>; Zhao, Jiange <Jiange.Zhao at amd.com>
>> *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu 
>> reset v4
>>   
>> From: Jiange Zhao <Jiange.Zhao at amd.com>
>>
>> When GPU got timeout, it would notify an interested part of an 
>> opportunity to dump info before actual GPU reset.
>>
>> A usermode app would open 'autodump' node under debugfs system and 
>> poll() for readable/writable. When a GPU reset is due, amdgpu would 
>> notify usermode app through wait_queue_head and give it 10 minutes to 
>> dump info.
>>
>> After usermode app has done its work, this 'autodump' node is closed.
>> On node closure, amdgpu gets to know the dump is done through the 
>> completion that is triggered in release().
>>
>> There is no write or read callback because necessary info can be 
>> obtained through dmesg and umr. Messages back and forth between 
>> usermode app and amdgpu are unnecessary.
>>
>> v2: (1) changed 'registered' to 'app_listening'
>>      (2) add a mutex in open() to prevent race condition
>>
>> v3 (chk): grab the reset lock to avoid race in autodump_open,
>>            rename debugfs file to amdgpu_autodump,
>>            provide autodump_read as well,
>>            style and code cleanups
>>
>> v4: add 'bool app_listening' to differentiate situations, so that
>>      the node can be reopened; also, there is no need to wait for
>>      completion when no app is waiting for a dump.
>>
>> v5: change 'bool app_listening' to 'enum amdgpu_autodump_state'
>>      add 'app_state_mutex' for race conditions:
>>          (1)Only 1 user can open this file node
>>          (2)wait_dump() can only take effect after poll() executed.
>>          (3)eliminated the race condition between release() and
>>             wait_dump()

Hi Jiange, well that looks correct to me, but seems to be a bit to complicated. What exactly was wrong with version 3?

One more comment below.

>>
>> Signed-off-by: Jiange Zhao <Jiange.Zhao at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  2 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 92 
>> ++++++++++++++++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 14 ++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
>>   4 files changed, 109 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index bc1e0fd71a09..6f8ef98c4b97 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -990,6 +990,8 @@ struct amdgpu_device {
>>           char                            product_number[16];
>>           char                            product_name[32];
>>           char                            serial[16];
>> +
>> +       struct amdgpu_autodump          autodump;
>>   };
>>   
>>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct 
>> ttm_bo_device *bdev) diff --git 
>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 1a4894fa3693..1d4a95e8ad5b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -27,7 +27,7 @@
>>   #include <linux/pci.h>
>>   #include <linux/uaccess.h>
>>   #include <linux/pm_runtime.h>
>> -
>> +#include <linux/poll.h>
>>   #include <drm/drm_debugfs.h>
>>   
>>   #include "amdgpu.h"
>> @@ -74,8 +74,96 @@ int amdgpu_debugfs_add_files(struct amdgpu_device 
>> *adev,
>>           return 0;
>>   }
>>   
>> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) { #if 
>> +defined(CONFIG_DEBUG_FS)
>> +       unsigned long timeout = 600 * HZ;
>> +       int ret;
>> +
>> +       mutex_lock(&adev->autodump.app_state_mutex);
>> +       if (adev->autodump.app_state != AMDGPU_AUTODUMP_LISTENING) {
>> +               mutex_unlock(&adev->autodump.app_state_mutex);
>> +               return 0;
>> +       }
>> +       mutex_unlock(&adev->autodump.app_state_mutex);

Please completely drop this extra check. Waking up the queue and waiting for completion should always work when done right.

Regards,
Christian.

>> +
>> +       wake_up_interruptible(&adev->autodump.gpu_hang);
>> +
>> +       ret = 
>> +wait_for_completion_interruptible_timeout(&adev->autodump.dumping, 
>> +timeout);
>> +       if (ret == 0) {
>> +               pr_err("autodump: timeout, move on to gpu 
>> +recovery\n");
>> +               return -ETIMEDOUT;
>> +       }
>> +#endif
>> +       return 0;
>> +}
>> +
>>   #if defined(CONFIG_DEBUG_FS)
>>   
>> +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct 
>> +file *file) {
>> +       struct amdgpu_device *adev = inode->i_private;
>> +       int ret;
>> +
>> +       file->private_data = adev;
>> +
>> +       mutex_lock(&adev->autodump.app_state_mutex);
>> +       if (adev->autodump.app_state == AMDGPU_AUTODUMP_NO_APP) {
>> +               adev->autodump.app_state = 
>> +AMDGPU_AUTODUMP_REGISTERED;
>> +               ret = 0;
>> +       } else {
>> +               ret = -EBUSY;
>> +       }
>> +       mutex_unlock(&adev->autodump.app_state_mutex);
>> +
>> +       return ret;
>> +}
>> +
>> +static int amdgpu_debugfs_autodump_release(struct inode *inode, 
>> +struct file *file) {
>> +       struct amdgpu_device *adev = file->private_data;
>> +
>> +       mutex_lock(&adev->autodump.app_state_mutex);
>> +       complete(&adev->autodump.dumping);
>> +       adev->autodump.app_state = AMDGPU_AUTODUMP_NO_APP;
>> +       mutex_unlock(&adev->autodump.app_state_mutex);
>> +       return 0;
>> +}
>> +
>> +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, 
>> +struct poll_table_struct *poll_table) {
>> +       struct amdgpu_device *adev = file->private_data;
>> +
>> +       mutex_lock(&adev->autodump.app_state_mutex);
>> +       poll_wait(file, &adev->autodump.gpu_hang, poll_table);
>> +       adev->autodump.app_state = AMDGPU_AUTODUMP_LISTENING;
>> +       mutex_unlock(&adev->autodump.app_state_mutex);
>> +
>> +       if (adev->in_gpu_reset)
>> +               return POLLIN | POLLRDNORM | POLLWRNORM;
>> +
>> +       return 0;
>> +}
>> +
>> +static const struct file_operations autodump_debug_fops = {
>> +       .owner = THIS_MODULE,
>> +       .open = amdgpu_debugfs_autodump_open,
>> +       .poll = amdgpu_debugfs_autodump_poll,
>> +       .release = amdgpu_debugfs_autodump_release, };
>> +
>> +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) 
>> +{
>> +       init_completion(&adev->autodump.dumping);
>> +       init_waitqueue_head(&adev->autodump.gpu_hang);
>> +       adev->autodump.app_state = AMDGPU_AUTODUMP_NO_APP;
>> +       mutex_init(&adev->autodump.app_state_mutex);
>> +
>> +       debugfs_create_file("amdgpu_autodump", 0600,
>> +               adev->ddev->primary->debugfs_root,
>> +               adev, &autodump_debug_fops); }
>> +
>>   /**
>>    * amdgpu_debugfs_process_reg_op - Handle MMIO register 
>> reads/writes
>>    *
>> @@ -1434,6 +1522,8 @@ int amdgpu_debugfs_init(struct amdgpu_device 
>> *adev)
>>   
>>           amdgpu_ras_debugfs_create_all(adev);
>>   
>> +       amdgpu_debugfs_autodump_init(adev);
>> +
>>           return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
>>                                           
>> ARRAY_SIZE(amdgpu_debugfs_list));
>>   }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> index de12d1101526..51b4ea790686 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
>> @@ -31,6 +31,19 @@ struct amdgpu_debugfs {
>>           unsigned                num_files;
>>   };
>>   
>> +enum amdgpu_autodump_state {
>> +       AMDGPU_AUTODUMP_NO_APP,
>> +       AMDGPU_AUTODUMP_REGISTERED,
>> +       AMDGPU_AUTODUMP_LISTENING
>> +};
>> +
>> +struct amdgpu_autodump {
>> +       struct mutex                    app_state_mutex;
>> +       enum amdgpu_autodump_state      app_state;
>> +       struct completion               dumping;
>> +       struct wait_queue_head          gpu_hang; };
>> +
>>   int amdgpu_debugfs_regs_init(struct amdgpu_device *adev);
>>   int amdgpu_debugfs_init(struct amdgpu_device *adev);
>>   void amdgpu_debugfs_fini(struct amdgpu_device *adev); @@ -40,3 
>> +53,4 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>>   int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
>>   int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
>>   int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
>> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e6978a2c26b7..8109946075b1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3912,6 +3912,8 @@ static int amdgpu_device_pre_asic_reset(struct 
>> amdgpu_device *adev,
>>           int i, r = 0;
>>           bool need_full_reset  = *need_full_reset_arg;
>>   
>> +       amdgpu_debugfs_wait_dump(adev);
>> +
>>           /* block all schedulers and reset given job's ring */
>>           for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>                   struct amdgpu_ring *ring = adev->rings[i];
>> --
>> 2.20.1
>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>