[PATCH 0/7] *** GPU recover V3 ***

Alex Deucher alexdeucher at gmail.com
Thu Nov 9 18:08:41 UTC 2017


On Thu, Nov 9, 2017 at 4:35 AM, Julien Isorce <julien.isorce at gmail.com> wrote:
> Hi Monk.
>
> I am interested in this. Currently, when a "ring X stalled for more than N
> sec" error happens, the driver usually goes into the gpu reset routine.
> Does it always cause the vram to be lost? Could you explain what happens if
> the vram remains lost?

It means the contents of vram are gone or unreliable.  In that case
applications need to re-initialize all of their buffers before
submitting any more work.  You really need to add GL robustness support
(e.g., GL_ARB_robustness) to any applications you care about.  Whether
vram is lost or not depends on the reset method and the asic.  E.g., a
soft reset of a specific engine won't cause a loss of vram, but a full
adapter reset or an FLR (function level reset) may.
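
As a rough illustration of the application side (a sketch only, not
taken from this patch set): an app polls the reset status once per
frame and rebuilds its buffers whenever a reset is reported.  The
enum values are the ones defined by GL_KHR_robustness / OpenGL 4.5.
Everything else here (struct app_state, app_reinit, app_check_reset
and the two fake_* query functions) is hypothetical and only stands
in for a real context and driver.

```c
/* Reset-status values as defined by GL_KHR_robustness / OpenGL 4.5. */
#define GL_NO_ERROR                0
#define GL_GUILTY_CONTEXT_RESET    0x8253
#define GL_INNOCENT_CONTEXT_RESET  0x8254
#define GL_UNKNOWN_CONTEXT_RESET   0x8255

/* Hypothetical application state: only tracks whether buffers are usable. */
struct app_state {
    int buffers_valid;  /* 1 once all buffers have been (re)uploaded */
    int reinit_count;   /* how many times everything had to be rebuilt */
};

/* Same shape as glGetGraphicsResetStatus(): takes nothing, returns a GLenum. */
typedef unsigned int (*reset_status_fn)(void);

/* Hypothetical stand-ins for the driver, so the sketch runs without a GPU. */
static unsigned int fake_no_reset(void)       { return GL_NO_ERROR; }
static unsigned int fake_innocent_reset(void) { return GL_INNOCENT_CONTEXT_RESET; }

/* Hypothetical helper: a real app would recreate its context here and
 * re-upload every buffer and texture it owns. */
static void app_reinit(struct app_state *app)
{
    app->buffers_valid = 1;
    app->reinit_count++;
}

/* Call once per frame before submitting work.  Returns 1 if a reset was seen. */
static int app_check_reset(struct app_state *app, reset_status_fn query)
{
    unsigned int status = query();

    if (status == GL_NO_ERROR)
        return 0;

    /* Guilty, innocent or unknown: buffer contents cannot be trusted,
     * so treat every case as "vram lost" and rebuild. */
    app->buffers_valid = 0;
    app_reinit(app);
    return 1;
}
```

In a real application the query pointer would be
glGetGraphicsResetStatus, and the context has to be created with
robust access and the LOSE_CONTEXT_ON_RESET notification strategy;
otherwise the query just keeps returning GL_NO_ERROR.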

>
> I am asking this because I experienced some recurrent gpu resets that are
> marked as succeeded in the log but fail in the "resume" step.
> I would not be interested in this if it always left the user a chance to
> cleanly reboot the machine.
>
> The issue is that it can require a hard reboot, with no kernel panic and
> with the keyboard no longer responding to magic keys.
> Are those patches trying to address this issue?
>
> Note that here "issue" does not refer to the root cause of the ring X
> stall, nor to why the "resume" step fails.

There were a few issues that caused problems with GPU reset.  The
biggest was that the GPU scheduler deadlocked in certain cases so if
you got a GPU hang, the driver locked up.  That should mostly be
straightened out at this point.  I think there may still be some
deadlocks in the modesetting code after a reset.  Once that is sorted,
it will come down to fine tuning the actual reset sequences.  Full
adapter resets are the easiest to get working reliably (and are
already implemented in the driver), but also the most destructive.

Alex

>
> Thx a lot
> Julien
>
>
> On 30 October 2017 at 04:15, Monk Liu <Monk.Liu at amd.com> wrote:
>>
>> *** job skipping logic in scheduler part is re-implemented  ***
>>
>> Monk Liu (7):
>>   amd/scheduler:imple job skip feature(v3)
>>   drm/amdgpu:implement new GPU recover(v3)
>>   drm/amdgpu:cleanup in_sriov_reset and lock_reset
>>   drm/amdgpu:cleanup ucode_init_bo
>>   drm/amdgpu:block kms open during gpu_reset
>>   drm/amdgpu/sriov:fix memory leak in psp_load_fw
>>   drm/amdgpu:fix random missing of FLR NOTIFY
>>
>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   9 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 311 ++++++++++++--------------
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  10 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c       |   2 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  18 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |   3 +
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  22 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c     |   4 +-
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      |   2 -
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |   2 -
>>  drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c         |   6 +-
>>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         |   6 +-
>>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  16 +-
>>  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |   2 +-
>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  39 ++--
>>  15 files changed, 220 insertions(+), 232 deletions(-)
>>
>> --
>> 2.7.4
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
