[PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"
Felix Kuehling
felix.kuehling at amd.com
Mon Oct 23 23:39:25 UTC 2023
On 2023-10-23 11:15, Christian König wrote:
> On 23.10.23 at 15:06, Daniel Tang wrote:
>> That commit causes the screen to freeze a few moments after running
>> clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer
>> including ssh also freezes. On v6.5-rc1, it only results in a NULL
>> pointer dereference message in dmesg and the process becoming a zombie
>> whose unkillableness prevents shutdown without REISUB. Although llama.cpp
>> and hashcat worked with v6.2 and ROCm 5.6, have since broken, and are not
>> fixed by this revert, pytorch-rocm now works stably and without the
>> whole-computer freezes caused by accidentally running clinfo.
>>
>> This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.
>
> That result doesn't make much sense. Felix please correct me, but
> AFAIK the ATS stuff was completely removed by now.
>
> Are you sure that this is pure v6.6-rc7 without any other patches
> applied? If yes, then we must have missed something.
This revert doesn't really affect systems with ATS. It moves the sanity
check back out of the ATS-specific code.

The NULL pointer dereference in the bug report comes from the CPU page
table update code.

This revert is just a roundabout way of disabling CPU page table updates
for compute VMs. But I don't think it really addresses the root cause.
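
To illustrate the control-flow difference, here is a minimal toy model, not
the driver code. struct toy_vm and toy_make_compute() are hypothetical
stand-ins, and the assumption that the CPU update mode is switched on only
after the ATS block is a simplification of amdgpu_vm_make_compute(). With the
check outside the ATS branch, a VM whose root PD is not clean fails with
-EINVAL before the update mode is touched, even when the ATS setting does
not change.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct amdgpu_vm. */
struct toy_vm {
	bool root_clean;         /* models amdgpu_vm_pt_is_root_clean() */
	bool pte_support_ats;    /* current ATS setting of the VM */
	bool use_cpu_for_update; /* assumed to be set late in make_compute */
};

/*
 * check_outside_ats == true models the code with this revert applied,
 * check_outside_ats == false models the code before the revert.
 */
static int toy_make_compute(struct toy_vm *vm, bool pte_support_ats,
			    bool check_outside_ats)
{
	/* With the revert: the sanity check runs unconditionally. */
	if (check_outside_ats && !vm->root_clean)
		return -EINVAL;

	if (pte_support_ats != vm->pte_support_ats) {
		/* Without the revert: checked only when ATS state changes. */
		if (!check_outside_ats && !vm->root_clean)
			return -EINVAL;
		vm->pte_support_ats = pte_support_ats;
	}

	/* Assumption: the update mode is switched only after this point. */
	vm->use_cpu_for_update = true;
	return 0;
}
```

In this model, a VM with root_clean == false and an unchanged ATS setting is
rejected when check_outside_ats is true and converted when it is false, which
is why the revert only hides the faulting path instead of fixing it.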
>
> Regards,
> Christian.
>
>>
>> Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596
>> Signed-off-by: Daniel Tang <danielzgtg.opensource at gmail.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++++++------
>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 82f25996ff5e..602f311ab766 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>  	if (r)
>>  		return r;
>>
>> +	/* Sanity checks */
>> +	if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
>> +		r = -EINVAL;
>> +		goto unreserve_bo;
>> +	}
>> +
>>  	/* Check if PD needs to be reinitialized and do it before
>>  	 * changing any other state, in case it fails.
>>  	 */
>>  	if (pte_support_ats != vm->pte_support_ats) {
>> -		/* Sanity checks */
>> -		if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
>> -			r = -EINVAL;
>> -			goto unreserve_bo;
>> -		}
>> -
>>  		vm->pte_support_ats = pte_support_ats;
>>  		r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
>>  				       false);
>> --
>> 2.40.1
>>
>>
>>
>