[PATCH] drm/amdgpu: Initialize schedulers before using them

Luben Tuikov luben.tuikov at amd.com
Tue Oct 24 02:55:47 UTC 2023


On 2023-10-23 01:49, Christian König wrote:
> 
> 
> Am 23.10.23 um 05:23 schrieb Luben Tuikov:
>> Initialize ring schedulers before using them, very early in the amdgpu boot,
>> at PCI probe time, specifically at frame-buffer dumb-create at fill-buffer.
>>
>> This was discovered by using dynamic scheduler run-queues, which showed that
>> amdgpu was using a scheduler before calling drm_sched_init(), and the only
>> reason it was working was because sched_rq[] was statically allocated in the
>> scheduler structure. However, the scheduler structure had _not_ been
>> initialized.
>>
>> When switching to dynamically allocated run-queues, this lack of
>> initialization was causing an oops and a blank screen at boot up. This patch
>> fixes this amdgpu bug.
>>
>> This patch depends on the "drm/sched: Convert the GPU scheduler to variable
>> number of run-queues" patch, as that patch prevents subsequent scheduler
>> initialization if a scheduler has already been initialized.
>>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Alex Deucher <Alexander.Deucher at amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling at amd.com>
>> Cc: AMD Graphics <amd-gfx at lists.freedesktop.org>
>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 4e51dce3aab5d6..575ef7e1e30fd4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -60,6 +60,7 @@
>>   #include "amdgpu_atomfirmware.h"
>>   #include "amdgpu_res_cursor.h"
>>   #include "bif/bif_4_1_d.h"
>> +#include "amdgpu_reset.h"
>>   
>>   MODULE_IMPORT_NS(DMA_BUF);
>>   
>> @@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
>>   
>>   		ring = adev->mman.buffer_funcs_ring;
>>   		sched = &ring->sched;
>> +
>> +		r = drm_sched_init(sched, &amdgpu_sched_ops,
>> +				   DRM_SCHED_PRIORITY_COUNT,
>> +				   ring->num_hw_submission, 0,
>> +				   adev->sdma_timeout, adev->reset_domain->wq,
>> +				   ring->sched_score, ring->name,
>> +				   adev->dev);
>> +		if (r) {
>> +			drm_err(adev, "%s: couldn't initialize ring:%s error:%d\n",
>> +				__func__, ring->name, r);
>> +			return;
>> +		}
> 
> That doesn't look correct either.
> 
> amdgpu_ttm_set_buffer_funcs_status() should only be called with 
> enable=true as argument *after* the copy ring is initialized and valid 
> to use. One part of this ring initialization is to setup the scheduler.

It's the only way to keep amdgpu_fill_buffer() functional when it is called
from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(),
from ..., without an oops and a blank screen at boot up.

Here is a stack of the oops:

Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f
Oct 20 22:12:34 fedora kernel: RSP: 0018:ffffc90001613838 EFLAGS: 00010246
Oct 20 22:12:34 fedora kernel: RAX: 0000000000000000 RBX: ffff88812f33b400 RCX: 0000000000000004
Oct 20 22:12:34 fedora kernel: RDX: 0000000000000000 RSI: ffffc9000395145c RDI: ffff88812eacf850
Oct 20 22:12:34 fedora kernel: RBP: ffff88812eacf850 R08: 0000000000000004 R09: 0000000000030000
Oct 20 22:12:34 fedora kernel: R10: ffffffffc066b850 R11: ffffffffbc848ef1 R12: 0000000000000000
Oct 20 22:12:34 fedora kernel: R13: 0000000000000004 R14: 0000008003000000 R15: 0000000001000000
Oct 20 22:12:34 fedora kernel: FS:  00007f7be4866940(0000) GS:ffff88880ed00000(0000) knlGS:0000000000000000
Oct 20 22:12:34 fedora kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 20 22:12:34 fedora kernel: CR2: 0000000000000008 CR3: 000000012cf22000 CR4: 00000000003506e0
Oct 20 22:12:34 fedora kernel: Call Trace:
Oct 20 22:12:34 fedora kernel:  <TASK>
Oct 20 22:12:34 fedora kernel:  ? __die+0x1f/0x70
Oct 20 22:12:34 fedora kernel:  ? page_fault_oops+0x149/0x440
Oct 20 22:12:34 fedora kernel:  ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:12:34 fedora kernel:  ? do_user_addr_fault+0x65/0x650
Oct 20 22:12:34 fedora kernel:  ? drm_client_framebuffer_create+0xa3/0x280 [drm]
Oct 20 22:12:34 fedora kernel:  ? exc_page_fault+0x7b/0x180
Oct 20 22:12:34 fedora kernel:  ? asm_exc_page_fault+0x22/0x30
Oct 20 22:12:34 fedora kernel:  ? local_pci_probe+0x41/0x90
Oct 20 22:12:34 fedora kernel:  ? __pfx_sdma_v5_0_emit_fill_buffer+0x10/0x10 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? drm_sched_job_arm+0x1b/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  amdgpu_job_submit+0xf/0x70 [amdgpu]
Oct 20 22:12:34 fedora kernel:  amdgpu_fill_buffer+0x2b4/0x650 [amdgpu]
Oct 20 22:12:34 fedora kernel:  amdgpu_bo_create+0x401/0x4a0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:12:34 fedora kernel:  amdgpu_bo_create_user+0x24/0x40 [amdgpu]
Oct 20 22:12:34 fedora kernel:  amdgpu_mode_dumb_create+0xf8/0x1a0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? drm_client_framebuffer_create+0x69/0x280 [drm]
Oct 20 22:12:34 fedora kernel:  ? __pfx_amdgpu_bo_user_destroy+0x10/0x10 [amdgpu]
Oct 20 22:12:34 fedora kernel:  drm_client_framebuffer_create+0xa3/0x280 [drm]
Oct 20 22:12:34 fedora kernel:  ? amdgpu_vm_bo_add+0x2a/0xb0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  drm_fbdev_generic_helper_fb_probe+0x61/0x190 [drm_kms_helper]
Oct 20 22:12:34 fedora kernel:  __drm_fb_helper_initial_config_and_unlock+0x297/0x500 [drm_kms_helper]
Oct 20 22:12:34 fedora kernel:  drm_fbdev_generic_client_hotplug+0x66/0xb0 [drm_kms_helper]
Oct 20 22:12:34 fedora kernel:  drm_client_register+0x75/0xb0 [drm]
Oct 20 22:12:34 fedora kernel:  amdgpu_pci_probe+0x3ac/0x440 [amdgpu]
Oct 20 22:12:34 fedora kernel:  local_pci_probe+0x41/0x90
Oct 20 22:12:34 fedora kernel:  pci_device_probe+0xb3/0x210
Oct 20 22:12:34 fedora kernel:  really_probe+0x19e/0x3e0
Oct 20 22:12:34 fedora kernel:  ? __pfx___driver_attach+0x10/0x10
Oct 20 22:12:34 fedora kernel:  __driver_probe_device+0x78/0x160
Oct 20 22:12:34 fedora kernel:  driver_probe_device+0x1f/0x90
Oct 20 22:12:34 fedora kernel:  __driver_attach+0xce/0x1c0
Oct 20 22:12:34 fedora kernel:  bus_for_each_dev+0x63/0xa0
Oct 20 22:12:34 fedora kernel:  bus_add_driver+0x112/0x210
Oct 20 22:12:34 fedora kernel:  driver_register+0x55/0x100
Oct 20 22:12:34 fedora kernel:  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
Oct 20 22:12:34 fedora kernel:  do_one_initcall+0x46/0x310
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:12:34 fedora kernel:  ? kmalloc_trace+0x26/0x90
Oct 20 22:12:34 fedora kernel:  do_init_module+0x60/0x230
Oct 20 22:12:34 fedora kernel:  init_module_from_file+0x75/0xa0
Oct 20 22:12:34 fedora kernel:  idempotent_init_module+0xf9/0x270
Oct 20 22:12:34 fedora kernel:  __x64_sys_finit_module+0x5a/0xb0
Oct 20 22:12:34 fedora kernel:  do_syscall_64+0x3b/0x90
Oct 20 22:12:34 fedora kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8

The oops happens at PCI probe time, when DRM probes for an fbdev.
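For reference, the faulting address in the oops (CR2: 0000000000000008) is consistent with a load at offset 8 through a NULL run-queue pointer: drm_sched_job_arm() reads entity->rq->sched, and the rq pointer is NULL because drm_sched_init() was never called on the scheduler. A rough userspace model of that failure mode (the toy_* names and layout are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of why arming a job on an entity whose scheduler was
 * never initialized faults: the entity's run-queue pointer is NULL, and
 * arming dereferences it at a small offset (hence CR2 = 0x8 in the oops).
 * All toy_* names are hypothetical; this is not the kernel code. */

struct toy_sched;
struct toy_rq {
	void *lock;                 /* padding so sched sits at offset 8 on 64-bit */
	struct toy_sched *sched;
};
struct toy_entity { struct toy_rq *rq; };
struct toy_job { struct toy_sched *sched; };

/* Returns 0 on success, -1 if the entity was never attached to an
 * initialized scheduler; the kernel equivalent has no such check and
 * simply faults on the NULL + 8 load. */
int toy_job_arm(struct toy_job *job, struct toy_entity *entity)
{
	if (!entity->rq)
		return -1;                       /* scheduler never initialized */
	job->sched = entity->rq->sched;          /* the offset-8 load from the oops */
	return 0;
}
```

With a statically allocated sched_rq[] the pointer happened to be usable even without drm_sched_init(), which is why the bug stayed hidden until the run-queues became dynamically allocated.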

> 
> 
>> +
>>   		r = drm_sched_entity_init(&adev->mman.high_pr,
>>   					  DRM_SCHED_PRIORITY_KERNEL, &sched,
>>   					  1, NULL);
> 
> That here looks totally incorrect and misplaced to me. 
> amdgpu_ttm_set_buffer_funcs_status() should only enabled/disable using 
> the copy functions and not really initialize the entity.
> 
> So the entity should only be created when enable=true and it should 
> especially *not* re-created all the time without properly destroying it.
> 
> Can you look at the history of the code? I'm pretty sure that this was 
> at some time correctly implemented.

Yeah, the drm_sched_entity_init() line above--which is not part of this
patch--comes from this commit:

commit c3aaca43fb07ce
Author: Mukul Joshi <mukul.joshi at amd.com>
Date:   Wed May 17 15:53:50 2023 -0400

    drm/amdgpu: Add a low priority scheduler for VRAM clearing
    
    Add a low priority DRM scheduler for VRAM clearing instead of using
    the exisiting high priority scheduler. Use the high priority scheduler
    for migrations and evictions.
    
    Signed-off-by: Mukul Joshi <mukul.joshi at amd.com>
    Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher at amd.com>

The options are:

a) Revert c3aaca43fb07ce.

b) Let this patch in, so that it doesn't block the DRM dynamic sched_rq
   work, and fix this properly later, locally in amdgpu, since it is a
   driver issue. Had the dynamic sched_rq been in before May 17 of this
   year, c3aaca43fb07ce couldn't have gone in (due to the oops). More
   details in the commit message of this patch above.
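Option b) leans on the prerequisite "variable number of run-queues" patch, which rejects a second drm_sched_init() on an already-initialized scheduler, so the early init in amdgpu_ttm_set_buffer_funcs_status() and the later ring init don't stomp on each other. A rough userspace model of that guard (the toy_* names are hypothetical, not the kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative model of the double-init guard from the "variable number of
 * run-queues" prerequisite patch: once sched_rq has been allocated, a second
 * init call bails out instead of re-initializing live state. Names and
 * layout are hypothetical; this is not the kernel implementation. */

struct toy_sched_rq { int prio; };

struct toy_gpu_scheduler {
	struct toy_sched_rq **sched_rq;   /* dynamically allocated per-priority run-queues */
	unsigned int num_rqs;
};

int toy_sched_init(struct toy_gpu_scheduler *sched, unsigned int num_rqs)
{
	unsigned int i;

	/* Guard: already initialized -- reject the second call. */
	if (sched->sched_rq)
		return -EINVAL;

	sched->sched_rq = calloc(num_rqs, sizeof(*sched->sched_rq));
	if (!sched->sched_rq)
		return -ENOMEM;
	for (i = 0; i < num_rqs; i++) {
		sched->sched_rq[i] = calloc(1, sizeof(*sched->sched_rq[i]));
		if (!sched->sched_rq[i])
			return -ENOMEM;
		sched->sched_rq[i]->prio = (int)i;
	}
	sched->num_rqs = num_rqs;
	return 0;
}
```

Under this sketch, the early init from amdgpu_ttm_set_buffer_funcs_status() succeeds, and the subsequent init during normal ring setup returns -EINVAL harmlessly rather than leaking or corrupting the run-queues.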

I'm writing this from a 6.6.0-rc6 kernel with the DRM dynamic sched_rq patch and this patch applied.

> 
> Thanks,
> Christian.
> 
>>
>> base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
>> prerequisite-patch-id: c52673df9b6fc9ee001d6261c7ac107b618912a0
> 

-- 
Regards,
Luben
