[PATCH 2/2] drm/amdgpu: fix vcn sw init failed

Wed Nov 13 05:22:22 UTC 2024

[AMD Official Use Only - AMD Internal Distribution Only]

HI Alex,

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar at amd.com>
Sent: Wednesday, November 13, 2024 1:03 PM
To: Alex Deucher <alexdeucher at gmail.com>
Cc: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Prosyak, Vitaly <Vitaly.Prosyak at amd.com>; Huang, Tim <Tim.Huang at amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed

On 11/13/2024 10:16 AM, Alex Deucher wrote:
> On Tue, Nov 12, 2024 at 10:24 PM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>>
>>
>>
>> On 11/13/2024 7:58 AM, Zhang, Jesse(Jie) wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> Hi, Lijo,
>>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>> Sent: Tuesday, November 12, 2024 10:54 PM
>>> To: Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>;
>>> amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig,
>>> Christian <Christian.Koenig at amd.com>; Prosyak, Vitaly
>>> <Vitaly.Prosyak at amd.com>; Huang, Tim <Tim.Huang at amd.com>
>>> Subject: Re: [PATCH 2/2] drm/amdgpu: fix vcn sw init failed
>>>
>>>
>>>
>>> On 11/12/2024 8:00 PM, Jesse.zhang at amd.com wrote:
>>>> [ 2875.870277] [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of
>>>> IP block <vcn_v4_0_3> failed -22 [ 2875.880494] amdgpu 0000:01:00.0:
>>>> amdgpu: amdgpu_device_ip_init failed [ 2875.887689] amdgpu
>>>> 0000:01:00.0: amdgpu: Fatal error during GPU init [ 2875.894791] amdgpu 0000:01:00.0: amdgpu: amdgpu: finishing device.
>>>>
>>>> Add irqs with different IRQ source pointer for vcn0 and vcn1.
>>>>
>>>> Signed-off-by: Jesse Zhang <jesse.zhang at amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 19 +++++++++++++------
>>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> index ef3dfd44a022..82b90f1e6f33 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c
>>>> @@ -83,6 +83,10 @@ static const struct amdgpu_hwip_reg_entry
>>>> vcn_reg_list_4_0_3[] = {
>>>>
>>>>  #define NORMALIZE_VCN_REG_OFFSET(offset) \
>>>>               (offset & 0x1FFFF)
>>>> +static int amdgpu_ih_clientid_vcns[] = {
>>>> +     SOC15_IH_CLIENTID_VCN,
>>>> +     SOC15_IH_CLIENTID_VCN1
>>>
>>> This is not valid for 4.0.3. It uses only the same client id, different node_id to distinguish. Also, there are max of 4 instances.
>>>
>>> I would say that entire IP instance series was done in a haste without applying thought and breaks other things including ip block mask.
>>>
>>> If the same client id is used, it returns -EINVAL (because of the following check) and sw init fails at step vcn_v4_0_3_sw_init / amdgpu_irq_add_id.
>>>
>>> amdgpu_irq_add_id:
>>> if (adev->irq.client[client_id].sources[src_id] != NULL)
>>>         return -EINVAL;
>>>
>>
>> We had some side discussions on IP block-per-instance approach.
>> Personally, I was not in favour of it as I thought allowing IP block
>> to handle its own instances is the better approach and that could
>> handle dependencies between instances. Turns out that there are more
>> like handling common things for all instances as in this example.
>
> We tried that route as well before this one and it was ugly as well.
>
>>
>> I would prefer to revert the patch series and consider all angles
>> before moving forward on the approach. Will leave this to
>> Alex/Christian/Leo on the final decsion.
>
> Do the attached patches fix it?

Yes, using your patch vcn sw init will pass.
Please help to check this patch: [PATCH 1/2] drm/admgpu: fix vcn sw init failed
Otherwise duplicate sysfs warnings will be displayed.

 sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.1/0000:01:00.0/0000:02:00.0/0000:03:00.0/vcn_reset_mask'
[  562.443738] CPU: 13 PID: 4888 Comm: modprobe Tainted: G            E      6.10.0+ #51
[  562.443740] Hardware name: AMD Splinter/Splinter-RPL, BIOS VS2683299N.FD 05/10/2023
[  562.443741] Call Trace:
[  562.443743]  <TASK>
[  562.443746]  dump_stack_lvl+0x70/0x90
[  562.443751]  dump_stack+0x14/0x20
[  562.443753]  sysfs_warn_dup+0x60/0x80
[  562.443757]  sysfs_add_file_mode_ns+0x126/0x130
[  562.443760]  sysfs_create_file_ns+0x68/0xa0
[  562.443762]  device_create_file+0x46/0x90
[  562.443766]  amdgpu_vcn_sysfs_reset_mask_init+0x1c/0x30 [amdgpu]
[  562.443991]  vcn_v4_0_3_sw_init+0x270/0x3e0 [amdgpu]
[  562.444120]  amdgpu_device_init+0x1a0e/0x35a0 [amdgpu]

Regards
Jesse

This is kind of a piece-meal fix. This doesn't address the larger problem of how to handle things common for all IP instances.

Thanks,
Lijo

> Alex
>
>>
>> Thanks,
>> Lijo
>>
>>> Regards
>>> Jesse
>>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> +};
>>>>
>>>>  static int vcn_v4_0_3_start_sriov(struct amdgpu_device *adev);
>>>> static void vcn_v4_0_3_set_unified_ring_funcs(struct amdgpu_device
>>>> *adev, int inst); @@ -150,9 +154,9 @@ static int vcn_v4_0_3_sw_init(struct amdgpu_ip_block *ip_block)
>>>>       if (r)
>>>>               return r;
>>>>
>>>> -     /* VCN DEC TRAP */
>>>> -     r = amdgpu_irq_add_id(adev, SOC15_IH_CLIENTID_VCN,
>>>> -             VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE, &adev->vcn.inst->irq);
>>>> +     /* VCN UNIFIED TRAP */
>>>> +     r = amdgpu_irq_add_id(adev, amdgpu_ih_clientid_vcns[inst],
>>>> +                     VCN_4_0__SRCID__UVD_ENC_GENERAL_PURPOSE,
>>>> +&adev->vcn.inst[inst].irq);
>>>>       if (r)
>>>>               return r;
>>>>
>>>> @@ -174,7 +178,7 @@ static int vcn_v4_0_3_sw_init(struct
>>>> amdgpu_ip_block *ip_block)
>>>>
>>>>       ring->vm_hub = AMDGPU_MMHUB0(adev->vcn.inst[inst].aid_id);
>>>>       sprintf(ring->name, "vcn_unified_%d", adev->vcn.inst[inst].aid_id);
>>>> -     r = amdgpu_ring_init(adev, ring, 512, &adev->vcn.inst->irq, 0,
>>>> +     r = amdgpu_ring_init(adev, ring, 512,
>>>> + &adev->vcn.inst[inst].irq, 0,
>>>>                                AMDGPU_RING_PRIO_DEFAULT,
>>>>                                &adev->vcn.inst[inst].sched_score);
>>>>       if (r)
>>>> @@ -1734,9 +1738,12 @@ static const struct amdgpu_irq_src_funcs vcn_v4_0_3_irq_funcs = {
>>>>   */
>>>>  static void vcn_v4_0_3_set_irq_funcs(struct amdgpu_device *adev,
>>>> int
>>>> inst)  {
>>>> -     adev->vcn.inst->irq.num_types++;
>>>> +     if (adev->vcn.harvest_config & (1 << inst))
>>>> +             return;
>>>> +
>>>> +     adev->vcn.inst[inst].irq.num_types = adev->vcn.num_enc_rings
>>>> + + 1;
>>>>
>>>> -     adev->vcn.inst->irq.funcs = &vcn_v4_0_3_irq_funcs;
>>>> +     adev->vcn.inst[inst].irq.funcs = &vcn_v4_0_3_irq_funcs;
>>>>  }
>>>>
>>>>  static void vcn_v4_0_3_print_ip_state(struct amdgpu_ip_block
>>>> *ip_block, struct drm_printer *p)