[PATCH] drm/amdgpu: fix missed gpu info firmware when cache firmware during S3

Tue Jun 6 14:45:42 UTC 2017

On Tue, Jun 6, 2017 at 10:03 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
> On Tue, Jun 6, 2017 at 7:22 AM, Christian König
> <christian.koenig at amd.com> wrote:
>>> Yes, I agree with you. That was also my original opinion.
>>> But we encountered a random bug when calling
>>> device_cache_fw_images.
>>
>> That looks like an upstream bug in device_cache_fw_images.
>>
>> We should probably open a bug report and ping the maintainer. Most likely we
>> are not using the FW interface correctly, or we are triggering a rare bug, or
>> something like that.
>>
>>> So I then checked these functions and found the gpu_info errors. The bug is
>>> random and cannot be reproduced consistently, but we expect the driver to
>>> pass more than 30 cycles of S3 suspend and resume. Any ideas?
>>
>> I think the real solution is to just stop calling
>> amdgpu_device_parse_gpu_info_fw() during resume.
>
> Right. We only need to parse the firmware once during startup.

How are we hitting this on resume?  amdgpu_device_parse_gpu_info_fw() is
called indirectly from amdgpu_device_init() which is only called once
at driver load time.
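
To illustrate the flow I would expect, here is a minimal userspace sketch in C.
It is not the driver code: every sketch_* name and the value 4 are hypothetical
stand-ins, and it only assumes that the parsed config stays in memory across S3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the device structure; NOT the real amdgpu_device. */
struct sketch_dev {
	unsigned int parse_calls;        /* how often the gpu info was parsed */
	unsigned int max_shader_engines; /* stands in for the adev->gfx.config fields */
};

/* Stand-in for amdgpu_device_parse_gpu_info_fw(): fills the config once. */
static int sketch_parse_gpu_info(struct sketch_dev *dev)
{
	assert(dev != NULL);
	dev->parse_calls++;
	dev->max_shader_engines = 4; /* made-up value, for illustration only */
	return 0;
}

/* Load-time path: the only place that parses the gpu info. */
static int sketch_device_init(struct sketch_dev *dev)
{
	assert(dev != NULL);
	dev->parse_calls = 0;
	dev->max_shader_engines = 0;
	return sketch_parse_gpu_info(dev);
}

/* Resume path: the config is still in memory, so nothing is parsed again. */
static int sketch_device_resume(struct sketch_dev *dev)
{
	assert(dev != NULL);
	return dev->max_shader_engines ? 0 : -1;
}

/* Run a number of simulated S3 cycles and count the successful resumes. */
static unsigned int sketch_suspend_resume_cycles(struct sketch_dev *dev,
						 unsigned int cycles)
{
	unsigned int i, ok = 0;

	for (i = 0; i < cycles; i++)
		if (sketch_device_resume(dev) == 0)
			ok++;
	return ok;
}
```

With this shape, calling sketch_device_init() once and then running 30 cycles
through sketch_suspend_resume_cycles() leaves parse_calls at 1, which is the
behaviour I would expect from the real init/resume split.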

Alex

>
> Alex
>
>>
>> That function just sets up the adev->gfx.config fields and that is
>> unnecessary after resume.
>>
>> Regards,
>> Christian.
>>
>>
>> Am 06.06.2017 um 10:33 schrieb Huang Rui:
>>>
>>> On Tue, Jun 06, 2017 at 04:00:29PM +0800, Christian König wrote:
>>>>
>>>> Hi Ray,
>>>>
>>>> mhm, indeed a nice catch.
>>>>
>>>> But why do we need to load the gpu info after resume in the first place?
>>>>
>>>> I mean we already know what GPU we have, loading it again looks
>>>> superfluous to me.
>>>>
>>> Yes, I agree with you. That was also my original opinion.
>>> But we encountered a random bug when calling
>>> device_cache_fw_images.
>>>
>>> [  558.288976] cache_firmware: amdgpu/vega10_sdma1.bin
>>> [  558.288976] cache_firmware: amdgpu/vega10_sdma.bin ret=0
>>> [  558.288981] fw_set_page_data: fw-amdgpu/vega10_sdma1.bin buf=ffff8803f1e64a80 data=ffffc90002411000 size=17408
>>> [  558.288981] cache_firmware: amdgpu/vega10_sdma1.bin ret=0
>>> [  558.288997] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>>> [  558.289001] IP: devres_for_each_res+0x5e/0x100
>>> [  558.289001] PGD 0
>>> [  558.289002] Oops: 0000 [#3] SMP
>>> [  558.289003] Modules linked in: joydev hid_generic usbhid amdgpu(OE) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 nfsv4 nfs fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core intel_rapl snd_hwdep x86_pkg_temp_thermal intel_powerclamp snd_pcm kvm_intel snd_seq_midi kvm snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer irqbypass snd crct10dif_pclmul soundcore crc32_pclmul ghash_clmulni_intel pcbc mei_me aesni_intel shpchp mei aes_x86_64 crypto_simd glue_helper mac_hid cryptd acpi_pad tpm_infineon nfsd auth_rpcgss nfs_acl coretemp lockd grace sunrpc parport_pc ppdev lp parport autofs4 e1000e ptp nvme mxm_wmi ahci i2c_hid pps_core libahci nvme_core wmi video hid
>>> [  558.289027] CPU: 0 PID: 3742 Comm: pm-suspend Tainted: G      D    OE 4.11.0-custom #7
>>> [  558.289027] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
>>> [  558.289027] task: ffff8803ebdcd940 task.stack: ffffc900029b0000
>>> [  558.289029] RIP: 0010:devres_for_each_res+0x5e/0x100
>>> [  558.289029] RSP: 0018:ffffc900029b3bc8 EFLAGS: 00010086
>>> [  558.289030] RAX: 000000000000001d RBX: ffff880426aa0c18 RCX: 0000000000000000
>>> [  558.289030] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000092
>>> [  558.289031] RBP: ffffc900029b3c20 R08: 000000000000001d R09: ffffffff821ee601
>>> [  558.289031] R10: 000000000000141d R11: 0000000000000000 R12: ffffffff81566590
>>> [  558.289032] R13: ffffffff81566870 R14: ffffc900029b3c30 R15: ffff880426aa0e98
>>> [  558.289032] FS:  00007f63006c3700(0000) GS:ffff88043ec00000(0000) knlGS:0000000000000000
>>> [  558.289033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  558.289033] CR2: 0000000000000008 CR3: 00000003f257e000 CR4: 00000000003406f0
>>> [  558.289034] Call Trace:
>>> [  558.289036]  ? alloc_fw_cache_entry+0x60/0x60
>>> [  558.289037]  ? request_firmware_nowait+0x140/0x140
>>> [  558.289038]  dev_cache_fw_image+0x46/0x120
>>> [  558.289039]  ? request_firmware_nowait+0x140/0x140
>>> [  558.289040]  dpm_for_each_dev+0x44/0x70
>>> [  558.289041]  fw_pm_notify+0x164/0x190
>>> [  558.289043]  ? prepare_to_wait_event+0x110/0x110
>>> [  558.289044]  notifier_call_chain+0x49/0x70
>>> [  558.289046]  __blocking_notifier_call_chain+0x4d/0x70
>>> [  558.289047]  __pm_notifier_call_chain+0x1f/0x40
>>> [  558.289047]  pm_suspend+0x27f/0x3a0
>>> [  558.289048]  state_store+0x80/0xf0
>>> [  558.289050]  kobj_attr_store+0xf/0x20
>>> [  558.289051]  sysfs_kf_write+0x3a/0x50
>>> [  558.289053]  kernfs_fop_write+0xff/0x180
>>> [  558.289054]  __vfs_write+0x28/0x120
>>> [  558.289056]  ? apparmor_file_permission+0x1a/0x20
>>>
>>> So I then checked these functions and found the gpu_info errors. The bug is
>>> random and cannot be reproduced consistently, but we expect the driver to
>>> pass more than 30 cycles of S3 suspend and resume. Any ideas?
>>>
>>> Thanks,
>>> Ray
>>