[PATCH] drm/amdgpu: fix missed gpu info firmware when cache firmware during S3

Huang Rui ray.huang at amd.com
Tue Jun 6 08:33:12 UTC 2017


On Tue, Jun 06, 2017 at 04:00:29PM +0800, Christian König wrote:
> Hi Ray,
> 
> mhm, indeed a nice catch.
> 
> But why do we need to load the gpu info after resume in the first place?
> 
> I mean we already know what GPU we have, loading it again looks
> superfluous to me.
> 

Yes, I agree with you. That was also my original opinion.
But we hit a random bug when device_cache_fw_images was called.

[  558.288976] cache_firmware: amdgpu/vega10_sdma1.bin
[  558.288976] cache_firmware: amdgpu/vega10_sdma.bin ret=0
[  558.288981] fw_set_page_data: fw-amdgpu/vega10_sdma1.bin buf=ffff8803f1e64a80 data=ffffc90002411000 size=17408
[  558.288981] cache_firmware: amdgpu/vega10_sdma1.bin ret=0
[  558.288997] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[  558.289001] IP: devres_for_each_res+0x5e/0x100
[  558.289001] PGD 0
[  558.289002] Oops: 0000 [#3] SMP
[  558.289003] Modules linked in: joydev hid_generic usbhid amdgpu(OE) ttm(OE) drm_kms_helper(OE) drm(OE) i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt rpcsec_gss_krb5 nfsv4 nfs fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core intel_rapl snd_hwdep x86_pkg_temp_thermal intel_powerclamp snd_pcm kvm_intel snd_seq_midi kvm snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer irqbypass snd crct10dif_pclmul soundcore crc32_pclmul ghash_clmulni_intel pcbc mei_me aesni_intel shpchp mei aes_x86_64 crypto_simd glue_helper mac_hid cryptd acpi_pad tpm_infineon nfsd auth_rpcgss nfs_acl coretemp lockd grace sunrpc parport_pc ppdev lp parport autofs4 e1000e ptp nvme mxm_wmi ahci i2c_hid pps_core libahci nvme_core wmi video hid
[  558.289027] CPU: 0 PID: 3742 Comm: pm-suspend Tainted: G      D    OE   4.11.0-custom #7
[  558.289027] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  558.289027] task: ffff8803ebdcd940 task.stack: ffffc900029b0000
[  558.289029] RIP: 0010:devres_for_each_res+0x5e/0x100
[  558.289029] RSP: 0018:ffffc900029b3bc8 EFLAGS: 00010086
[  558.289030] RAX: 000000000000001d RBX: ffff880426aa0c18 RCX: 0000000000000000
[  558.289030] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000092
[  558.289031] RBP: ffffc900029b3c20 R08: 000000000000001d R09: ffffffff821ee601
[  558.289031] R10: 000000000000141d R11: 0000000000000000 R12: ffffffff81566590
[  558.289032] R13: ffffffff81566870 R14: ffffc900029b3c30 R15: ffff880426aa0e98
[  558.289032] FS:  00007f63006c3700(0000) GS:ffff88043ec00000(0000) knlGS:0000000000000000
[  558.289033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  558.289033] CR2: 0000000000000008 CR3: 00000003f257e000 CR4: 00000000003406f0
[  558.289034] Call Trace:
[  558.289036]  ? alloc_fw_cache_entry+0x60/0x60
[  558.289037]  ? request_firmware_nowait+0x140/0x140
[  558.289038]  dev_cache_fw_image+0x46/0x120
[  558.289039]  ? request_firmware_nowait+0x140/0x140
[  558.289040]  dpm_for_each_dev+0x44/0x70
[  558.289041]  fw_pm_notify+0x164/0x190
[  558.289043]  ? prepare_to_wait_event+0x110/0x110
[  558.289044]  notifier_call_chain+0x49/0x70
[  558.289046]  __blocking_notifier_call_chain+0x4d/0x70
[  558.289047]  __pm_notifier_call_chain+0x1f/0x40
[  558.289047]  pm_suspend+0x27f/0x3a0
[  558.289048]  state_store+0x80/0xf0
[  558.289050]  kobj_attr_store+0xf/0x20
[  558.289051]  sysfs_kf_write+0x3a/0x50
[  558.289053]  kernfs_fop_write+0xff/0x180
[  558.289054]  __vfs_write+0x28/0x120
[  558.289056]  ? apparmor_file_permission+0x1a/0x20

So I then checked these functions and found the gpu_info errors. The bug is
random and cannot be reproduced consistently, but we expect S3 suspend and
resume to pass more than 30 cycles. Any ideas?

Thanks,
Ray

