[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
Danilo Krummrich
dakr at kernel.org
Mon Oct 14 15:31:48 UTC 2024
On Sun, Oct 13, 2024 at 06:27:32PM +0000, Zhi Wang wrote:
> On 04/10/2024 20.16, Danilo Krummrich wrote:
> > External email: Use caution opening links or attachments
> >
> >
>
> Hey Danilo. I am just back from my vacation. Sorry for the delay. See my
> comments below.
Hi Zhi. No worries, just so you know I may have a bit of delay in the coming
weeks too -- vacation, relocation...
>
> > On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:
> >> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
> >> buffer when handling a large RPC request. When it sees at least one
> >> available page in the cmdq, it quits the waiting with the amount of
> >> free buffer pages in the queue.
> >>
> >> Unfortunately, it always takes the [write pointer, buf_size) as
> >> available buffer pages before rolling back and wrongly calculates the
> >> size of the data should be copied. Thus, it can overwrite the RPC
> >> request that GSP is currently reading, which causes GSP hang due
> >> to corrupted RPC request:
> >>
> >> [ 549.209389] ------------[ cut here ]------------
> >> [ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [ 549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
> >> [ 549.225752] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
> >> [ 549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G E 6.9.0-rc6+ #1
> >> [ 549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
> >> [ 549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [ 549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
> >> [ 549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
> >> [ 549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
> >> [ 549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
> >> [ 549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
> >> [ 549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
> >> [ 549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
> >> [ 549.406988] FS: 00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
> >> [ 549.415076] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
> >> [ 549.427963] PKRU: 55555554
> >> [ 549.430672] Call Trace:
> >> [ 549.433126] <TASK>
> >> [ 549.435233] ? __warn+0x7f/0x130
> >> [ 549.438473] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [ 549.443426] ? report_bug+0x18a/0x1a0
> >> [ 549.447098] ? handle_bug+0x3c/0x70
> >> [ 549.450589] ? exc_invalid_op+0x14/0x70
> >> [ 549.454430] ? asm_exc_invalid_op+0x16/0x20
> >> [ 549.458619] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [ 549.463565] r535_gsp_msg_recv+0x46/0x230 [nvkm]
> >> [ 549.468257] r535_gsp_rpc_push+0x106/0x160 [nvkm]
> >> [ 549.473033] r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
> >> [ 549.478422] nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
> >> [ 549.483899] nvidia_grid_init+0xb1/0xd0 [nvkm]
> >> [ 549.488420] ? srso_alias_return_thunk+0x5/0xfbef5
> >> [ 549.493213] nvkm_device_pci_probe+0x305/0x420 [nvkm]
> >> [ 549.498338] local_pci_probe+0x46/0xa0
> >> [ 549.502096] pci_call_probe+0x56/0x170
> >> [ 549.505851] pci_device_probe+0x79/0xf0
> >> [ 549.509690] ? driver_sysfs_add+0x59/0xc0
> >> [ 549.513702] really_probe+0xd9/0x380
> >> [ 549.517282] __driver_probe_device+0x78/0x150
> >> [ 549.521640] driver_probe_device+0x1e/0x90
> >> [ 549.525746] __driver_attach+0xd2/0x1c0
> >> [ 549.529594] ? __pfx___driver_attach+0x10/0x10
> >> [ 549.534045] bus_for_each_dev+0x78/0xd0
> >> [ 549.537893] bus_add_driver+0x112/0x210
> >> [ 549.541750] driver_register+0x5c/0x120
> >> [ 549.545596] ? __pfx_nvkm_init+0x10/0x10 [nvkm]
> >> [ 549.550224] do_one_initcall+0x44/0x300
> >> [ 549.554063] ? do_init_module+0x23/0x240
> >> [ 549.557989] do_init_module+0x64/0x240
> >>
> >> Calculate the available buffer page before rolling back based on
> >> the result from the waiting.
> >
> > It looks like you hit this one while working on the VFIO stuff too. So, same
> > question here,
>
> Yes. But theses bugs are not specific to vGPU because two-page GSP RPC
> are part of valid RPC vehicle format of GSP RPC protocol family. The
> fixes are for a better sophisticated GSP RPC handling in Nouveau. Other
> GSP RPC can use this vehicle format as well.
Yeah, that's fine. I just try to figure out whether we need those patches in
-fixes, or if we can take them through -next.
>
> can we hit this case with "vanilla nouveau"?
>
> Not yet. But introducing new GSP RPCs that using this vehicle format
> (related to vGPU/not-related to vGPU) in nouveau might hit this bug later.
Great, so we can take all three through -next, right?
>
> Out of curiostiy, do we have any unit-test package or flows to test the
> patches? Like CIs.
There's igt_gpu_tools [1], but unfortunately Nouveau has practically no test
cases.
> I am using the Phoronix test suite in the ubuntu with
> a PPA repo that has latest mesa/drm userspace libraries.
I typically use Vulkan CTS [2] and some manual smoke testing for the KMS parts.
However, for the long term I definitely want to get to have proper (unit) tests
additionally.
For Nova we can easily use KUnit tests through examples in the code
documentation. Additionally, I could think of something analogous to
igt_gpu_tools I already played around with for Nova [3].
[1] https://gitlab.freedesktop.org/drm/igt-gpu-tools
[2] https://github.com/KhronosGroup/VK-GL-CTS
[3] https://gitlab.freedesktop.org/dakr/drm-test
>
> It would be nice that I can align with others. :)
>
> Thanks,
> Zhi.
> >
>
> >>
> >> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
> >
> > Same as in patch 1.
> >
> >> Cc: Ben Skeggs <bskeggs at nvidia.com>
> >> Cc: Karol Herbst <kherbst at redhat.com>
> >> Cc: Lyude Paul <lyude at redhat.com>
> >> Cc: Danilo Krummrich <dakr at redhat.com>
> >> Cc: David Airlie <airlied at gmail.com>
> >> Signed-off-by: Zhi Wang <zhiw at nvidia.com>
> >> ---
> >> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++--
> >> 1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> index 736cde1987d0..49721935013b 100644
> >> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >> u64 *end;
> >> u64 csum = 0;
> >> int free, time = 1000000;
> >> - u32 wptr, size;
> >> + u32 wptr, size, step;
> >> u32 off = 0;
> >>
> >> argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE);
> >> @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >> cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum);
> >>
> >> wptr = *gsp->cmdq.wptr;
> >> +
> >
> > Please remove the addition of empty lines here...
> >
> >> do {
> >> do {
> >> free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1;
> >> if (free >= gsp->cmdq.cnt)
> >> free -= gsp->cmdq.cnt;
> >> +
> >
> > and here.
> >
> >> if (free >= 1)
> >> break;
> >>
> >> @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >> }
> >>
> >> cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000);
> >> - size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE);
> >> + step = min_t(u32, free, (gsp->cmdq.cnt - wptr));
> >> + size = min_t(u32, argc, step * GSP_PAGE_SIZE);
> >> +
> >> memcpy(cqe, (u8 *)cmd + off, size);
> >>
> >> wptr += DIV_ROUND_UP(size, 0x1000);
> >> --
> >> 2.34.1
> >>
>
More information about the Nouveau
mailing list