[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer

Danilo Krummrich dakr at kernel.org
Mon Oct 14 15:31:48 UTC 2024


On Sun, Oct 13, 2024 at 06:27:32PM +0000, Zhi Wang wrote:
> On 04/10/2024 20.16, Danilo Krummrich wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> 
> Hey Danilo. I am just back from my vacation. Sorry for the delay. See my 
> comments below.

Hi Zhi. No worries, just so you know I may have a bit of delay in the coming
weeks too -- vacation, relocation...

> 
> > On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:
> >> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
> >> buffer when handling a large RPC request. When it sees at least one
> >> available page in the cmdq, it quits the waiting with the amount of
> >> free buffer pages in the queue.
> >>
> >> Unfortunately, it always takes the [write pointer, buf_size) as
> >> available buffer pages before rolling back and wrongly calculates the
> >> size of the data should be copied. Thus, it can overwrite the RPC
> >> request that GSP is currently reading, which causes GSP hang due
> >> to corrupted RPC request:
> >>
> >> [  549.209389] ------------[ cut here ]------------
> >> [  549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [  549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
> >> [  549.225752]  iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
> >> [  549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G            E      6.9.0-rc6+ #1
> >> [  549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
> >> [  549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [  549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
> >> [  549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
> >> [  549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
> >> [  549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
> >> [  549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
> >> [  549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
> >> [  549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
> >> [  549.406988] FS:  00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
> >> [  549.415076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [  549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
> >> [  549.427963] PKRU: 55555554
> >> [  549.430672] Call Trace:
> >> [  549.433126]  <TASK>
> >> [  549.435233]  ? __warn+0x7f/0x130
> >> [  549.438473]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [  549.443426]  ? report_bug+0x18a/0x1a0
> >> [  549.447098]  ? handle_bug+0x3c/0x70
> >> [  549.450589]  ? exc_invalid_op+0x14/0x70
> >> [  549.454430]  ? asm_exc_invalid_op+0x16/0x20
> >> [  549.458619]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> >> [  549.463565]  r535_gsp_msg_recv+0x46/0x230 [nvkm]
> >> [  549.468257]  r535_gsp_rpc_push+0x106/0x160 [nvkm]
> >> [  549.473033]  r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
> >> [  549.478422]  nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
> >> [  549.483899]  nvidia_grid_init+0xb1/0xd0 [nvkm]
> >> [  549.488420]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> [  549.493213]  nvkm_device_pci_probe+0x305/0x420 [nvkm]
> >> [  549.498338]  local_pci_probe+0x46/0xa0
> >> [  549.502096]  pci_call_probe+0x56/0x170
> >> [  549.505851]  pci_device_probe+0x79/0xf0
> >> [  549.509690]  ? driver_sysfs_add+0x59/0xc0
> >> [  549.513702]  really_probe+0xd9/0x380
> >> [  549.517282]  __driver_probe_device+0x78/0x150
> >> [  549.521640]  driver_probe_device+0x1e/0x90
> >> [  549.525746]  __driver_attach+0xd2/0x1c0
> >> [  549.529594]  ? __pfx___driver_attach+0x10/0x10
> >> [  549.534045]  bus_for_each_dev+0x78/0xd0
> >> [  549.537893]  bus_add_driver+0x112/0x210
> >> [  549.541750]  driver_register+0x5c/0x120
> >> [  549.545596]  ? __pfx_nvkm_init+0x10/0x10 [nvkm]
> >> [  549.550224]  do_one_initcall+0x44/0x300
> >> [  549.554063]  ? do_init_module+0x23/0x240
> >> [  549.557989]  do_init_module+0x64/0x240
> >>
> >> Calculate the available buffer page before rolling back based on
> >> the result from the waiting.
> > 
> > It looks like you hit this one while working on the VFIO stuff too. So, same
> > question here,
> 
> Yes. But theses bugs are not specific to vGPU because two-page GSP RPC 
> are part of valid RPC vehicle format of GSP RPC protocol family. The 
> fixes are for a better sophisticated GSP RPC handling in Nouveau. Other 
> GSP RPC can use this vehicle format as well.

Yeah, that's fine. I just try to figure out whether we need those patches in
-fixes, or if we can take them through -next.

> 
> can we hit this case with "vanilla nouveau"?
> 
> Not yet. But introducing new GSP RPCs that using this vehicle format 
> (related to vGPU/not-related to vGPU) in nouveau might hit this bug later.

Great, so we can take all three through -next, right?

> 
> Out of curiostiy, do we have any unit-test package or flows to test the 
> patches? Like CIs.

There's igt_gpu_tools [1], but unfortunately Nouveau has practically no test
cases.

> I am using the Phoronix test suite in the ubuntu with 
> a PPA repo that has latest mesa/drm userspace libraries.

I typically use Vulkan CTS [2] and some manual smoke testing for the KMS parts.

However, for the long term I definitely want to get to have proper (unit) tests
additionally.

For Nova we can easily use KUnit tests through examples in the code
documentation. Additionally, I could think of something analogous to
igt_gpu_tools I already played around with for Nova [3].

[1] https://gitlab.freedesktop.org/drm/igt-gpu-tools
[2] https://github.com/KhronosGroup/VK-GL-CTS
[3] https://gitlab.freedesktop.org/dakr/drm-test

> 
> It would be nice that I can align with others. :)
> 
> Thanks,
> Zhi.
> > 
> 
> >>
> >> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
> > 
> > Same as in patch 1.
> > 
> >> Cc: Ben Skeggs <bskeggs at nvidia.com>
> >> Cc: Karol Herbst <kherbst at redhat.com>
> >> Cc: Lyude Paul <lyude at redhat.com>
> >> Cc: Danilo Krummrich <dakr at redhat.com>
> >> Cc: David Airlie <airlied at gmail.com>
> >> Signed-off-by: Zhi Wang <zhiw at nvidia.com>
> >> ---
> >>   drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++--
> >>   1 file changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> index 736cde1987d0..49721935013b 100644
> >> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> >> @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >>        u64 *end;
> >>        u64 csum = 0;
> >>        int free, time = 1000000;
> >> -     u32 wptr, size;
> >> +     u32 wptr, size, step;
> >>        u32 off = 0;
> >>
> >>        argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE);
> >> @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >>        cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum);
> >>
> >>        wptr = *gsp->cmdq.wptr;
> >> +
> > 
> > Please remove the addition of empty lines here...
> > 
> >>        do {
> >>                do {
> >>                        free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1;
> >>                        if (free >= gsp->cmdq.cnt)
> >>                                free -= gsp->cmdq.cnt;
> >> +
> > 
> > and here.
> > 
> >>                        if (free >= 1)
> >>                                break;
> >>
> >> @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
> >>                }
> >>
> >>                cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000);
> >> -             size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE);
> >> +             step = min_t(u32, free, (gsp->cmdq.cnt - wptr));
> >> +             size = min_t(u32, argc, step * GSP_PAGE_SIZE);
> >> +
> >>                memcpy(cqe, (u8 *)cmd + off, size);
> >>
> >>                wptr += DIV_ROUND_UP(size, 0x1000);
> >> --
> >> 2.34.1
> >>
> 


More information about the Nouveau mailing list