Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

Yann Dirson ydirson at free.fr
Thu Jan 6 15:38:02 UTC 2022


Alex wrote:
> > How is the stolen memory communicated to the driver ?  That host
> > physical
> > memory probably has to be mapped at the same guest physical address
> > for
> > the magic to work, right ?
> 
> Correct.  The driver reads the physical location of that memory from
> hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> force the driver to use the BAR,

That would only be a workaround for a missing mapping of the stolen
memory into the guest, right?


> but I'm not sure if there are any
> other places in the driver that make assumptions about using the
> physical host address or not on APUs off hand.

gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset from
the same value.  I'm not sure I understand why there would be no reason
to use the BAR here, when there are reasons to do so in
gmc_v9_0_mc_init().
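
For reference, the assignment I'm looking at is essentially this
(paraphrased from my 5.15 tree, so possibly not verbatim):

    /* gmc_v9_0_vram_gtt_location(), roughly: vram_base_offset is
     * derived from the same get_mc_fb_offset() value, plus the XGMI
     * node offset. */
    adev->vm_manager.vram_base_offset =
            adev->gfxhub.funcs->get_mc_fb_offset(adev);
    adev->vm_manager.vram_base_offset +=
            adev->gmc.xgmi.physical_node_id *
            adev->gmc.xgmi.node_segment_size;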

vram_base_offset then gets used in several places:

* amdgpu_gmc_init_pdb0: that one seems likely to be problematic,
  right?
  As a side note, the XGMI offset added earlier gets subtracted
  here to deduce the VRAM base address
  (a couple of new acronyms there: PDB, PDE -- page directory base/entry?)

* amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem
  just as problematic

* amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could stand
  for "memory controller", but then "MC address of buffer" makes me
  doubt that (see the snippet below)


> 
>         if ((adev->flags & AMD_IS_APU) ||
>             (adev->gmc.xgmi.supported &&
>              adev->gmc.xgmi.connected_to_cpu)) {
>                 adev->gmc.aper_base =
>                         adev->gfxhub.funcs->get_mc_fb_offset(adev) +
>                         adev->gmc.xgmi.physical_node_id *
>                         adev->gmc.xgmi.node_segment_size;
>                 adev->gmc.aper_size = adev->gmc.real_vram_size;
>         }


Now for the test... it does indeed seem to go much further: I even
lose dom0's efifb to that black screen, which hopefully means the
driver started to set up the hardware.  I will probably still have to
hunt down whether dom0 still tries to use efifb afterwards (I can't
see why it would not, TBH, given the previous behaviour where it kept
using it after the guest failed to start).

The log shows many details about TMR loading.

Then as expected:

[2022-01-06 15:16:09] <6>[    5.844589] amdgpu 0000:00:05.0: amdgpu: RAP: optional rap ta ucode is not available
[2022-01-06 15:16:09] <6>[    5.844619] amdgpu 0000:00:05.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[2022-01-06 15:16:09] <7>[    5.844639] [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP block <smu>...
[2022-01-06 15:16:09] <6>[    5.845515] amdgpu 0000:00:05.0: amdgpu: SMU is initialized successfully!


I'm not sure about that unhandled interrupt (and a bit worried about
the garbled, interleaved log lines):

[2022-01-06 15:16:09] <7>[    6.010681] amdgpu 0000:00:05.0: [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on sdma0 succeeded
[2022-01-06 15:16:10] <7>[    6.010831] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 0, wptr 32
[2022-01-06 15:16:10] <7>[    6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] Unhandled interrupt src_id: 243


Then comes a first error:

[2022-01-06 15:16:10] <6>[    6.011785] [drm] Display Core initialized with v3.2.149!
[2022-01-06 15:16:10] <6>[    6.012714] [drm] DMUB hardware initialized: version=0x0101001C
[2022-01-06 15:16:10] <3>[    6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[    6.229125] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: freesync_module init done 0000000076c7b459.
[2022-01-06 15:16:10] <7>[    6.229677] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: hdcp_workqueue init done 0000000087e28b47.
[2022-01-06 15:16:10] <7>[    6.229979] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu_dm_connector_init()

... which we can see again several times later, though the driver still manages to finish its init:

[2022-01-06 15:16:10] <6>[    6.615615] [drm] late_init of IP block <smu>...
[2022-01-06 15:16:10] <6>[    6.615772] [drm] late_init of IP block <gfx_v9_0>...
[2022-01-06 15:16:10] <6>[    6.615801] [drm] late_init of IP block <sdma_v4_0>...
[2022-01-06 15:16:10] <6>[    6.615827] [drm] late_init of IP block <dm>...
[2022-01-06 15:16:10] <3>[    6.801790] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[    6.806079] [drm:drm_minor_register [drm]] 
[2022-01-06 15:16:10] <7>[    6.806195] [drm:drm_minor_register [drm]] new minor registered 128
[2022-01-06 15:16:10] <7>[    6.806223] [drm:drm_minor_register [drm]] 
[2022-01-06 15:16:10] <7>[    6.806289] [drm:drm_minor_register [drm]] new minor registered 0
[2022-01-06 15:16:10] <7>[    6.806355] [drm:drm_sysfs_connector_add [drm]] adding "eDP-1" to sysfs
[2022-01-06 15:16:10] <7>[    6.806424] [drm:drm_dp_aux_register_devnode [drm_kms_helper]] drm_dp_aux_dev: aux [AMDGPU DM aux hw bus 0] registered as minor 0
[2022-01-06 15:16:10] <7>[    6.806498] [drm:drm_sysfs_hotplug_event [drm]] generating hotplug event
[2022-01-06 15:16:10] <6>[    6.806533] [drm] Initialized amdgpu 3.42.0 20150101 for 0000:00:05.0 on minor 0


At one point, though, a new problem shows up: it seems to have trouble driving the CRTC in the end:

[2022-01-06 15:16:25] <7>[   11.140807] amdgpu 0000:00:05.0: [drm:drm_vblank_enable [drm]] enabling vblank on crtc 0, ret: 0
[2022-01-06 15:16:25] <3>[   11.329306] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:25] <3>[   11.524327] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:25] <4>[   11.641814] [drm] Fence fallback timer expired on ring comp_1.3.0
[2022-01-06 15:16:25] <7>[   11.641877] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on comp_1.3.0 succeeded
[2022-01-06 15:16:25] <4>[   12.145804] [drm] Fence fallback timer expired on ring comp_1.0.1
[2022-01-06 15:16:25] <7>[   12.145862] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on comp_1.0.1 succeeded
[2022-01-06 15:16:25] <4>[   12.649771] [drm] Fence fallback timer expired on ring comp_1.1.1
[2022-01-06 15:16:25] <7>[   12.649789] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on comp_1.1.1 succeeded
[2022-01-06 15:16:25] <4>[   13.153815] [drm] Fence fallback timer expired on ring comp_1.2.1
[2022-01-06 15:16:25] <7>[   13.153836] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on comp_1.2.1 succeeded
[2022-01-06 15:16:25] <4>[   13.657756] [drm] Fence fallback timer expired on ring comp_1.3.1
[2022-01-06 15:16:25] <7>[   13.657767] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on comp_1.3.1 succeeded
[2022-01-06 15:16:25] <7>[   13.657899] [drm:sdma_v4_0_ring_set_wptr [amdgpu]] Setting write pointer
[2022-01-06 15:16:25] <7>[   13.658008] [drm:sdma_v4_0_ring_set_wptr [amdgpu]] Using doorbell -- wptr_offs == 0x00000198 lower_32_bits(ring->wptr) << 2 == 0x00000100 upper_32_bits(ring->wptr) << 2 == 0x00000000
[2022-01-06 15:16:25] <7>[   13.658114] [drm:sdma_v4_0_ring_set_wptr [amdgpu]] calling WDOORBELL64(0x000001e0, 0x0000000000000100)
[2022-01-06 15:16:25] <4>[   14.161792] [drm] Fence fallback timer expired on ring sdma0
[2022-01-06 15:16:25] <7>[   14.161811] amdgpu 0000:00:05.0: [drm:amdgpu_ib_ring_tests [amdgpu]] ib test on sdma0 succeeded
[2022-01-06 15:16:25] <3>[   21.609821] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:67:crtc-0] flip_done timed out


No visible change if I boot with efifb:off (aside from having to enter
the LUKS passphrase in the dark, that is).


I also tried patching gmc_v9_0_vram_gtt_location() to use the BAR [2],
but that turns out to work even less well:

[2022-01-06 16:27:48] <6>[    6.230166] amdgpu 0000:00:05.0: amdgpu: SMU is initialized successfully!
[2022-01-06 16:27:48] <7>[    6.230168] [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP block <gfx_v9_0>...
[2022-01-06 16:27:48] <6>[    6.231948] [drm] kiq ring mec 2 pipe 1 q 0
[2022-01-06 16:27:48] <7>[    6.231861] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 448, wptr 512
[2022-01-06 16:27:48] <7>[    6.231962] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] kiq alloc'd 64
[2022-01-06 16:27:48] <7>[    6.232172] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] kiq size init: 256
[2022-01-06 16:27:48] <7>[    6.232344] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] kiq size after set_res: 248
[2022-01-06 16:27:48] <7>[    6.232530] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] kiq size after map_q: 192
[2022-01-06 16:27:48] <7>[    6.232725] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 512, wptr 544
[2022-01-06 16:27:48] <3>[    6.429974] amdgpu 0000:00:05.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[2022-01-06 16:27:48] <7>[    6.430167] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] kiq size after test: 0
[2022-01-06 16:27:48] <3>[    6.430353] [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] *ERROR* KCQ enable failed
[2022-01-06 16:27:48] <3>[    6.430532] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v9_0> failed -110
[2022-01-06 16:27:48] <3>[    6.430720] amdgpu 0000:00:05.0: amdgpu: amdgpu_device_ip_init failed




As a side note, my warning about ring_alloc() being called twice
without committing or undoing [1] gets triggered.  Given the call
chain, it looks like this would happen during the previous use of that
ring; I would have to dig deeper to understand that.  Unless I'm
missing something and this is actually legal?
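
Schematically, the check I mean is of this shape (hypothetical sketch
only, not the actual diff in [1]; "alloc_pending" is a made-up field
name):

    /* warn when a new allocation starts while the previous one on this
     * ring was neither committed nor undone */
    int amdgpu_ring_alloc(struct amdgpu_ring *ring, unsigned int ndw)
    {
            WARN_ON(ring->alloc_pending);   /* double alloc? */
            ring->alloc_pending = true;
            /* ... existing allocation logic ... */
            return 0;
    }

    /* ...with amdgpu_ring_commit() and amdgpu_ring_undo() clearing
     * ring->alloc_pending again. */

Here is the resulting warning: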

[2022-01-06 15:52:17] <4>[    5.929158] ------------[ cut here ]------------
[2022-01-06 15:52:17] <4>[    5.929170] WARNING: CPU: 1 PID: 458 at drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c:74 amdgpu_ring_alloc+0x62/0x70 [amdgpu]
[2022-01-06 15:52:17] <4>[    5.929323] Modules linked in: ip6table_filter ip6table_mangle joydev ip6table_raw ip6_tables ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack iptable_filter iptable_mangle iptable_raw xt_MASQUERADE iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel amdgpu(+) iommu_v2 gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper ehci_pci cec pcspkr ehci_hcd i2c_piix4 serio_raw ata_generic pata_acpi xen_scsiback target_core_mod xen_netback xen_privcmd xen_gntdev xen_gntalloc xen_blkback fuse drm xen_evtchn bpf_preload ip_tables overlay xen_blkfront
[2022-01-06 15:52:17] <4>[    5.929458] CPU: 1 PID: 458 Comm: sdma0 Not tainted 5.15.4-1.fc32.qubes.x86_64+ #8
[2022-01-06 15:52:17] <4>[    5.929474] Hardware name: Xen HVM domU, BIOS 4.14.3 01/03/2022
[2022-01-06 15:52:17] <4>[    5.929487] RIP: 0010:amdgpu_ring_alloc+0x62/0x70 [amdgpu]
[2022-01-06 15:52:17] <4>[    5.929628] Code: 87 28 02 00 00 48 8b 82 b8 00 00 00 48 85 c0 74 05 e8 b2 ae 90 ee 44 89 e0 41 5c c3 0f 0b 41 bc f4 ff ff ff 44 89 e0 41 5c c3 <0f> 0b 48 8b 57 08 eb bc 66 0f 1f 44 00 00 0f 1f 44 00 00 85 f6 0f
[2022-01-06 15:52:17] <4>[    5.929667] RSP: 0018:ffffb129005f3dd8 EFLAGS: 00010206
[2022-01-06 15:52:17] <4>[    5.929678] RAX: 0000000000000060 RBX: ffff96209112d230 RCX: 0000000000000050
[2022-01-06 15:52:17] <4>[    5.929693] RDX: ffffffffc0ac6c60 RSI: 000000000000006d RDI: ffff96208c5eb8f8
[2022-01-06 15:52:17] <4>[    5.929707] RBP: ffff96209112d000 R08: ffffb129005f3e50 R09: ffff96208c5eba98
[2022-01-06 15:52:17] <4>[    5.929722] R10: 0000000000000000 R11: 0000000000000001 R12: ffff962090a0c780
[2022-01-06 15:52:17] <4>[    5.929736] R13: 0000000000000001 R14: ffff96208c5eb8f8 R15: ffff96208c5eb970
[2022-01-06 15:52:17] <4>[    5.929752] FS:  0000000000000000(0000) GS:ffff9620bcd00000(0000) knlGS:0000000000000000
[2022-01-06 15:52:17] <4>[    5.929768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2022-01-06 15:52:17] <4>[    5.929781] CR2: 00007c1130d0f860 CR3: 00000000040c4000 CR4: 0000000000350ee0
[2022-01-06 15:52:17] <4>[    5.929797] Call Trace:
[2022-01-06 15:52:17] <4>[    5.929805]  <TASK>
[2022-01-06 15:52:17] <4>[    5.929812]  amdgpu_ib_schedule+0xa9/0x540 [amdgpu]
[2022-01-06 15:52:17] <4>[    5.929956]  ? _raw_spin_unlock_irqrestore+0xa/0x20
[2022-01-06 15:52:17] <4>[    5.929969]  amdgpu_job_run+0xce/0x1f0 [amdgpu]
[2022-01-06 15:52:17] <4>[    5.930131]  drm_sched_main+0x300/0x500 [gpu_sched]
[2022-01-06 15:52:17] <4>[    5.930146]  ? finish_wait+0x80/0x80
[2022-01-06 15:52:17] <4>[    5.930156]  ? drm_sched_rq_select_entity+0xa0/0xa0 [gpu_sched]
[2022-01-06 15:52:17] <4>[    5.930171]  kthread+0x127/0x150
[2022-01-06 15:52:17] <4>[    5.930181]  ? set_kthread_struct+0x40/0x40
[2022-01-06 15:52:17] <4>[    5.930192]  ret_from_fork+0x22/0x30
[2022-01-06 15:52:17] <4>[    5.930203]  </TASK>
[2022-01-06 15:52:17] <4>[    5.930208] ---[ end trace cf0edb400b0116c7 ]---


[1] https://github.com/ydirson/linux/commit/4a010943e74d6bf621bd9e72a7620a65af23ecc9
[2] https://github.com/ydirson/linux/commit/e90230e008ce204d822f07e36b3c3e196d561c28

> 
> 
> 
> >
> > > > >
> > > > > > ... which brings me to a point that's been puzzling me for
> > > > > > some
> > > > > > time, which is
> > > > > > that as the hw init fails, the efifb driver is still using
> > > > > > the
> > > > > > framebuffer.
> > > > >
> > > > > No, it isn't. You are probably just still seeing the same
> > > > > screen.
> > > > >
> > > > > The issue is most likely that while efi was kicked out nobody
> > > > > re-programmed the display hardware to show something
> > > > > different.
> > > > >
> > > > > > Am I right in suspecting that efifb should get stripped of
> > > > > > its
> > > > > > ownership of the
> > > > > > fb aperture first, and that if I don't get a black screen
> > > > > > on
> > > > > > hw_init failure
> > > > > > that issue should be the first focus point ?
> > > > >
> > > > > You assumption with the black screen is incorrect. Since the
> > > > > hardware
> > > > > works independent even if you kick out efi you still have the
> > > > > same
> > > > > screen content, you just can't update it anymore.
> > > >
> > > > It's not only that the screen keeps its contents, it's that the
> > > > dom0
> > > > happily continues updating it.
> > >
> > > If the hypervisor is using efifb, then yes that could be a problem
> > > as
> > > the hypervisor could be writing to the efifb resources which ends
> > > up
> > > writing to the same physical memory.  That applies to any GPU on
> > > a
> > > UEFI system.  You'll need to make sure efifb is not in use in the
> > > hypervisor.

> >
> > That remark evokes several things to me.  First one is that every
> > time
> > I've tried booting with efifb disabled in dom0, there was no
> > visible
> > improvements in the guest driver - i.e. I really have to dig how
> > vram mapping
> > is performed and check things are as expected anyway.
> 
> Ultimately you end up at the same physical memory.  efifb uses the
> PCI
> BAR which points to the same physical memory that the driver directly
> maps.
> 
> >
> > The other is that, when dom0 cannot use efifb, entering a luks key
> > is
> > suddenly less user-friendly.  But in theory I'd think we could
> > overcome
> > this by letting dom0 use efifb until ready to start the guest, a
> > simple
> > driver unbind at the right moment should be expected to work, right
> > ?
> > Going further and allowing the guest to use efifb on its own could
> > possibly be more tricky (starting with a different state?) but does
> > not seem to sound completely outlandish either - or does it ?
> >
> 
> efifb just takes whatever hardware state the GOP driver in the pre-OS
> environment left the GPU in.  Once you have a driver loaded in the
> OS,
> > that state is gone so I don't see much value in using efifb once
> you
> have a real driver in the mix.  If you want a console on the host,
> it's probably better to use 2 GPU or just load the real driver as
> needed in both the host and guest.
> 
> > >
> > > Alex
> > >
> > >
> > > >
> > > > > But putting efi aside what Alex pointed out pretty much
> > > > > breaks
> > > > > your
> > > > > neck trying to forward the device. You maybe could try to
> > > > > hack
> > > > > the
> > > > > driver to use the PCIe BAR for framebuffer access, but that
> > > > > might
> > > > > be
> > > > > quite a bit slower.
> > > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > > >
> > > > > >> Alex
> > > > > >>
> > > > > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> > > > > >> <alexdeucher at gmail.com>
> > > > > >> wrote:
> > > > > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson
> > > > > >>> <ydirson at free.fr>
> > > > > >>> wrote:
> > > > > >>>> Alex wrote:
> > > > > >>>>> On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson
> > > > > >>>>> <ydirson at free.fr>
> > > > > >>>>> wrote:
> > > > > >>>>>> Hi Alex,
> > > > > >>>>>>
> > > > > >>>>>>> We have not validated virtualization of our
> > > > > >>>>>>> integrated
> > > > > >>>>>>> GPUs.  I
> > > > > >>>>>>> don't
> > > > > >>>>>>> know that it will work at all.  We had done a bit of
> > > > > >>>>>>> testing but
> > > > > >>>>>>> ran
> > > > > >>>>>>> into the same issues with the PSP, but never had a
> > > > > >>>>>>> chance
> > > > > >>>>>>> to
> > > > > >>>>>>> debug
> > > > > >>>>>>> further because this feature is not productized.
> > > > > >>>>>> ...
> > > > > >>>>>>> You need a functional PSP to get the GPU driver up
> > > > > >>>>>>> and
> > > > > >>>>>>> running.
> > > > > >>>>>> Ah, thanks for the hint :)
> > > > > >>>>>>
> > > > > >>>>>> I guess that if I want to have any chance to get the
> > > > > >>>>>> PSP
> > > > > >>>>>> working
> > > > > >>>>>> I'm
> > > > > >>>>>> going to need more details on it.  A quick search some
> > > > > >>>>>> time
> > > > > >>>>>> ago
> > > > > >>>>>> mostly
> > > > > >>>>>> brought reverse-engineering work, rather than official
> > > > > >>>>>> AMD
> > > > > >>>>>> doc.
> > > > > >>>>>>   Are
> > > > > >>>>>> there some AMD resources I missed ?
> > > > > >>>>> The driver code is pretty much it.
> > > > > >>>> Let's try to shed some more light on how things work,
> > > > > >>>> taking
> > > > > >>>> as
> > > > > >>>> excuse
> > > > > >>>> psp_v12_0_ring_create().
> > > > > >>>>
> > > > > >>>> First, register access through [RW]REG32_SOC15() is
> > > > > >>>> implemented
> > > > > >>>> in
> > > > > >>>> terms of __[RW]REG32_SOC15_RLC__(), which is basically a
> > > > > >>>> [RW]REG32(),
> > > > > >>>> except it has to be more complex in the SR-IOV case.
> > > > > >>>> Has the RLC anything to do with SR-IOV ?
> > > > > >>> When running the driver on a SR-IOV virtual function
> > > > > >>> (VF),
> > > > > >>> some
> > > > > >>> registers are not available directly via the VF's MMIO
> > > > > >>> aperture
> > > > > >>> so
> > > > > >>> they need to go through the RLC.  For bare metal or
> > > > > >>> passthrough
> > > > > >>> this
> > > > > >>> is not relevant.
> > > > > >>>
> > > > > >>>> It accesses registers in the MMIO range of the MP0 IP,
> > > > > >>>> and
> > > > > >>>> the
> > > > > >>>> "MP0"
> > > > > >>>> name correlates highly with MMIO accesses in
> > > > > >>>> PSP-handling
> > > > > >>>> code.
> > > > > >>>> Is "MP0" another name for PSP (and "MP1" for SMU) ?  The
> > > > > >>>> MP0
> > > > > >>>> version
> > > > > >>> Yes.
> > > > > >>>
> > > > > >>>> reported at v11.0.3 by discovery seems to contradict the
> > > > > >>>> use
> > > > > >>>> of
> > > > > >>>> v12.0
> > > > > >>>> for RENOIR as set by soc15_set_ip_blocks(), or do I miss
> > > > > >>>> something ?
> > > > > >>> Typo in the ip discovery table on renoir.
> > > > > >>>
> > > > > >>>> More generally (and mostly out of curiosity while we're
> > > > > >>>> at
> > > > > >>>> it),
> > > > > >>>> do we
> > > > > >>>> have a way to match IPs listed at discovery time with
> > > > > >>>> the
> > > > > >>>> ones
> > > > > >>>> used
> > > > > >>>> in the driver ?
> > > > > >>> In general, barring typos, the code is shared at the
> > > > > >>> major
> > > > > >>> version
> > > > > >>> level.  The actual code may or may not need changes to
> > > > > >>> handle
> > > > > >>> minor
> > > > > >>> revision changes in an IP.  The driver maps the IP
> > > > > >>> versions
> > > > > >>> from
> > > > > >>> the
> > > > > >>> ip discovery table to the code contained in the driver.
> > > > > >>>
> > > > > >>>> ---
> > > > > >>>>
> > > > > >>>> As for the register names, maybe we could have a short
> > > > > >>>> explanation of
> > > > > >>>> how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that
> > > > > >>>> seems
> > > > > >>>> to
> > > > > >>>> be
> > > > > >>>> a MMIO register named "C2PMSG_69" in the "MP0" IP, but
> > > > > >>>> I'm
> > > > > >>>> not
> > > > > >>>> sure
> > > > > >>>> of the "SMN" part -- that could refer to the "System
> > > > > >>>> Management
> > > > > >>>> Network",
> > > > > >>>> described in [0] as an internal bus.  Are we accessing
> > > > > >>>> this
> > > > > >>>> register
> > > > > >>>> through this SMN ?
> > > > > >>> These registers are just mailboxes for the PSP firmware.
> > > > > >>>  All
> > > > > >>> of
> > > > > >>> the
> > > > > >>> C2PMSG registers functionality is defined by the PSP
> > > > > >>> firmware.
> > > > > >>>   They
> > > > > >>> are basically scratch registers used to communicate
> > > > > >>> between
> > > > > >>> the
> > > > > >>> driver
> > > > > >>> and the PSP firmware.
> > > > > >>>
> > > > > >>>>
> > > > > >>>>>   On APUs, the PSP is shared with
> > > > > >>>>> the CPU and the rest of the platform.  The GPU driver
> > > > > >>>>> just
> > > > > >>>>> interacts
> > > > > >>>>> with it for a few specific tasks:
> > > > > >>>>> 1. Loading Trusted Applications (e.g., trusted firmware
> > > > > >>>>> applications
> > > > > >>>>> that run on the PSP for specific functionality, e.g.,
> > > > > >>>>> HDCP
> > > > > >>>>> and
> > > > > >>>>> content
> > > > > >>>>> protection, etc.)
> > > > > >>>>> 2. Validating and loading firmware for other engines on
> > > > > >>>>> the
> > > > > >>>>> SoC.
> > > > > >>>>>   This
> > > > > >>>>> is required to use those engines.
> > > > > >>>> Trying to understand in more details how we start the
> > > > > >>>> PSP
> > > > > >>>> up, I
> > > > > >>>> noticed
> > > > > >>>> that psp_v12_0 has support for loading a sOS firmware,
> > > > > >>>> but
> > > > > >>>> never
> > > > > >>>> calls
> > > > > >>>> init_sos_microcode() - and anyway there is no sos
> > > > > >>>> firmware
> > > > > >>>> for
> > > > > >>>> renoir
> > > > > >>>> and green_sardine, which seem to be the only ASICs with
> > > > > >>>> this
> > > > > >>>> PSP
> > > > > >>>> version.
> > > > > >>>> Is it something that's just not been completely wired up
> > > > > >>>> yet
> > > > > >>>> ?
> > > > > >>> On APUs, the PSP is shared with the CPU so the PSP
> > > > > >>> firmware
> > > > > >>> is
> > > > > >>> part
> > > > > >>> of
> > > > > >>> the sbios image.  The driver doesn't load it.  We only
> > > > > >>> load
> > > > > >>> it on
> > > > > >>> dGPUs where the driver is responsible for the chip
> > > > > >>> initialization.
> > > > > >>>
> > > > > >>>> That also rings a bell, that we have nothing about
> > > > > >>>> Secure OS
> > > > > >>>> in
> > > > > >>>> the doc
> > > > > >>>> yet (not even the acronym in the glossary).
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>> I'm not too familiar with the PSP's path to memory from
> > > > > >>>>> the
> > > > > >>>>> GPU
> > > > > >>>>> perspective.  IIRC, most memory used by the PSP goes
> > > > > >>>>> through
> > > > > >>>>> carve
> > > > > >>>>> out
> > > > > >>>>> "vram" on APUs so it should work, but I would double
> > > > > >>>>> check
> > > > > >>>>> if
> > > > > >>>>> there
> > > > > >>>>> are any system memory allocations that used to interact
> > > > > >>>>> with
> > > > > >>>>> the PSP
> > > > > >>>>> and see if changing them to vram helps.  It does work
> > > > > >>>>> with
> > > > > >>>>> the
> > > > > >>>>> IOMMU
> > > > > >>>>> enabled on bare metal, so it should work in passthrough
> > > > > >>>>> as
> > > > > >>>>> well
> > > > > >>>>> in
> > > > > >>>>> theory.
> > > > > >>>> I can see a single case in the PSP code where GTT is
> > > > > >>>> used
> > > > > >>>> instead
> > > > > >>>> of
> > > > > >>>> vram: to create fw_pri_bo when SR-IOV is not used (and
> > > > > >>>> there
> > > > > >>>> has
> > > > > >>>> to be a reason, since the SR-IOV code path does use
> > > > > >>>> vram).
> > > > > >>>> Changing it to vram does not make a difference, but then
> > > > > >>>> the
> > > > > >>>> only bo that seems to be used at that point is the one
> > > > > >>>> for
> > > > > >>>> the
> > > > > >>>> psp ring,
> > > > > >>>> which is allocated in vram, so I'm not too much
> > > > > >>>> surprised.
> > > > > >>>>
> > > > > >>>> Maybe I should double-check bo_create calls to hunt for
> > > > > >>>> more
> > > > > >>>> ?
> > > > > >>> We looked into this a bit ourselves and ran into the same
> > > > > >>> issues.
> > > > > >>> We'd probably need to debug this with the PSP team to
> > > > > >>> make
> > > > > >>> further
> > > > > >>> progress, but this was not productized so neither team
> > > > > >>> had
> > > > > >>> the
> > > > > >>> resources to delve further.
> > > > > >>>
> > > > > >>> Alex
> > > > > >>>
> > > > > >>>>
> > > > > >>>> [0]
> > > > > >>>> https://github.com/PSPReverse/psp-docs/blob/master/masterthesis-eichner-psp-2020.pdf
> > > > >
> > > > >
> > >
> 

