[PATCH 1/2] drm/amdgpu: add the IP information of the soc
Alex Deucher
alexdeucher at gmail.com
Thu Mar 14 17:51:42 UTC 2024
On Thu, Mar 14, 2024 at 12:16 PM Khatri, Sunil <sukhatri at amd.com> wrote:
>
>
> On 3/14/2024 8:12 PM, Alex Deucher wrote:
> > On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri at amd.com> wrote:
> >>
> >> On 3/14/2024 1:58 AM, Alex Deucher wrote:
> >>> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri at amd.com> wrote:
> >>>> Add all the IP's information on a SOC to the
> >>>> devcoredump.
> >>>>
> >>>> Signed-off-by: Sunil Khatri <sunil.khatri at amd.com>
> >>>> ---
> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++
> >>>> 1 file changed, 19 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> index a0dbccad2f53..611fdb90a1fc 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >>>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>>> coredump->reset_task_info.process_name,
> >>>> coredump->reset_task_info.pid);
> >>>>
> >>>> + /* GPU IP's information of the SOC */
> >>>> + if (coredump->adev) {
> >>>> + drm_printf(&p, "\nIP Information\n");
> >>>> + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family);
> >>>> + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id);
> >>>> +
> >>>> + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) {
> >>>> + struct amdgpu_ip_block *ip =
> >>>> + &coredump->adev->ip_blocks[i];
> >>>> + drm_printf(&p, "IP type: %d IP name: %s\n",
> >>>> + ip->version->type,
> >>>> + ip->version->funcs->name);
> >>>> + drm_printf(&p, "IP version: (%d,%d,%d)\n\n",
> >>>> + ip->version->major,
> >>>> + ip->version->minor,
> >>>> + ip->version->rev);
> >>>> + }
> >>>> + }
> >>> I think the IP discovery table would be more useful. Either walk the
> >>> adev->ip_versions structure, or just include the IP discovery binary.
> >> I did explore the adev->ip_versions and if i just go through the array
> >> it doesn't give any useful information directly.
> >> There are no ways to find directly from adev->ip_versions below things
> >> until i also reparse the discovery binary again like done the discovery
> >> amdgpu_discovery_reg_base_init and walk through the headers of various
> >> ips using discovery binary.
> >> a. Which IP is available on soc or not.
> >> b. How many instances are there
> >> Also i again have to change back to major, minor and rev convention for
> >> this information to be useful. I am exploring it more if i find some
> >> other information i will update.
> >>
> >> adev->ip_block[] is derived from ip discovery only for each block which
> >> is there on the SOC, so we are not reading information which isnt
> >> applicable for the soc. We have name , type and version no of the IPs
> >> available on the soc. If you want i could add no of instances of each IP
> >> too if you think that's useful information here. Could you share what
> >> information is missing in this approach so i can include that.
> > I was hoping to get the actual IP versions for the IPs from IP
> > discovery rather than the versions from the ip_block array. The
> > latter are common so you can end up with the same version used across
> > a wide variety of chips (e.g., all gfx10.x based chips use the same
> > gfx 10 IP code even if the actual IP version is different for most of
> > the chips).
> Got it. let me check how to get it could be done rightly.
> >
> >> For dumping the IP discovery binary, i dont understand how that
> >> information would be useful directly and needs to be decoded like we are
> >> doing in discovery init. Please correct me if my understanding is wrong
> >> here.
> > It's probably not a high priority, I was just thinking it might be
> > useful to have in case there ended up being some problem related to
> > the IP discovery table on some boards. E.g., we'd know that all
> > boards with a certain harvest config seem to align with a reported
> > problem. Similar for vbios. It's more for telemetry. E.g., all the
> > boards reporting some problem have a particular powerplay config or
> > whatever.
> I got it.
> But two points of contention here in my understanding. The dump works
> only where there is reset and not sure if it could be used very early in
> development of not. Second point is that devcoredump is 4096
> bytes/4Kbyte of memory where we are dumping all the information. Not
> sure if that could be increased but it might not be enough if we are
> planning to dump all to it.
ah, ok. Let's skip the IP versions in that case, we can use the
family and rev_id and external_rev_id to look up the IP versions.
Alex
>
> Another point is since we have sysfs/debugfs/info ioctl etc information
> available. We should sort out what really is helpful in debugging GPU
> hang and that's added in devcore.
>
> Regards
> Sunil
>
> >
> > Alex
> >
> >
> >>> Alex
> >>>
> >>>> +
> >>>> if (coredump->ring) {
> >>>> drm_printf(&p, "\nRing timed out details\n");
> >>>> drm_printf(&p, "IP Type: %d Ring Name: %s\n",
> >>>> --
> >>>> 2.34.1
> >>>>
More information about the amd-gfx
mailing list