[RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
Alex Deucher
alexdeucher at gmail.com
Tue May 2 18:41:25 UTC 2023
On Tue, May 2, 2023 at 11:22 AM Timur Kristóf <timur.kristof at gmail.com> wrote:
>
> On Tue, 2023-05-02 at 09:45 -0400, Alex Deucher wrote:
> > On Tue, May 2, 2023 at 9:35 AM Timur Kristóf
> > <timur.kristof at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Tue, 2023-05-02 at 13:14 +0200, Christian König wrote:
> > > > >
> > > > > > Christian König <christian.koenig at amd.com> wrote (on Tue,
> > > > > > May 2, 2023, 9:59):
> > > > >
> > > > > > > On 02.05.23 at 03:26, André Almeida wrote:
> > > > > > > > On 01/05/2023 16:24, Alex Deucher wrote:
> > > > > > >> On Mon, May 1, 2023 at 2:58 PM André Almeida
> > > > > > <andrealmeid at igalia.com>
> > > > > > >> wrote:
> > > > > > >>>
> > > > > > > >>> I know that devcoredump is also used for this kind of information,
> > > > > > > >>> but I believe that using an IOCTL is better for interfacing Mesa +
> > > > > > > >>> Linux than parsing a file whose contents are subject to change.
> > > > > > >>
> > > > > > >> Can you elaborate a bit on that? Isn't the whole point
> > > > > > of
> > > > > > devcoredump
> > > > > > >> to store this sort of information?
> > > > > > >>
> > > > > > >
> > > > > > > I think that devcoredump is something that you could use
> > > > > > to
> > > > > > submit to
> > > > > > > a bug report as it is, and then people can read/parse as
> > > > > > they
> > > > > > want,
> > > > > > > not as an interface to be read by Mesa... I'm not sure
> > > > > > that
> > > > > > it's
> > > > > > > something that I would call an API. But I might be wrong,
> > > > > > if
> > > > > > you know
> > > > > > > something that uses that as an API please share.
> > > > > > >
> > > > > > > > Anyway, relying on that for Mesa would mean that we would need to
> > > > > > > > ensure stability for the file content and format, making it less
> > > > > > > > flexible to modify in the future and prone to bugs, while the IOCTL is
> > > > > > > > well defined and extensible. Maybe the dump from Mesa + devcoredump
> > > > > > > > could be complementary information to a bug report.
> > > > > >
> > > > > > Neither using an IOCTL nor devcoredump is a good approach
> > > > > > for
> > > > > > this since
> > > > > > the values read from the hw register are completely
> > > > > > unreliable.
> > > > > > They
> > > > > > could not be available because of GFXOFF or they could be
> > > > > > overwritten or
> > > > > > not even updated by the CP in the first place because of a
> > > > > > hang
> > > > > > etc....
> > > > > >
> > > > > > If you want to track progress inside an IB what you do
> > > > > > instead
> > > > > > is to
> > > > > > insert intermediate fence write commands into the IB. E.g.
> > > > > > something
> > > > > > like write value X to location Y when this executes.
> > > > > >
> > > > > > > This way you can not only track how far the IB processed, but also in
> > > > > > > which stage of processing we were when the hang occurred, e.g. End of
> > > > > > > Pipe, End of Shaders, specific shader stages etc...
> > > > > >
> > > > > >
> > > > >
> > > > > Currently our biggest challenge in the userspace driver is
> > > > > debugging "random" GPU hangs. We have many dozens of bug
> > > > > reports
> > > > > from users which are like: "play the game for X hours and it
> > > > > will
> > > > > eventually hang the GPU". With the currently available tools,
> > > > > it is
> > > > > impossible for us to tackle these issues. André's proposal
> > > > > would be
> > > > > a step in improving this situation.
> > > > >
> > > > > We already do something like what you suggest, but there are
> > > > > multiple problems with that approach:
> > > > >
> > > > > > 1. we can only submit 1 command buffer at a time because we won't
> > > > > > know which IB hung
> > > > > > 2. we can't use chaining because we don't know where in the IB it
> > > > > > hung
> > > > > > 3. it needs userspace to insert (a lot of) extra commands such as
> > > > > > extra synchronization and memory writes
> > > > > > 4. it doesn't work when GPU recovery is enabled because the
> > > > > > information is already gone when we detect the hang
> > > > >
> > > > You can still submit multiple IBs and even chain them. All you
> > > > need
> > > > to do is to insert into each IB commands which write to an extra
> > > > memory location with the IB executed and the position inside the
> > > > IB.
> > > >
> > > > > The write data command allows you to write as many dw as you want
> > > > > (up to multiple kb). The only potential problem is when you submit
> > > > > the same IB multiple times.
> > > >
> > > > And yes that is of course quite some extra overhead, but I think
> > > > that should be manageable.
> > >
> > > Thanks, this sounds doable and would solve the limitation of how
> > > many
> > > IBs are submitted at a time. However it doesn't address the problem
> > > that enabling this sort of debugging will still have extra
> > > overhead.
> > >
> > > I don't mean the overhead from writing a couple of dwords for the
> > > trace, but rather, the overhead from needing to emit flushes or top
> > > of
> > > pipe events or whatever else we need so that we can tell which
> > > command
> > > hung the GPU.
> > >
> > > >
> > > > > In my opinion, the correct solution to those problems would be
> > > > > if
> > > > > the kernel could give userspace the necessary information about
> > > > > a
> > > > > GPU hang before a GPU reset.
> > > > >
> > > > The fundamental problem here is that the kernel doesn't have
> > > > that
> > > > information either. We know which IB timed out and can
> > > > potentially do
> > > > a devcoredump when that happens, but that's it.
> > >
> > >
> > > Is it really not possible to know such a fundamental thing as what
> > > the
> > > GPU was doing when it hung? How are we supposed to do any kind of
> > > debugging without knowing that?
> > >
> > > I wonder what AMD's Windows driver team is doing with this problem,
> > > surely they must have better tools to deal with GPU hangs?
> >
> > For better or worse, most teams internally rely on scan dumps via
> > JTAG
> > which sort of limits the usefulness outside of AMD, but also gives
> > you
> > the exact state of the hardware when it's hung so the hardware teams
> > prefer it.
> >
>
> How does this approach scale? It's not something we can ask users to
> do, and even if all of us in the radv team had a JTAG device, we
> wouldn't be able to play every game that users experience random hangs
> with.
It doesn't scale or lend itself particularly well to external
development, but that's the current state of affairs.
Alex
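
For readers following the thread, the marker idea Christian describes above
(each IB writes its own identity and its current position to an extra memory
location) can be illustrated with a small simulation. This is only a sketch
under stated assumptions, not driver or Mesa code: the Packet class, the
"draw"/"marker" kinds, build_ib() and execute() are hypothetical names
invented for the illustration, and the packets run strictly in order, which
real hardware does not guarantee. Getting that ordering on a real GPU is
exactly the extra synchronization overhead Timur raises.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    kind: str            # "draw" or "marker" (hypothetical names, not PM4 opcodes)
    ib_id: int           # which IB this packet belongs to
    position: int        # index of the draw inside that IB
    hangs: bool = False  # simulates a command that never completes


def build_ib(ib_id: int, n_draws: int, hang_at: Optional[int] = None) -> List[Packet]:
    """Build an IB in which every draw is followed by a marker write."""
    packets = []
    for pos in range(n_draws):
        packets.append(Packet("draw", ib_id, pos, hangs=(pos == hang_at)))
        packets.append(Packet("marker", ib_id, pos))
    return packets


def execute(ibs: List[List[Packet]], trace: List[int]) -> bool:
    """Run the IBs in order; `trace` stands in for the extra memory location.

    Returns False as soon as a hanging draw is reached, leaving `trace`
    holding the (ib_id, position) of the last marker that was written.
    """
    for ib in ibs:
        for pkt in ib:
            if pkt.kind == "draw" and pkt.hangs:
                return False
            if pkt.kind == "marker":
                trace[0] = pkt.ib_id
                trace[1] = pkt.position
    return True


if __name__ == "__main__":
    trace = [-1, -1]  # -1 means no marker has been written yet
    ibs = [build_ib(0, 3), build_ib(1, 4, hang_at=2)]
    completed = execute(ibs, trace)
    print("completed:", completed, "last marker (ib, position):", tuple(trace))
```

In this run the second IB hangs on its third draw, so the trace is left at
IB 1, position 1, and the culprit is the command right after that marker.
Because the marker only records the IB id and position, submitting the same
IB twice makes the trace ambiguous, which is the caveat Christian mentions.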