[Bug 204181] NULL pointer dereference regression in amdgpu

bugzilla-daemon at bugzilla.kernel.org bugzilla-daemon at bugzilla.kernel.org
Sun Sep 29 21:54:23 UTC 2019


https://bugzilla.kernel.org/show_bug.cgi?id=204181

--- Comment #59 from Sergey Kondakov (virtuousfox at gmail.com) ---
(In reply to Andrey Grodzovsky from comment #57)
> Sergey, instead of throwing tantrums why can't you just do what you are
> asked ? You present an extremely convoluted set of driver config params and
> demand from us resolving the bug with those parameters in place. This
> introduces unneeded complication of the failure scenario which in turn
> introduces a lot of unknowns. Alex asks you to simplify the settings so fewer
> unknowns are in the system, so it's easier for us to try and figure out what
> goes wrong while we inspect the code. 
> So please, bring the parameters back to default as this is the most well
> tested configuration and gives a baseline and also please provide addr2line
> for 0010:amdgpu_dm_atomic_commit_tail+0x2ee so we can get a better idea
> where in code the NULL ptr happened.

And how about, instead of knowingly pushing untested code with known fatal
errors, you stop taking QA notes from FGLRX in the first place and do your own
full testing? You do realize that I, like everyone else, paid your employer for
that card, right? And people don't buy your top cards, RX[4-5][7-8]0, VEGAs and
so on, to use them as expensive bare output controllers.

DO NOT SHOOT THE MESSENGER. What you ignore from me, others will run into one
way or another, and most of them will be incapable of even reporting it and
will just resort to cursing you and selling the hardware, going with an
Nvidia & Intel combo forever instead. Do you have any idea how many times in my
life I've heard the "at least it's hassle-free" spiel about all (yes, all) AMD
stuff from "normal people"?

I don't demand that you resolve this personally for me and whatever I might
configure. But I do demand that you not push untested code, hide it behind
parameters that limit all cards to the bare minimum, and then use that as an
excuse to keep not testing it. And then silently expect me to work as your QA,
as if I were trained in debugging kernel-level code and could telepathically
know what might be on your minds. What else, should I be expected to whip out a
chip programmer and write custom asm code for your mystery chips myself? I
don't have a laboratory or a dedicated debug station.

_Regarding this notion of "testing on defaults"_. Maybe I was not clear on
that: that #3 dereference happened just once, after about a full day of uptime.
The machine has sometimes run for more than a week straight without issues. So
defaulting will not show any difference on my end unless I run both configs for
no less than 2 weeks of pure uptime each without shutting down the machine. And
it would still be useless guesswork that will not produce any more pointers to
what exactly goes wrong; at best it will either repeat or not.

However, you, as the developers of that code and trained experts, can use what
little data there is to recheck the exact offending code that no one else has a
clue about. You can also fully reproduce my configuration (including the exact
packages of my kernel with debug info) and work with full data of your own,
since you are not willing to test all your codepaths regularly as a rule.

I will try to figure out what the hell this "addr2line" is, but it will
probably involve installing gigabytes of debug symbols on an SSD that has no
space for them, so… it will take a while.
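
From what I gather so far, resolving that offset is roughly the same gdb trick
I already used for #2, just pointed at a debug-info build of the amdgpu module;
where exactly the distro puts that file is a guess on my part:

  gdb <debug-info copy of amdgpu.ko, wherever the distro installs it>
  (gdb) list *(amdgpu_dm_atomic_commit_tail+0x2ee)

or, if a matching kernel source tree is around, supposedly its
scripts/faddr2line can take the symbol+offset directly:

  ./scripts/faddr2line /path/to/amdgpu.ko amdgpu_dm_atomic_commit_tail+0x2ee

Correct me if that is not what you are asking for.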

By the way, what happened with my answer about #2? You know, the `list
*(amdgpu_vm_update_directories+0xe7)` part, which was a real time-consuming
pain to get, with:
0x2e127 is in amdgpu_vm_update_directories
(../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1191).
where line #1191 is:
struct amdgpu_bo *bo = parent->base.bo, *pbo;

Have you even seen it? Was it the right thing? Any thoughts on the cause of
this one? Should I do the same for #3? Will it also go into a void of silence?

(In reply to Damian Nowak from comment #58)
> I encounter this error once a week on average on my Radeon 7 (Vega 20).
> Great to see you guys actively working on it. When 5.3.2 releases to Arch,
> I'll keep using it for a week or two and report back whether I encounter an
> issue again or not. Thanks! 
> 
> @Sergey You could revert to defaults just for the duration of
> testing/debugging. It'll sure make things easier for developers, and you can
> still go back to your settings once the issue is fixed. Great settings
> nonetheless, do these kernel parameters really improve the power performance
> of RX 580, or did you need to do something in addition too? By the way, I
> used RX 580 on default Arch Linux settings (so most likely kernel defaults)
> for a year and it was fine so you probably don't have to worry about frying
> it. Now I'm using Radeon 7, while RX 580 is still alive in a different
> Windows-based computer.

Ok, I can. But what's next? How exactly would that give any more data? What
exactly should I do after booting the machine?
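
For the record, I assume "back to defaults" means something like the following
and nothing more; the paths are my guesses, adjust per distro:

  # see which amdgpu options are set anywhere
  cat /proc/cmdline
  grep -r amdgpu /etc/modprobe.d/
  # values the driver actually loaded with
  grep -r "" /sys/module/amdgpu/parameters/

then drop the amdgpu.* entries from the kernel command line and modprobe.d,
regenerate the bootloader config, and reboot.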

Power? No, the custom hacked GPU BIOS does that. Although, after fiddling with
voltages, I just left them on auto-defaults, where the driver/firmware uses the
built-in per-card "chip quality" value as a multiplier for the defaults, and
limited the frequency to 1300 MHz. Power draw rises steeply with frequency
(roughly with frequency times voltage squared, and the voltage has to climb
with the clocks), and past 1300 MHz it gets ridiculous on the RX 580's 14 nm
chip. I also made the fans never stop and act more aggressively, but not to the
point of out-noising the case and CPU fans. And I tightened the memory timings
too. 90-120 W are the numbers from MSI Afterburner, mostly around 90 W and
rarely 120 W under some specific loads.

Pre-RX cards, the whole 2008-2015 generation of AMD GPU chips (and chipsets,
for that matter), especially the mobile ones, are well known to be
self-destructive. And not long ago my 6870 joined them. Ironically, the default
firmware settings on commercial GPUs are not safe, at least not in those
generations. They are balanced by the manufacturers to barely survive the
warranty period. That's why pre-overclocked cards, or any pre-overclocked
chips, are not a product anyone should be excited about. AMD chips are known as
"the stoves" for a reason, but device manufacturers take them from
"inefficient" to "half-dead". The price is good, though.

With the software parameters I mainly try to balance latency and CPU time,
remove sources of stuttering, get proper prioritization during CPU & I/O
contention, and enable features that can safely be enabled, so that when I run
my live test/install distro build on unknown hardware, I can test and/or use it
fully without redoing and customizing the whole damn thing. But it's more of a
guessing game with the GPU than with everything else. Unfortunately, developers
in general are not big fans of the "multi-task desktop user experience" on
last-gen ("last" being "older than the one in the laboratory") hardware.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

