[Bug 204181] NULL pointer dereference regression in amdgpu

bugzilla-daemon at bugzilla.kernel.org
Fri Sep 27 20:18:47 UTC 2019


--- Comment #56 from Sergey Kondakov (virtuousfox at gmail.com) ---
(In reply to Alex Deucher from comment #54)
> (In reply to Sergey Kondakov from comment #53)
> > Or any of these ?
> > options amdgpu cik_support=1 si_support=1 msi=1 disp_priority=2 dpm=1
> > runpm=1 sched_policy=1 compute_multipipe=1 vm_fragment_size=9 gartsize=1024
> > max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024
> > job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=0
> > deep_color=1 gpu_recovery=1 lockup_timeout=2500,5000,8000,1000 ras_enable=1
> > mcbp=1 queue_preemption_timeout_ms=48 mes=1 hws_gws_support=1 discovery=1
> remove all of those.  You should use the defaults unless you are
> specifically debugging something.

Then you may consider that I am "specifically debugging" THIS. When I ask
these questions here or on freedesktop.org, I specifically hope for a factual
response from people with actual understanding and experience of how it works
and of what a proper way to debug without guesswork would be, based on
knowledge that would compensate for the lack of meaningful documentation and
one of the highest entry barriers in software (even a corporate monstrosity
like Intel still can't figure out GPUs, a market dominated by 2 oligopolists
that run it with impunity however they feel like, after all). This third
dereference would be really hard to debug, though, because there are no clear
reproduction steps, UNLESS you KNOW where and how to look as a developer. Or
are you all just going to ignore the presence of kernel-crashing code because
it "may" (or may not) not be triggered by your defaults?

So, can you actually tell which code path may result in this or, better yet,
test it yourself so that things like this do not go into releases?
The original dereference is triggered by the mere presence of PageFlip, which
is on by default, so blindly running developer defaults (you can see what
exactly I think about them here:
https://bugzilla.kernel.org/show_bug.cgi?id=203703#c9
and c11) didn't help anyone much, now did it?

Or can you at least explain what exactly each of these options does, what
the desired and undesired consequences may be, and how your consensus about
defaults came to be? A short summary (but not as short as modinfo) or links to
mailing list discussions, maybe? Because my goals (as they are for any desktop
user) are: minimal guaranteed latency (meaning full aggressive preemption, the
lowest scheduling granularity and strict RT priorities) of the
audio/video/input/network pipelines under stress load, in that specific order
of priority, with working fast fail-over or recovery instead of hangs and
reboots.
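
As a rough aid for discussing options like the ones quoted above, the sketch
below parses an "options <module> key=value ..." line and prints each requested
value next to whatever the loaded module reports under
/sys/module/<module>/parameters. The function names and the shortened example
line are illustrative assumptions, not part of amdgpu or modprobe, and the live
values may be formatted differently from what was requested.

```python
import shlex
from pathlib import Path


def parse_modprobe_options(line: str) -> tuple:
    """Split 'options <module> key=value ...' into (module, {key: value})."""
    tokens = shlex.split(line)
    if len(tokens) < 2 or tokens[0] != "options":
        raise ValueError("expected 'options <module> key=value ...'")
    module = tokens[1]
    params = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"parameter without a value: {token!r}")
        params[key] = value
    return module, params


def read_live_parameters(module: str, sysfs_root: str = "/sys/module") -> dict:
    """Return the values the loaded module reports, or {} if it is not loaded."""
    param_dir = Path(sysfs_root) / module / "parameters"
    if not param_dir.is_dir():
        return {}
    live = {}
    for entry in param_dir.iterdir():
        try:
            live[entry.name] = entry.read_text().strip()
        except OSError:
            # Some parameters cannot be read without sufficient privileges.
            continue
    return live


# Shortened, hypothetical example line; substitute your own modprobe.d entry.
EXAMPLE_LINE = (
    "options amdgpu vm_update_mode=0 gpu_recovery=1 "
    "lockup_timeout=2500,5000,8000,1000"
)

if __name__ == "__main__":
    module, wanted = parse_modprobe_options(EXAMPLE_LINE)
    live = read_live_parameters(module)
    for name, value in wanted.items():
        print(f"{name}: requested={value} live={live.get(name, 'n/a')}")
```

This only shows which values are in effect; it says nothing about what each
option does or why a default was chosen.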

If I were using defaults, I would still be sitting on a 3.3 GHz (instead of
4 GHz + 2.4 GHz for MMU & cache) FX CPU, non-ECC RAM run by the notoriously
weak AMD FX MMU (you KNOW the one, the laughing stock of 2011-2017 x86 CPUs!)
at slow default JEDEC timings, a ~200 W (instead of down-clocked and/or
undervolted 90-120 W) RX580 GPU (that would, no doubt, fry itself at some
point like my previous 6870 did) with slow memory timings, a sluggish
non-patched kwin, 64 ms of audio latency (instead of 8-12 ms) and a whole bunch
of random hangs/drops in audio, video stuttering and input delays/skips due to
scheduling priorities that are all over the place by default. So, no, thank
you very much. And YOU should NOT be testing exclusively on defaults either.

(In reply to Tom Seewald from comment #55)
> (In reply to Sergey Kondakov from comment #53)
> > Created attachment 285209 [details]
> > dmesg_2019-09-26-amdgpu-old_dereference_on_patched_5.3.1
> > 
> > After about a day of uptime my patched 5.3.1 hanged during hours-long
> > Youtube video with dereference that is almost identical to the original
> > one:
> I don't believe the patches[1] have landed in a stable kernel release yet,
> at least going by the 5.3.1 change log[2] I don't see any reference to them.
> [1] https://patchwork.freedesktop.org/series/64505/
> [2] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.1

They seem to be in the queue for 5.3.2:
BUT those only address dereference #1 (PageFlip), NOT #2 (when vm_update_mode
is not 0) or #3!
