Kernel 5.15.150 black screen with AMD Raven/Picasso GPU

Armin Wolf W_Armin at gmx.de
Thu May 23 15:59:39 UTC 2024


On 23.05.24 at 15:13, Barry Kauler wrote:

> On Wed, May 22, 2024 at 12:58 AM Armin Wolf <W_Armin at gmx.de> wrote:
>> On 20.05.24 at 18:22, Alex Deucher wrote:
>>
>>> On Sat, May 18, 2024 at 8:17 PM Armin Wolf <W_Armin at gmx.de> wrote:
>>>> On 17.05.24 at 03:30, Barry Kauler wrote:
>>>>
>>>>> Armin, Yifan, Prike,
>>>>> I will top-post, so you don't have to scroll down.
>>>>> After identifying the commit that causes black screen with my gpu, I
>>>>> posted the result to you guys, on May 9.
>>>>> It is now May 17 and no reply.
>>>>> OK, I have now created a patch that reverts Yifan's commit, compiled
>>>>> 5.15.158, and my gpu now works.
>>>>> Note, the radeon module is not loaded, so it is not a factor.
>>>>> I'm not a kernel developer. I have identified the culprit and it is up
>>>>> to you guys to fix it, Yifan especially, as you are the person who has
>>>>> created the regression.
>>>>> I will attach my patch.
>>>>> Regards,
>>>>> Barry Kauler
>>>> Hi,
>>>>
>>>> Sorry for not responding to your findings. I normally do not work with GPU drivers,
>>>> so I hoped one of the amdgpu developers would handle this.
>>>>
>>>> I CCed dri-devel at lists.freedesktop.org and amd-gfx at lists.freedesktop.org so that other
>>>> amdgpu developers hear about this issue.
>>>>
>>>> Thank you for your persistence in finding the offending commit.
>>> Likely this patch should not have been ported to 5.15 in the first
>>> place.  The IOMMU requirements have been dropped from the driver for
>>> the last few kernel versions so it is no longer relevant on newer
>>> kernels.
>>>
>>> Alex
>> Barry, can you verify that the latest upstream kernel works on your device?
>> If yes, then the commit itself is ok and only the backport was wrong.
>>
>> Thanks,
>> Armin Wolf
> Armin,
> The unmodified 6.8.1 kernel works ok.
> I presume that patch was applied long before 6.8.1 got released and
> only got backported to 5.15.x recently.
>
> Regards,
> Barry
>
Great to hear. That means we only have to revert commit 56b522f46681 ("drm/amdgpu: init iommu after amdkfd device init")
from the 5.15.y series.

I CCed the stable mailing list so that they can revert the offending commit.

Thanks,
Armin Wolf
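
For anyone who wants to test the revert locally before it lands in 5.15.y,
a minimal sketch follows. It assumes a local clone of linux-stable with a
5.15.y release checked out on a local branch. The helper name revert_commit
is hypothetical and only used for illustration; it is not part of git.

```shell
#!/bin/sh
# Sketch: revert a single commit in a local git tree.
# revert_commit is a hypothetical helper name (an assumption, not a git command).
revert_commit() {
    repo_dir=$1   # path to a local clone, e.g. ./linux-stable
    commit=$2     # commit to revert, e.g. 56b522f46681
    # --no-edit keeps git's default 'Revert "..."' commit message
    git -C "$repo_dir" revert --no-edit "$commit" || return 1
    # Print the newly created revert commit on one line
    git -C "$repo_dir" log -1 --oneline
}

# Example call (assumes the clone and branch described above):
# revert_commit ./linux-stable 56b522f4668167096a50c39446d6263c96219f5f
```

After the revert, the kernel still has to be rebuilt and booted on the
affected machine to confirm that the black screen is gone.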

>>>> Armin Wolf
>>>>
>>>>> On Thu, May 9, 2024 at 4:08 PM Barry Kauler <bkauler at gmail.com> wrote:
>>>>>> On Fri, May 3, 2024 at 9:03 PM Armin Wolf <W_Armin at gmx.de> wrote:
>>>>>>>> ...
>>>>>>>> # lspci | grep VGA
>>>>>>>> 05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] (rev c2)
>>>>>>>> 05:00.7 Non-VGA unclassified device: Advanced Micro Devices, Inc. [AMD] Raven/Raven2/Renoir Non-Sensor Fusion Hub KMDF driver
>>>>>>>>
>>>>>>>> # lspci -n -k
>>>>>>>> ...
>>>>>>>> 05:00.0 0300: 1002:15d8 (rev c2)
>>>>>>>> Subsystem: 1025:1456
>>>>>>>> Kernel driver in use: amdgpu
>>>>>>>> Kernel modules: amdgpu
>>>>>>>> ...
>>>>>>> Thanks for informing us of this regression. Since there are four commits affecting
>>>>>>> amdgpu in 5.15.150, I suggest that you use "git bisect" to find the faulty commit;
>>>>>>> see https://docs.kernel.org/admin-guide/bug-bisect.html for details.
>>>>>>>
>>>>>>> I think you can speed up the bisecting process by limiting yourself to the AMD DRM
>>>>>>> driver directory with "git bisect start -- drivers/gpu/drm/amd"; take a look at the
>>>>>>> man page of "git bisect" for details.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Armin Wolf
>>>>>> Armin,
>>>>>> Thanks for the advice. I am unfamiliar with git on the command line.
>>>>>> Previously I only used the SmartGit GUI.
>>>>>> EasyOS requires aufs patch, and for a few days tried to figure out how
>>>>>> to use that with git bisect, then gave up. Changed to testing with my
>>>>>> "QV" distro, which is more conventional, doesn't need any kernel
>>>>>> patches. Managed to get it down to one commit. Here are the steps I
>>>>>> followed:
>>>>>>
>>>>>> # git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
>>>>>> # cd linux-stable
>>>>>> # git tag -l | grep '5\.15\.150'
>>>>>> v5.15.150
>>>>>> # git checkout -b my5.15.150 v5.15.150
>>>>>> Updating files: 100% (65776/65776), done.
>>>>>> Switched to a new branch 'my5.15.150'
>>>>>>
>>>>>> Copied in my .config then...
>>>>>>
>>>>>> # make menuconfig
>>>>>> # git bisect start -- drivers/gpu/drm/amd
>>>>>> # git bisect bad
>>>>>> # git bisect good v5.15.149
>>>>>> Bisecting: 1 revision left to test after this (roughly 1 step)
>>>>>> [b9a61ee2bb2704e42516e3da962f99dfa98f3b20] drm/amdgpu: reset gpu for s3 suspend abort case
>>>>>> # make
>>>>>> # rm -rf /boot2
>>>>>> # mkdir -p /boot2/lib/modules
>>>>>> # make INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=/boot2 modules_install
>>>>>> # cp arch/x86/boot/bzImage /boot2/vmlinuz
>>>>>> # sync
>>>>>> ...QV on Acer laptop, with amdgpu, works!
>>>>>> # git bisect good
>>>>>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>>>>>> [56b522f4668167096a50c39446d6263c96219f5f] drm/amdgpu: init iommu after amdkfd device init
>>>>>> # make
>>>>>> # mkdir -p /boot2/lib/modules
>>>>>> # make INSTALL_MOD_STRIP=1 INSTALL_MOD_PATH=/boot2 modules_install
>>>>>> # cp arch/x86/boot/bzImage /boot2/vmlinuz
>>>>>> # sync
>>>>>> ...QV on Acer laptop, black screen!
>>>>>>
>>>>>> # git bisect bad
>>>>>> 56b522f4668167096a50c39446d6263c96219f5f is the first bad commit
>>>>>> commit 56b522f4668167096a50c39446d6263c96219f5f
>>>>>> Author: Yifan Zhang <yifan1.zhang at amd.com>
>>>>>> Date:   Tue Sep 28 15:42:35 2021 +0800
>>>>>>
>>>>>>        drm/amdgpu: init iommu after amdkfd device init
>>>>>>
>>>>>>        [ Upstream commit 286826d7d976e7646b09149d9bc2899d74ff962b ]
>>>>>>
>>>>>>        This patch is to fix clinfo failure in Raven/Picasso:
>>>>>>
>>>>>>        Number of platforms: 1
>>>>>>          Platform Profile: FULL_PROFILE
>>>>>>          Platform Version: OpenCL 2.2 AMD-APP (3364.0)
>>>>>>          Platform Name: AMD Accelerated Parallel Processing
>>>>>>          Platform Vendor: Advanced Micro Devices, Inc.
>>>>>>          Platform Extensions: cl_khr_icd cl_amd_event_callback
>>>>>>
>>>>>>          Platform Name: AMD Accelerated Parallel Processing Number of devices: 0
>>>>>>
>>>>>>        Signed-off-by: Yifan Zhang <yifan1.zhang at amd.com>
>>>>>>        Reviewed-by: James Zhu <James.Zhu at amd.com>
>>>>>>        Tested-by: James Zhu <James.Zhu at amd.com>
>>>>>>        Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>>>        Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>>>>>>        Signed-off-by: Sasha Levin <sashal at kernel.org>
>>>>>>
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
>>>>>>     1 file changed, 4 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> Anything else I should do, to identify what in this commit is the
>>>>>> likely culprit?
>>>>>> Regards,
>>>>>> Barry Kauler
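
A possible follow-up to the bisect result above is to check which releases
already contain the upstream commit named in the commit message
(286826d7d976e7646b09149d9bc2899d74ff962b). The sketch below assumes a
clone that carries the mainline release tags. The helper name
tags_containing is hypothetical.

```shell
#!/bin/sh
# Sketch: list the tags whose history contains a given commit.
# tags_containing is a hypothetical helper name (an assumption, not a git command).
tags_containing() {
    repo_dir=$1   # path to a local clone, e.g. ./linux-stable
    commit=$2     # commit to look up, e.g. the upstream commit ID
    # --contains restricts the listing to tags that include the commit;
    # --sort=version:refname orders the tag names by version number
    git -C "$repo_dir" tag --sort=version:refname --contains "$commit"
}

# Example call (assumes the clone described above):
# tags_containing ./linux-stable 286826d7d976e7646b09149d9bc2899d74ff962b | head
```

If the list starts well before the release under test, the change has been
in mainline for a while and the regression is more likely specific to the
backport.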

