It appears drm-next TTM cleanup broke something . . .

Christian König christian.koenig at amd.com
Mon Oct 19 16:37:25 UTC 2020


Hi Kevin,

> OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.

Ugh, that would be an absolute no-go for upstreaming.

> Is it too much to ask for using more BUG_ON null pointer assertions for TTM callbacks?

I don't think that this is useful at all. See a BUG_ON() has the same 
result as a NULL pointer dereference.

> While that may be true, I do plan to implement acceleration later, and this is why I do not want to settle with the GEM VRAM helpers.

TTM is quite a mess and the effort here is essential to clean it up and 
kill most of the driver specific workarounds we have in there.

As the maintainer of all of this I would probably reject any newly added 
driver which is using the layer directly instead of the VRAM GEM wrapper.

Regards,
Christian.

Am 19.10.20 um 18:20 schrieb Kevin Brace:
> Hi Christian,
>
> I looked into a few more things, and figured out why OpenChrome DRM was not booting X Server.
> Now the situation is under control in my side of the world (OpenChrome development world), and I got X Server working again with drm-next 5.10 code and OpenChrome DRM.
> Code will be committed into OpenChrome DRM upstream repository's drm-next-5.10 branch in the next few days.
>      I can see that OpenChrome DRM "just happened to" work without ttm_tt_create/destroy callbacks being specified.
> I can fix this issue easily.
> The other issue I was facing was that commit 48e07c23cbeba2a2cda7ca73be0015e727818536 (drm/ttm: nuke memory type flags) discontinued TTM_PL_FLAG_* flags.
> This caused incompatibility between OpenChrome DDX and the updated OpenChrome DRM that incorporated the changes made with commit 48e07c2.
> OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.
> I changed OpenChrome DDX code to use TTM_PL_* instead for allocating memory via TTM, and OpenChrome DRM based Linux kernel was able to boot X Server properly.
> Again, the code change to the DDX will happen in a few days.
>      Before I moved on from this issue, I will like to ask for one request regarding TTM and its callbacks.
> Is it too much to ask for using more BUG_ON null pointer assertions for TTM callbacks?
> Although I did not bring it up with you back then, I did get hit with an issue similar to this 3 years ago.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D3c08ec601bb1ccd5ff58a9101317b728aa7204bd&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wmGsqFinBsTLJSgp6G5ygoW7J%2Fppph4CzgO9q7q4h34%3D&reserved=0
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D2064175f977d9859831be653df16e3ea10415a8a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=1Pl3PZ%2FY8P2aFE9in3oYCgl1CW6Nq%2FIQiHe75U8Dp0I%3D&reserved=0
>
>
> I did not send you an e-mail about it, I do recall complaining to Alex Deucher about the above io_mem_pfn null pointer bug at XDC2017 (I did attend it, and I presented about the state of OpenChrome Project.).
> I brought this up with Alex because I stuck for more than 2 weeks to figure out the issue (the commit for overcoming io_mem_pfn callback issue was made on 2017-09-13, but the commit prior to that was on 2017-08-28).
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Flog%2F%3Fh%3Ddrm-next-4.13%26ofs%3D50&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=zsLWqjAPDFhokHptCAHl4XRCZJSjcWsnvAimoUyb7Mk%3D&reserved=0
>
>
> Anyway, someone else got hit by the same issue a month or two later
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-November%2F159002.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wbkUKUL0XVOviw%2Bvt0WFGvSyJiOF%2Bdz%2FZVSMSPTwuwI%3D&reserved=0
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-December%2F161168.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=nJ8n0lFRTX%2FoxPsKCvV%2FYvtUvmvzxLGVikUWtBsZy5U%3D&reserved=0
>
>
> These are the commits that fixed the issue.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dea642c3216cb2a60d1c0e760ae47ee85c9c16447&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=aL1C28MKl5SPRsP%2F9A6XWFMmtXmxymmGic3vLZe5%2FNw%3D&reserved=0
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dc67fa6edc8b11afe22c88a23963170bf5f151acf&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=%2BY1z8sgjIf%2FPiJG9PBIOnsHuRCFq1uHNSVJodWQz%2B5k%3D&reserved=0
>
>
> If you compare the commit date, I figured it out io_mem_pfn null pointer bug 2 to 3 months earlier.
> I think the problem we got with the latest code change to the TTM is, many TTM callbacks do not check for a null pointer for mandatory callbacks.
> Also, the ttm_bo_driver struct comment section inside ttm_bo_driver.h does not make it 100% clear that what is mandatory and what is not.
> If these were done, I am sure I would have left ttm_tt_create/destroy callbacks intact.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-5.10%26id%3D64947142a1cf8b83faa73da7aa35a17f6a24568a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=egysIQ%2BgxKi3ujKmzyRNRvI93SY%2BFI5dlW%2FvyGglAn8%3D&reserved=0
> (scroll down to openchrome/openchrome_ttm.c)
>
>
>      Regarding your suggestion that I should abandon OpenChrome DRM's current GEM / TTM implementation and instead I should use the GEM VRAM helpers, since I was able to figure out the issues with some suggestions from you and Dave, I do not think it is worth switching to GEM VRAM helpers at this point.
> Obviously, the current implementation does not support acceleration, hence, it appears to be a good candidate for switching to GEM VRAM helpers.
> While that may be true, I do plan to implement acceleration later, and this is why I do not want to settle with the GEM VRAM helpers.
>
> Regards,
>
> Kevin Brace
> Brace Computer Laboratory blog
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&reserved=0
>
>
>> Sent: Monday, October 19, 2020 at 3:13 AM
>> From: "Christian König" <ckoenig.leichtzumerken at gmail.com>
>> To: "Kevin Brace" <kevinbrace at gmx.com>, "Dave Airlie" <airlied at gmail.com>, "dri-devel" <dri-devel at lists.freedesktop.org>
>> Subject: Re: It appears drm-next TTM cleanup broke something . . .
>>
>> Hi Kevin,
>>
>> the basic problem you are facing is that ttm_tt_create/destroy is
>> mandatory (It always was). You need an implementation or otherwise you
>> won't be able to use the system domain (additional to the optional GTT
>> domain).
>>
>> My best guess is that the difference is that we now force to initiate
>> the system domain for all drivers.
>>
>> If that is correct you just that you never ran into because you never
>> correctly initialized TTM to support buffer moves.
>>
>> I'm not sure what exactly the OpenChrome DRM driver is doing, but I
>> strongly suggest to just drop TTM support completely and use the GEM
>> VRAM helper layer instead.
>>
>> Regards,
>> Christian.
>>
>> Am 19.10.20 um 09:23 schrieb Kevin Brace:
>>> Hi Dave,
>>>
>>> Yeah, with the workaround I mentioned in my previous e-mail, OpenChrome DRM does not crash for "ttm_tt_create" member being null.
>>> It is still not able to boot X Server due to some other TTM related memory allocation issue it is suffering from.
>>> I think making huge changes to TTM during this development cycle broke OpenChrome DRM.
>>>       Following up on the question I raised during the previous e-mail.
>>> Shouldn't "use_tt" parameter being "false" for ttm_range_man_init() disable TTM TT functionality?
>>> I feel like that should be the expected behavior.
>>> Again, there is only 5 to 6 more days left until Linux 5.10-rc2, so I decided to contact you on Sunday (I consider this bug to be urgent.).
>>> Assuming what I am asserting is correct, I think the reason why this was not discovered earlier was due to the following reasons.
>>>
>>> 1) nouveau, radeon, and amdgpu already use TTM TT functionality.
>>> 2) ast uses GEM VRAM helper that internally uses TTM. It populates "ttm_tt_create" and "ttm_tt_destroy" members, hence, the developers did not notice the breakage.
>>> 3) OpenChrome DRM is still not in the mainline tree, so no one other than myself noticed the problem until now.
>>>
>>>
>>> Regarding the TTM TT functionality, OpenChrome DRM currently does not support acceleration, hence, I did not believe it was necessary to populate "ttm_tt_create" and "ttm_tt_destroy" members.
>>> That implementation worked fine until the previous development cycle code.
>>> Of course, I will eventually add support for acceleration, hence, TTM TT functionality will be utilized at some point.
>>>
>>> Regards,
>>>
>>> Kevin Brace
>>> Brace Computer Laboratory blog
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&reserved=0
>>>
>>>
>>>> Sent: Sunday, October 18, 2020 at 12:50 PM
>>>> From: "Dave Airlie" <airlied at gmail.com>
>>>> To: "Kevin Brace" <kevinbrace at gmx.com>, "Christian König" <ckoenig.leichtzumerken at gmail.com>
>>>> Cc: "dri-devel" <dri-devel at lists.freedesktop.org>, "Dave Airlie" <airlied at redhat.com>
>>>> Subject: Re: It appears drm-next TTM cleanup broke something . . .
>>>>
>>>> On Mon, 19 Oct 2020 at 05:15, Kevin Brace <kevinbrace at gmx.com> wrote:
>>>>> Hi Dave,
>>>>>
>>>>> It is a little urgent, so I am writing this right now.
>>>>> As usual, I pulled in DRM repository code for an out of tree OpenChrome DRM repository a few days ago.
>>>>> While going through the changes I need to make to OpenChrome DRM to compile with the latest Linux kernel, I noticed that ttm_bo_init_mm() was discontinued, and it was replaced with ttm_range_man_init().
>>>>> ttm_range_man_init() has a parameter called "bool use_tt", but honestly, I do not think it is functioning correctly.
>>>>> If I keep "ttm_tt_create" member of ttm_bo_driver struct null by not specifying it, TTM still tries to call it, and crashes due to a null pointer access.
>>>>> The workaround I found so far is to specify the "ttm_tt_create" member by copying bo_driver_ttm_tt_create() from drm/drm_gem_vram_helper.c.
>>>>> This is what the call trace looks like without specifying the "ttm_tt_create" member (i.e., this member is null).
>>>> cc'ing Christian,
>>>>
>>>> I can't remember if we did this deliberately or if just worked by
>>>> accident previously.
>>>>
>>>> Either way, you should probably need a ttm_tt_create going forward.
>>>>
>>>> Dave.
>>>>
>>>>> _______________________________________________
>>>>> . . .
>>>>> kernel: [   34.310674] [drm:openchrome_bo_create [openchrome]] Entered openchrome_bo_create.
>>>>> kernel: [   34.310697] [drm:openchrome_ttm_domain_to_placement [openchrome]] Entered openchrome_ttm_domain_to_placement.
>>>>> kernel: [   34.310706] [drm:openchrome_ttm_domain_to_placement [openchrome]] Exiting openchrome_ttm_domain_to_placement.
>>>>> kernel: [   34.310737] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> kernel: [   34.310742] #PF: supervisor instruction fetch in kernel mode
>>>>> kernel: [   34.310745] #PF: error_code(0x0010) - not-present page
>>>>> . . .
>>>>> kernel: [   34.310807] Call Trace:
>>>>> kernel: [   34.310827]  ttm_tt_create+0x5f/0xa0 [ttm]
>>>>> kernel: [   34.310839]  ttm_bo_validate+0xb8/0x140 [ttm]
>>>>> kernel: [   34.310886]  ? drm_vma_offset_add+0x56/0x70 [drm]
>>>>> kernel: [   34.310897]  ? openchrome_gem_create_ioctl+0x150/0x150 [openchrome]
>>>>> . . .
>>>>> _______________________________________________
>>>>>
>>>>> The erroneous call to  "ttm_tt_create" member happens right after TTM placement is performed (openchrome_ttm_domain_to_placement()).
>>>>> Currently, OpenChrome DRM's TTM implementation does not use "ttm_tt_create" member, and this arrangement worked fine until Linux 5.9's drm-next code.
>>>>> It appears that Linux 5.10's drm-next code broke the code.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Kevin Brace
>>>>> Brace Computer Laboratory blog
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&sdata=oQr07H953zWryuBQJyPgFsqtOkXAtfprsxiA4ny%2F4LE%3D&reserved=0
>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel at lists.freedesktop.org
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&sdata=gq2l57u0hZD2arzcWU2ppTuA0mgcshZHI%2BVvHYJ8hcs%3D&reserved=0
>>



More information about the dri-devel mailing list