Regression: drm: Lobotomize set_busid nonsense for !pci drivers (a325725633c2)

Laszlo Ersek lersek at redhat.com
Mon Oct 3 14:27:29 UTC 2016


On 10/03/16 16:15, Laszlo Ersek wrote:
> On 10/03/16 13:34, Emil Velikov wrote:
>> On 30 September 2016 at 18:26, Laszlo Ersek <lersek at redhat.com> wrote:
>>> On 09/30/16 18:38, Hans de Goede wrote:
>>>> Hi,
>>>>
>>>> On 30-09-16 17:33, Laszlo Ersek wrote:
>>>>> On 09/30/16 16:59, Hans de Goede wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>>>>
>>>>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>>>>> please give them a try.
>>>>>>>
>>>>>>> Sorry about the delay.
>>>>>>>
>>>>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>>>>> attached.
>>>>>>>
>>>>>>> I tested the patches as follows. Given that my bisection had been done
>>>>>>> in a Fedora 24 guest, using
>>>>>>>
>>>>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>>>>
>>>>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>>>>> reproduced the issue with the above X server.
>>>>>>>
>>>>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>>>>> the
>>>>>>> backport, I had to cherry-pick the following two patches from master
>>>>>>> first:
>>>>>>>
>>>>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>>>>                xf86IsPrimaryPlatform()
>>>>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>>
>>>>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>>>>> actually
>>>>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>>>>> context only, and has no impact for this test.)
>>>>>>>
>>>>>>> That is, in total, I added the following four patches to the Fedora 24
>>>>>>> package:
>>>>>>>
>>>>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>>>>
>>>>>>> You can find the scratch build that I used for testing here:
>>>>>>>
>>>>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>>>>
>>>>>>> Another reason I used F24's X server as basis, rather than upstream
>>>>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>>>>> dependencies just the same.
>>>>>>>
>>>>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>>>>> you for spending your Friday on this!
>>>>>>
>>>>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>>>>> about the xserver not being able to find an entry for screen 0 does
>>>>>> not make sense ...
>>>>>
>>>>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>>>>
>>>>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>>>>> contents
>>>>>
>>>>> ------------
>>>>> # Read and parsed by systemd-localed. It's probably wise not to edit
>>>>> this file
>>>>> # manually too freely.
>>>>> Section "InputClass"
>>>>>         Identifier "system-keyboard"
>>>>>         MatchIsKeyboard "on"
>>>>>         Option "XkbLayout" "us"
>>>>> EndSection
>>>>> ------------
>>>>>
>>>>> * "01-resolution.conf", which I had created, in order to set the
>>>>> preferred display resolution:
>>>>>
>>>>> ------------
>>>>> Section "Screen"
>>>>>   Identifier "Default Screen"
>>>>>   Device     "Default Device"
>>>>>   Monitor    "Default Monitor"
>>>>> EndSection
>>>>>
>>>>> Section "Device"
>>>>>   Identifier "Default Device"
>>>>>   Driver     "modesetting"
>>>>> EndSection
>>>>>
>>>>> Section "Monitor"
>>>>>   Identifier "Default Monitor"
>>>>>   Option     "PreferredMode"   "640x480"
>>>>> # Option     "PreferredMode"   "1440x900"
>>>>> EndSection
>>>>> ------------
>>>>>
>>>>> I removed these files now, and repeated the test. Again, the X server
>>>>> wouldn't start, but I think the log file looks a bit different now.
>>>>> Attached.
>>>>
>>>> Ah, ok so it seems that my initial analysis is wrong, the problem
>>>> is not a re-occuring of the device getting identified as a GPU screen,
>>>> libdrm sorta depends on bus-ids and the lack of one is causing the
>>>> server to misbehave. I guess that even with a xorg.conf things
>>>> will fail with the troublesome kernel version (might be worth
>>>> trying).
>>>>
>>>> Emil's analysis seems to be spot on. This does not seem easily
>>>> fixable in userspace / does seem like a real regression as it
>>>> even breaks things when specifying the device through xorg.conf
>>>> (I or so I believe) which is something which uses to work ...
>>>
>>> In order to check this hypothesis, I did the following:
>>> - I downgraded my xorg-x11-server installation to the most recent
>>> official F24 packages, that is, "1.18.4-4.fc24",
>>> - I kept the kernel that I built exactly at the regressive commit
>>> (a325725633c2)
>>> - I modified "01-resolution.conf" (see it above in the context) like this:
>>>
>>> ----
>>> Section "Device"
>>>   Identifier "Default Device"
>>>   Driver     "modesetting"
>>>   BusID      "PCI:00:02:0" <------------ new option added
>>> EndSection
>>> ----
>>>
>>> where BusID matches the B/D/F of the virtio-vga device from "lspci".
>>>
>>> This setup (modulo the kernel of course) was known to work, but now the
>>> X server actually segfaults (apparently in the
>>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>>
>>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>>> it is left at its default "on" value.)
>>>
>> Where is this upstream commit again - it shows as unknown for the
>> kernel, xserver and libdrm ?
>>
>> So my theory was a bit off - SetVersion is the one responsible to set
>> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
>> open is used.
>>
>> Here's a bit of a brain dump from the other day:
>>
>>  - The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
>> userspace codepaths.
>>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
>> not functional for platform devices.
>> The latter of which seems to be the case for virt-gpu based on the
>> kernel module.
>>  - The modesetting driver should/cannot reach the above xserver codepath
>>
>> That said, it seems that (at least some) userspace expects a PCI
>> device despite the kernel module 'advertising' itself as platform one
>> :-\
> 
> With kernel commit a325725633c2 applied, the drmGetBusid() call in
> get_drm_info() [hw/xfree86/os-support/linux/lnx_platform.c] returns
> "virtio0".
> 
> Without kernel commit a325725633c2 in place, the same function call
> produces "pci:0000:00:02.0".
> 
> I think that the idea of kernel commit a325725633c2:
> 
>     We already have a fallback in place to fill out the unique from
>     dev->unique, which is set to something reasonable in drm_dev_alloc.
> 
> is inappropriate for virtio devices.
> 
> While it is true that "virtioN" is unique across the guest system, those
> identifiers are not stable.
> 
> # ls -1d /sys/devices/pci*/*/virtio*
> /sys/devices/pci0000:00/0000:00:02.0/virtio0
> /sys/devices/pci0000:00/0000:00:03.0/virtio1
> /sys/devices/pci0000:00/0000:00:05.0/virtio2
> /sys/devices/pci0000:00/0000:00:06.0/virtio3
> /sys/devices/pci0000:00/0000:00:09.0/virtio4
> 
> If you tweak the PCI addresses of other virtio devices on the QEMU
> command line, while keeping the PCI address of the virtio-vga device
> intact, the "virtioN" identifier of the virtio-vga device may change in
> an unspecified / unexpected manner.
> 
> From the above listing, it seems like higher PCI Device numbers happen
> to end up with higher "virtioN" identifiers. While this is unspecified /
> non-guaranteed behavior, in practice it means the following:
> 
> Moving the "virtio1" device from PCI address 00:03.0 to 00:01.0, while
> preserving the PCI address of virtio-vga as 00:02.0, will bump
> virtio-vga to "virtio1" from "virtio0" (and sink that other device from
> "virtio1" to "virtio0").
> 
> I think that unstable identifiers don't qualify for BusID use,
> regardless of the actual format of the IDs in question.
> 
> Searching the patch quickly, drm_virtio_set_busid() is the only
> implementation of the "drm_driver.set_busid" callback that delegates the
> task to drm_pci_set_busid(). All other implementations of this callback
> were provided by drm_platform_set_busid().
> 
> drm_platform_set_busid() would ultimately format
> "dev->platformdev->name" and "dev->platformdev->id" into the bus
> identifier. Replacing that common logic with the drm_dev_alloc()
> fallback that is already in place is probably justified: judging from
> "virtio0", the fallback likely captures the exact same information, just
> with a different format.
> 
> However, drm_virtio_set_busid() used to implement a different logic
> (encoding different information). It intentionally used stable PCI
> addresses rather than (platformdev->name, platformdev->id).
> 
> So, I think that removing drm_virtio_set_busid() was an actual
> regression. I propose that the related hunks of a325725633c2 be
> reverted. For virtio-vga and virtio-gpu-pci, "virtioN" is an unstable
> and inappropraite BusID pattern.

In support of my argument, please see the following remark in commit
message:

    v4: Drop accidental amdgpu hunk (Emil).

Now, if you look up the v3 posting for that amdgpu hunk:

  https://lists.freedesktop.org/archives/dri-devel/2016-June/111200.html
  https://lists.freedesktop.org/archives/dri-devel/2016-June/111486.html

the v3 patch accidentally removed the similarly customized set_busid
callback for amdgpu -- drm_pci_set_busid(). Emil caught that error in
review, hence the v4 patch wouldn't contain the same error.

My argument is exactly the same -- just as the amdgpu hunk shouldn't be
in the patch, the drm_virtio_set_busid() hunk shouldn't be there either.
Both amdgpu and virtio-gpu implement a set_busid callback -- by
delegating to drm_pci_set_busid() -- that *cannot* be replaced with the
fallback in drm_dev_alloc().

Please revert the drm_virtio_set_busid() part of kernel commit
a325725633c2; at this point I'm 95% sure it was an oversight that didn't
get caught in review.

The rest of kernel commit a325725633c2 looks sane to me; while it
changes the formatting of the busid for the affected platform devices,
relying on the fallback in those cases actually preserves the same
information content. (This is not true for amdgpu and virtio-gpu.)

Thanks
Laszlo


More information about the dri-devel mailing list