Regression: drm: Lobotomize set_busid nonsense for !pci drivers (a325725633c2)

Laszlo Ersek lersek at redhat.com
Mon Oct 3 14:15:38 UTC 2016


On 10/03/16 13:34, Emil Velikov wrote:
> On 30 September 2016 at 18:26, Laszlo Ersek <lersek at redhat.com> wrote:
>> On 09/30/16 18:38, Hans de Goede wrote:
>>> Hi,
>>>
>>> On 30-09-16 17:33, Laszlo Ersek wrote:
>>>> On 09/30/16 16:59, Hans de Goede wrote:
>>>>> Hi,
>>>>>
>>>>> On 30-09-16 16:51, Laszlo Ersek wrote:
>>>>>> On 09/30/16 12:35, Hans de Goede wrote:
>>>>>>
>>>>>>> Attached are 2 patches against the xserver which should fix this,
>>>>>>> please give them a try.
>>>>>>
>>>>>> Sorry about the delay.
>>>>>>
>>>>>> The patches don't seem to fix the issue for me. Please see the Xorg log
>>>>>> attached.
>>>>>>
>>>>>> I tested the patches as follows. Given that my bisection had been done
>>>>>> in a Fedora 24 guest, using
>>>>>>
>>>>>>   xorg-x11-server-1.18.4-4.fc24
>>>>>>   http://koji.fedoraproject.org/koji/buildinfo?buildID=794494
>>>>>>
>>>>>> I now rebuilt the guest kernel exactly at the failing commit (a325725
>>>>>> "drm: Lobotomize set_busid nonsense for !pci drivers"), and first
>>>>>> reproduced the issue with the above X server.
>>>>>>
>>>>>> Then, I ported your patches to "xorg-server-1.18.4" (using the upstream
>>>>>> xserver tree), and rebuilt the Fedora package with the backport. For
>>>>>> the
>>>>>> backport, I had to cherry-pick the following two patches from master
>>>>>> first:
>>>>>>
>>>>>> 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in
>>>>>>                xf86IsPrimaryPlatform()
>>>>>> 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>>
>>>>>> This way your patches applied cleanly. (Cherry pick #1 above is
>>>>>> actually
>>>>>> necessary for semantics, while cherry pick #2 is needed for a clean
>>>>>> context only, and has no impact for this test.)
>>>>>>
>>>>>> That is, in total, I added the following four patches to the Fedora 24
>>>>>> package:
>>>>>>
>>>>>> 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform()
>>>>>> 2 config: fix GPUDevice fail when AutoAddGPU off + BusID
>>>>>> 3 xfree86: Make adding unclaimed devices as GPU devices a separate step
>>>>>> 4 xfree86: Try harder to find atleast 1 non GPU Screen
>>>>>>
>>>>>> You can find the scratch build that I used for testing here:
>>>>>>
>>>>>>   xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24
>>>>>>   http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087
>>>>>>
>>>>>> Another reason I used F24's X server as basis, rather than upstream
>>>>>> HEAD, is that Fedora 24 is pretty young, and it's already on kernel
>>>>>> 4.7.4, and I believe it will soon move to kernel 4.8, without
>>>>>> (necessarily) rebasing its X server package to upstream. IOW the kernel
>>>>>> upgrade to 4.8 will break X in Fedora 24 too, and then I expect the
>>>>>> Fedora X maintainers would have to cherry pick those two patches as
>>>>>> dependencies just the same.
>>>>>>
>>>>>> To summarize, the patches don't seem to help. I shall nonetheless thank
>>>>>> you for spending your Friday on this!
>>>>>
>>>>> Hmm, do you have a xorg.conf file lying around somewhere, the message
>>>>> about the xserver not being able to find an entry for screen 0 does
>>>>> not make sense ...
>>>>
>>>> Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
>>>>
>>>> * "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
>>>> contents
>>>>
>>>> ------------
>>>> # Read and parsed by systemd-localed. It's probably wise not to edit
>>>> this file
>>>> # manually too freely.
>>>> Section "InputClass"
>>>>         Identifier "system-keyboard"
>>>>         MatchIsKeyboard "on"
>>>>         Option "XkbLayout" "us"
>>>> EndSection
>>>> ------------
>>>>
>>>> * "01-resolution.conf", which I had created, in order to set the
>>>> preferred display resolution:
>>>>
>>>> ------------
>>>> Section "Screen"
>>>>   Identifier "Default Screen"
>>>>   Device     "Default Device"
>>>>   Monitor    "Default Monitor"
>>>> EndSection
>>>>
>>>> Section "Device"
>>>>   Identifier "Default Device"
>>>>   Driver     "modesetting"
>>>> EndSection
>>>>
>>>> Section "Monitor"
>>>>   Identifier "Default Monitor"
>>>>   Option     "PreferredMode"   "640x480"
>>>> # Option     "PreferredMode"   "1440x900"
>>>> EndSection
>>>> ------------
>>>>
>>>> I removed these files now, and repeated the test. Again, the X server
>>>> wouldn't start, but I think the log file looks a bit different now.
>>>> Attached.
>>>
>>> Ah, ok so it seems that my initial analysis is wrong, the problem
>>> is not a re-occuring of the device getting identified as a GPU screen,
>>> libdrm sorta depends on bus-ids and the lack of one is causing the
>>> server to misbehave. I guess that even with a xorg.conf things
>>> will fail with the troublesome kernel version (might be worth
>>> trying).
>>>
>>> Emil's analysis seems to be spot on. This does not seem easily
>>> fixable in userspace / does seem like a real regression as it
>>> even breaks things when specifying the device through xorg.conf
>>> (I or so I believe) which is something which uses to work ...
>>
>> In order to check this hypothesis, I did the following:
>> - I downgraded my xorg-x11-server installation to the most recent
>> official F24 packages, that is, "1.18.4-4.fc24",
>> - I kept the kernel that I built exactly at the regressive commit
>> (a325725633c2)
>> - I modified "01-resolution.conf" (see it above in the context) like this:
>>
>> ----
>> Section "Device"
>>   Identifier "Default Device"
>>   Driver     "modesetting"
>>   BusID      "PCI:00:02:0" <------------ new option added
>> EndSection
>> ----
>>
>> where BusID matches the B/D/F of the virtio-vga device from "lspci".
>>
>> This setup (modulo the kernel of course) was known to work, but now the
>> X server actually segfaults (apparently in the
>> xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
>>
>> (NB: this is unrelated to upstream commit de9ce6757c2e -- which the
>> pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" --
>> it is left at its default "on" value.)
>>
> Where is this upstream commit again - it shows as unknown for the
> kernel, xserver and libdrm ?
> 
> So my theory was a bit off - SetVersion is the one responsible to set
> the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or
> open is used.
> 
> Here's a bit of a brain dump from the other day:
> 
>  - The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
> userspace codepaths.
>  - The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
> not functional for platform devices.
> The latter of which seems to be the case for virt-gpu based on the
> kernel module.
>  - The modesetting driver should/cannot reach the above xserver codepath
> 
> That said, it seems that (at least some) userspace expects a PCI
> device despite the kernel module 'advertising' itself as platform one
> :-\

With kernel commit a325725633c2 applied, the drmGetBusid() call in
get_drm_info() [hw/xfree86/os-support/linux/lnx_platform.c] returns
"virtio0".

Without kernel commit a325725633c2 in place, the same function call
produces "pci:0000:00:02.0".

I think that the idea of kernel commit a325725633c2:

    We already have a fallback in place to fill out the unique from
    dev->unique, which is set to something reasonable in drm_dev_alloc.

is inappropriate for virtio devices.

While it is true that "virtioN" is unique across the guest system, those
identifiers are not stable.

# ls -1d /sys/devices/pci*/*/virtio*
/sys/devices/pci0000:00/0000:00:02.0/virtio0
/sys/devices/pci0000:00/0000:00:03.0/virtio1
/sys/devices/pci0000:00/0000:00:05.0/virtio2
/sys/devices/pci0000:00/0000:00:06.0/virtio3
/sys/devices/pci0000:00/0000:00:09.0/virtio4

If you tweak the PCI addresses of other virtio devices on the QEMU
command line, while keeping the PCI address of the virtio-vga device
intact, the "virtioN" identifier of the virtio-vga device may change in
an unspecified / unexpected manner.

>From the above listing, it seems like higher PCI Device numbers happen
to end up with higher "virtioN" identifiers. While this is unspecified /
non-guaranteed behavior, in practice it means the following:

Moving the "virtio1" device from PCI address 00:03.0 to 00:01.0, while
preserving the PCI address of virtio-vga as 00:02.0, will bump
virtio-vga to "virtio1" from "virtio0" (and sink that other device from
"virtio1" to "virtio0").

I think that unstable identifiers don't qualify for BusID use,
regardless of the actual format of the IDs in question.

Searching the patch quickly, drm_virtio_set_busid() is the only
implementation of the "drm_driver.set_busid" callback that delegates the
task to drm_pci_set_busid(). All other implementations of this callback
were provided by drm_platform_set_busid().

drm_platform_set_busid() would ultimately format
"dev->platformdev->name" and "dev->platformdev->id" into the bus
identifier. Replacing that common logic with the drm_dev_alloc()
fallback that is already in place is probably justified: judging from
"virtio0", the fallback likely captures the exact same information, just
with a different format.

However, drm_virtio_set_busid() used to implement a different logic
(encoding different information). It intentionally used stable PCI
addresses rather than (platformdev->name, platformdev->id).

So, I think that removing drm_virtio_set_busid() was an actual
regression. I propose that the related hunks of a325725633c2 be
reverted. For virtio-vga and virtio-gpu-pci, "virtioN" is an unstable
and inappropraite BusID pattern.

Thanks
Laszlo


More information about the dri-devel mailing list