[Intel-gfx] [PATCH 03/51] drm: add managed resources tied to drm_device

Wed Feb 26 14:38:55 UTC 2020

On 26.02.2020 11:21, Daniel Vetter wrote:
> On Wed, Feb 26, 2020 at 10:21:17AM +0100, Andrzej Hajda wrote:
>> On 25.02.2020 16:03, Daniel Vetter wrote:
>>> On Tue, Feb 25, 2020 at 11:27 AM Andrzej Hajda <a.hajda at samsung.com> wrote:
>>>> Hi Daniel,
>>>>
>>>>
>>>> The patchset looks interesting.
>>>>
>>>>
>>>> On 21.02.2020 22:02, Daniel Vetter wrote:
>>>>> We have lots of these. And the cleanup code tends to be of dubious
>>>>> quality. The biggest wrong pattern is that developers use devm_, which
>>>>> ties the release action to the underlying struct device, whereas
>>>>> all the userspace visible stuff attached to a drm_device can long
>>>>> outlive that one (e.g. after a hotunplug while userspace has open
>>>>> files and mmap'ed buffers). Give people what they want, but with more
>>>>> correctness.
>>>> I am not familiar with this stuff, so forgive my stupid questions.
>>>>
>>>> Is it documented how uapi should behave in such a case?
>>>>
>>>> I guess the general rule is to return errors on most ioctls (ENODEV,
>>>> EIO?), and wait until userspace releases everything, as there is not
>>>> much more to do.
>>>>
>>>> If that is true what is the point of keeping these structs anyway -
>>>> trivial functions with small context data should do the job.
>>>>
>>>> I suspect I am missing something but I do not know what :)
>>> We could do the above (also needs unmapping of all mmaps, so userspace
>>> then gets SIGSEGV everywhere) and watch userspace crash&burn.
>>> Essentially if the kernel can't do this properly, then there's no hope
>>> that userspace will be any better.
>>
>> We do not want to crash userspace. We just need to tell userspace that
>> the kernel objects userspace has references to are not valid.
>>
>> For this, two mechanisms should be enough:
>>
>> - signal hot-unplug,
>>
>> - report error (ENODEV for example) on any userspace requests (ioctls)
>> on invalid objects.
>>
>> Expecting userspace to handle ioctl errors properly seems fair.
> The trouble is that maybe it's fair, practice says it's just not going to
> happen.

So what? Bad API usage causes bad things. Crashes will force developers
to fix it; if they do not, we can assume it is not so harmful.

The gain is that the kernel side is simpler and doesn't need to lie :)
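
To make the error-reporting option concrete, here is a minimal userspace
sketch of what I mean. All names in it (toy_device, toy_unplug,
toy_ioctl_set_brightness) are made up for illustration and are not
existing DRM API; it only shows that an entry point can refuse work with
-ENODEV once the device is marked as gone, leaving the old state untouched.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device object that userspace still holds a reference to. */
struct toy_device {
	bool unplugged;   /* set once the hardware has disappeared */
	int brightness;   /* some state normally programmed into hardware */
};

/* Called from the (hypothetical) hot-unplug path. */
void toy_unplug(struct toy_device *dev)
{
	assert(dev != NULL);
	dev->unplugged = true;
}

/* Hypothetical ioctl handler: fails cleanly instead of touching hardware. */
int toy_ioctl_set_brightness(struct toy_device *dev, int value)
{
	assert(dev != NULL);
	if (dev->unplugged)
		return -ENODEV; /* object is no longer valid */

	dev->brightness = value;
	return 0;
}
```

This says nothing about mmap, which is the hard part either way.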

>> Regarding mmap, I am not sure how to properly handle disappearing
>> devices, but this is a common problem regardless of which solution we use.
> signal handler wrapped around every mmap access. Which doesn't compose
> across libraries, so is essentially impossible.
>
> Note that e.g. GL's robustness extensions work exactly like this here
> too: GPU dies, kernel kills all your objects and contexts and everything.
> But the driver keeps "working". The only way to get information that
> everything is actually dead is by querying the robustness extension, which
> then will tell you what's happened.
>
> Again this is because it's impossible to make sure userspace actually
> checks error codes everywhere. It's also prohibitively expensive. vk goes
> as far as outright removing all error validation (at least as much as
> possible).

vk is a different story, and for me it is a counter-example: it has a clear
policy that the user must use the API properly or risk undefined
behavior or a crash. In your proposal I see the opposite: let's
baby-sit the user and protect him from his mistakes.

>
>>> Hence the idea is that we keep everything userspace facing still
>>> around, except it doesn't do much anymore. So connectors still there,
>>> but they look disconnected.
>>
>> It looks like lying to userspace that physical connectors still exist.
>> If we want to lie we need a good reason for that. What is that reason?
>>
>> Why not just tell userspace that the connectors are gone?
> Userspace sucks at handling hotunplugged connectors. Most of it is special
> case code for DP MST connectors only.
>
>>> Userspace can then hopefully eventually
>>> get around to processing the sysfs hotunplug event and remove the
>>> device from all its list. So the long-term idea is that a lot of stuff
>>> keeps working, except the driver doesn't talk to the hardware anymore.
>>> And we just sit around waiting for userspace to clean things up.
>>
>> What does "lot of stuff keeps working" mean? What can a drm driver do
>> without hardware? Could you show some examples?
> Nothing will "work", the goal is simply for userspace to not explode in
> fire and take the entire desktop down with it.

And why do we need to keep the whole drm device for this task? What exactly
causes the userspace explosion?

>
>>> I guess once we have a bunch of the panel/usb drivers converted over
>>> we could indeed document how this is all supposed to work from an uapi
>>> pov. But right now a lot of this is all rather aspirational, I think
>>> only the recent simple display pipe based drivers implement this as
>>> described above.
>>>
>>>>> Mostly copied from devres.c, with types adjusted to fit drm_device and
>>>>> a few simplifications - I didn't (yet) copy over everything. Since
>>>>> the types don't match code sharing looked like a hopeless endeavour.
>>>>>
>>>>> For now it's only super simplified, no groups, you can't remove
>>>>> actions (but kfree exists, we'll need that soon). Plus all specific to
>>>>> drm_device ofc, including the logging. Which I didn't bother to make
>>>>> compile-time optional, since none of the other drm logging is compile
>>>>> time optional either.
>>>> I saw in the v1 thread that copy/paste is OK and merging devres and
>>>> drmres back together can be done later, but experience shows that after
>>>> a short time things get de-synchronized and merging becomes quite painful.
>>>>
>>>> On the other hand, I guess it shouldn't be difficult to split devres into
>>>> a consumer-agnostic core and "struct device" helpers and then use the core
>>>> in drm.
>>>>
>>>> For example currently devres uses two fields from struct device:
>>>>
>>>>     spinlock_t        devres_lock;
>>>>     struct list_head    devres_head;
>>>>
>>>> Let's put them into a separate struct:
>>>>
>>>> struct devres {
>>>>
>>>>     spinlock_t        lock;
>>>>     struct list_head    head;
>>>>
>>>> };
>>>>
>>>> And embed this struct into "struct device".
>>>>
>>>> Then convert all core devres functions to take "struct devres *"
>>>> argument instead of "struct device *" and then these core functions can
>>>> be usable in drm.
>>>>
>>>> This looks like a quite simple separation of the abstraction (devres)
>>>> from its consumer (struct device).
>>>>
>>>> After such a split one could think about changing the name devres to
>>>> something more suitable.
>>> There was a long discussion on v1 exactly about this, Greg's
>>> suggestion was to "just share a struct device". So we're not going to
>>> do this here, and the struct device seems like slight overkill and not
>>> a good enough fit here.
>>
>> But my proposal is different: I want to get rid of "struct device"
>> in the devres core. devres has nothing to do with a device; it was bound to
>> it probably because that was convenient, as device was the only client of
>> devres (I guess). Now, if we want to have more devres clients, abstracting
>> devres out of device seems quite natural. This way we will have proper
>> abstractions without code duplication.
>>
>> Examples of devres-related code according to my proposal:
>>
>> // devres core
>>
>> void devres_add(struct devres_head *dh, void *res)
>> {
>>     struct devres *dr = container_of(res, struct devres, data);
>>     unsigned long flags;
>>
>>     spin_lock_irqsave(&dh->lock, flags);
>>     add_dr(dh, &dr->node); // add_dr() would take the head, not a device
>>     spin_unlock_irqrestore(&dh->lock, flags);
>> }
>>
>> // device devres helper (non core)
>>
>> struct clk *devm_clk_get(struct device *dev, const char *id)
>> {
>>     struct clk **ptr, *clk;
>>
>>     ptr = devres_alloc(devm_clk_release, sizeof(*ptr), GFP_KERNEL);
>>     if (!ptr)
>>         return ERR_PTR(-ENOMEM);
>>
>>     clk = clk_get(dev, id);
>>     if (!IS_ERR(clk)) {
>>         *ptr = clk;
>>         devres_add(&dev->devres, ptr);
>>     } else {
>>         devres_free(ptr);
>>     }
>>
>>     return clk;
>> }
>>
>>
>> Changes are cosmetic. But then you can easily add devres to drmdev:
>>
>> struct drm_device {
>>
>>    ...
>>
>> +   struct devres_head devres;
>>
>> };
>>
>> // then copy/modify from your patch:
>>
>> +void *drmm_kmalloc(struct drm_device *dev, size_t size, gfp_t gfp)
>> +{
>> +	struct drmres *dr;
>> +
>> +	dr = alloc_dr(NULL, size, gfp, dev_to_node(dev->dev));
>> +	if (!dr)
>> +		return NULL;
>> +	dr->node.name = "kmalloc";
>> +
>> +	devres_add(&dev->devres, dr->data); // the only change is here
>> +
>> +	return dr->data;
>> +}
>>
>>
>> Btw, your reimplemented add_dr differs from the original add_dr and is
>> similar to the original devres_add, so your implementation already differs
>> from the original one, and merging these two back will be painful :)
> Oh I know, I guess I could go more into details about why exactly. One
> reason is that I want type-checking, so struct drm_device * instead of
> something else. At least for the userspace callbacks. That's going to be
> tough with your approach - kmalloc is easy, it's the _add_action which
> gets nasty with the type checking.

Something like this:

+static void drmm_action_release(struct devres_head *dh, void *res)
+{
+	struct drm_action_devres *devres = res;
+	struct drm_device *dev = container_of(dh, struct drm_device, devres);
+
+	devres->action(dev, devres->data);
+}
+
+int __drmm_add_action(struct drm_device *dev,
+		      drmres_release_t action,
+		      void *data, const char *name)
+{
+	struct drm_action_devres *devres;
+
+	devres = devres_alloc(drmm_action_release,
+			      sizeof(struct drm_action_devres), GFP_KERNEL);
+	if (!devres)
+		return -ENOMEM;
+
+	devres->data = data;
+	devres->action = action;
+
+	devres_add(&dev->devres, devres);
+	return 0;
+}
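
To show that the container_of() step really gives back a typed owner, here
is a minimal userspace sketch of the same idea. Everything in it is a
made-up stand-in (res_head, res_node, res_alloc, res_add, res_release_all,
toy_drm_device, toy_add_action), not existing kernel API, and it has no
locking, no names and no logging. The core only knows about the list head;
the wrapper recovers the owning device from the embedded head and calls a
callback that takes the device type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* ---- hypothetical consumer-agnostic core ---- */
struct res_head;
typedef void (*res_release_t)(struct res_head *dh, void *res);

struct res_node {
	struct res_node *next;
	res_release_t release;
	unsigned long long data[]; /* payload handed out to the caller */
};

struct res_head {
	struct res_node *first;
};

void *res_alloc(res_release_t release, size_t size)
{
	struct res_node *n = calloc(1, sizeof(*n) + size);

	if (!n)
		return NULL;
	n->release = release;
	return n->data;
}

void res_add(struct res_head *dh, void *res)
{
	assert(dh != NULL && res != NULL);
	struct res_node *n = container_of(res, struct res_node, data);

	n->next = dh->first; /* push front, so release runs newest first */
	dh->first = n;
}

void res_release_all(struct res_head *dh)
{
	while (dh->first) {
		struct res_node *n = dh->first;

		dh->first = n->next;
		n->release(dh, n->data);
		free(n);
	}
}

/* ---- hypothetical consumer with type-checked callbacks ---- */
struct toy_drm_device {
	int released;              /* bumped by the example action below */
	struct res_head resources; /* embedded head, as in the proposal */
};

typedef void (*toy_release_t)(struct toy_drm_device *dev, void *data);

struct toy_action {
	toy_release_t action;
	void *data;
};

static void toy_action_release(struct res_head *dh, void *res)
{
	struct toy_action *a = res;
	struct toy_drm_device *dev =
		container_of(dh, struct toy_drm_device, resources);

	a->action(dev, a->data); /* callback sees the real device type */
}

int toy_add_action(struct toy_drm_device *dev, toy_release_t action, void *data)
{
	struct toy_action *a = res_alloc(toy_action_release, sizeof(*a));

	if (!a)
		return -1;
	a->action = action;
	a->data = data;
	res_add(&dev->resources, a);
	return 0;
}

/* Example action: adds the pointed-to int to dev->released. */
void toy_count_release(struct toy_drm_device *dev, void *data)
{
	dev->released += *(int *)data;
}
```

Registering toy_count_release twice and then calling res_release_all() on
the embedded head runs both actions with the right toy_drm_device pointer,
without the core ever seeing that type.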

Regards

Andrzej