[PATCH v3] drm/plane: Add documentation about software color conversion.

Pekka Paalanen ppaalanen at gmail.com
Mon Sep 11 08:38:27 UTC 2023


On Fri, 8 Sep 2023 17:10:46 +0200
Thomas Zimmermann <tzimmermann at suse.de> wrote:

> Hi
> 
> Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
> > On Fri, 8 Sep 2023 15:56:51 +0200
> > Thomas Zimmermann <tzimmermann at suse.de> wrote:
> >   
> >> Hi
> >>
> >> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:  
> >>> On Fri, 8 Sep 2023 11:21:51 +0200
> >>> Thomas Zimmermann <tzimmermann at suse.de> wrote:
> >>>      
> >>>> Hi
> >>>>
> >>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> >>>> [...]  
> >>>>> + *
> >>>>> + *     But there are two exceptions only for dumb buffers:
> >>>>> + *     * To support XRGB8888 if it's not supported by the hardware.  
> >>>>
> >>>>     
> >>>>> + *     * Any driver is free to modify its internal representation of the format,
> >>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
> >>>>> + *       modify the user-provided buffer. An example would be to drop the
> >>>>> + *       padding component from a format to save some memory bandwidth.  
> >>>>
> >>>> I have strong objections to this point, _especially_ as you're
> >>>> apparently trying to sneak this in after our discussion. NAK on this
> >>>> part from my side.
> >>>>
> >>>> If you want userspace to be able to use a certain format, then export
> >>>> the corresponding 4cc code. Then let userspace decide what to do about
> >>>> it. If userspace pick a certain format, go with it.  
> >>>
> >>> What is the reason for your objection, exactly?
> >>>      
> >>>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
> >>>> possible.  
> >>>
> >>> For the particular driver in question though, the conversion allows
> >>> using a display resolution that is otherwise not possible. I also hear
> >>> it improves performance since 25% less data needs to travel across a
> >>> slow bus. There is also so little VRAM, than all dumb buffers need to
> >>> be allocated from sysram instead anyway, so a copy is always necessary.
> >>>
> >>> Since XRGB8888 is the one format that is recommended to be supported by
> >>> all drivers, I don't see a problem here. Did you test on your
> >>> incredibly slow g200 test rig if the conversion ends up hurting instead
> >>> of helping performance there?
> >>>
> >>> If it hurts, then I see that you have a good reason to NAK this.
> >>>
> >>> It's hard to imagine how it would hurt, since you always need a copy
> >>> from sysram dumb buffers to VRAM - or do you?  
> >>
> >> I have a number of concerns. My point it not that we shouldn't optimize.
> >> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888
> >> for userspace to use.
> >>
> >> AFAICT the main argument against userspace is that Mesa doesn't like
> >> 3-byte pixels. But I don't see how this conversion cannot be a
> >> post-processing step within Mesa: do the rendering in RGB32 and then
> >> convert to a framebuffer in RGB24. Userspace can do that more
> >> efficiently than the kernel. This has all of the upsides of reduced
> >> bandwidth, but none of the downsides of kernel code. Applications and/or
> >> Mesa would be in control of the buffer format and apply the optimization
> >> where it makes sense. And it would be available for all drivers that are
> >> similar to mgag200.
> >>
> >> My main point is simplicity of the driver: I prefer the driver to be
> >> simple without unnecessary indirection or overhead. Optimizations like
> >> these my or may not work on a given system with a certain workload. I'd
> >> better leave this heuristic to userspace.
> >>
> >> Another point of concern is CPU consumption: Slow I/O buses may stall
> >> the display thread, but the CPU could do something else in the meantime.
> >> Doing format conversion on the CPU prevents that, hence affecting other
> >> parts of the system negatively. Of course, that's more of a gut feeling
> >> than hard data.
> >>
> >> Please note that the kernel's conversion code uses memory allocation of
> >> intermediate buffers. We even recently had a discussion about allocation
> >> overhead during display updates. Userspace can surely do a better job at
> >> keeping such buffers around.
> >>
> >> And finally a note the hardware itself: on low-end hardware like those
> >> Matrox chips, just switch to RGB16. That will be pretty and fast enough
> >> for these chips' server systems. Anyone who cares about fast and
> >> beautiful should buy a real graphics card.  
> > 
> > Fair enough.
> > 
> > Did you consider that every frame will be copied twice: once in
> > userspace to get RGB888, then again by the driver into VRAM, since dumb
> > buffers reside in sysram?  
> 
> In the kernel, we reduce the copying to the changed parts, if we have 
> damage information from userspace. IDK Mesa's software renderer, but it 
> could certainly apply a similar optimization.

I have already assumed that everything uses damage information to
optimize the regions to copy. It's still two copies instead of one.
Actually, it's slightly more than two copies.

Due to damage tracking and the driver needing to flip between two VRAM
buffers, it is always copying current + previous damage, not only
current damage.

> > I suspect the intermediate buffers (dumb buffers in this case) need to
> > be full frame size rather than a scanline, too. I'm not sure why the
> > driver would need any scratch buffers beyond a couple dozen bytes while
> > doing a software conversion, just enough to have the lowest common
> > multiple of 3 bytes and optimal bus write width.  
> 
> For the conversion in the kernel we allocate enough temporary memory to 
> hold the part of each scanline that changed. Then we go over dirty 
> scanlines, convert each into the output format and store it in that 
> temporary buffer. Then copy the temporary buffer to VRAM. The buffer can 
> be reused for the scanlines of a single conversion. But the nature of 
> the framebuffer (buffers are possibly shared with concurrent access from 
> multiple planes) makes it hard to keep that temporary memory around. 
> Hence it's freed after each conversion. The code is at [1].

Yes, that's the conversion in the kernel. However, before the kernel
can run that conversion, userspace must have already allocated full
sized dumb buffers to convert its full sized internal framebuffer into.
Since KMS is modernly used with double-buffering, there must always be
two dumb buffers, which means updating a dumb buffer needs to copy not
just current damage but also previous damage. Userspace has no way to
know that single-buffering would be equally good in this case for this
particular driver.

If userspace gave its internal framebuffer to the kernel without doing
the conversion to RGB888, then that internal buffer would be the dumb
buffer, and there would be no need to allocate a second full size
buffer (third, because double-buffering towards KMS).

The driver allocating even full scanlines would be a lot less memory
touched if userspace didn't convert to RGB888, and if the driver used a
tailored conversion function (it literally needs to handle only a single
combination of input and output conditions), it wouldn't need even that
nor to allocate and free on every conversion. I understand you do not
want to pay the cost having such special-case code, and that's ok. It
would just be even further optimization after getting rid of a static
full sized buffer allocation.


> I assume that a userspace software renderer could do a much better job 
> at keeping the temporary memory allocated.

I'm not sure why the kernel can't keep the temporary scanline buffer
allocated with the CRTC. It only needs to be reallocated if it's too
small. Sure, people probably don't want to spend time on such code.

All the above discussion is based on the assumption that dumb buffers
are allocated in sysram.

It's fine to say you don't want to optimize, but I want to be clear on
the exact trade-off.

The trade-offs vary greatly depending on each particular use case,
which is why I wouldn't make a hard rule of no in-kernel conversions,
just a rule of thumb since it's *usually* not useful or is actively
hurting. Here we are talking about XRGB8888 which is already exempt
from the rule of thumb, with two more special conditions: dumb buffers
in sysram, and the performance traits of RGB888 on the specific
hardware.


Thanks,
pq

> 
> Best regards
> Thomas
> 
> [1] 
> https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88
> 
> > 
> > 
> > Thanks,
> > pq
> >   
> >>
> >> Best regards
> >> Thomas
> >>  
> >>>
> >>>
> >>> Thanks,
> >>> pq
> >>>      
> >>>>> + *     On most hardware, VRAM read access are slow, so when doing the software
> >>>>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
> >>>>> + *     have decent performance.
> >>>>> + *     Extra care should be taken when doing software conversion with
> >>>>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> >>>>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
> >>>>>      */
> >>>>>     
> >>>>>     static unsigned int drm_num_planes(struct drm_device *dev)
> >>>>>
> >>>>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781  
> >>>>     
> >>>      
> >>  
> >   
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20230911/fae218e1/attachment-0001.sig>


More information about the dri-devel mailing list