[PATCH 6/6] drm: vkms: Refactor the plane composer to accept new formats

Igor Matheus Andrade Torrente igormtorrente at gmail.com
Tue Oct 19 21:10:57 UTC 2021


Hi Pekka,

On 10/19/21 5:05 AM, Pekka Paalanen wrote:
> On Mon, 18 Oct 2021 16:26:06 -0300
> Igor Matheus Andrade Torrente <igormtorrente at gmail.com> wrote:
> 
>> Hi Pekka,
>>
>> On 10/18/21 5:30 AM, Pekka Paalanen wrote:
>>> On Tue,  5 Oct 2021 17:16:37 -0300
>>> Igor Matheus Andrade Torrente <igormtorrente at gmail.com> wrote:
>>>    
>>>> Currently the blend function only accepts XRGB_8888 and ARGB_8888
>>>> as a color input.
>>>>
>>>> This patch refactors all the functions related to the plane composition
>>>> to overcome this limitation.
>>>>
>>>> Now the blend function receives a format handler for each plane and a
>>>> blend function pointer. It will take two ARGB_16161616 pixels, one
>>>> for each handler, and will use the blend function to calculate and
>>>> store the final color in the output buffer.
>>>>
>>>> These format handlers will receive the `vkms_composer` and a pair of
>>>> coordinates. And they should return the respective pixel in the
>>>> ARGB_16161616 format.
>>>>
>>>> The blend function will receive two ARGB_16161616 pixels, x, y, and
>>>> the vkms_composer of the output buffer. The method should perform the
>>>> blend operation and store the output in the aforementioned
>>>> ARGB_16161616 format.
>>>
>>> Hi,
>>>
>>> here are some drive-by comments which you are free to take or leave.
>>>
>>> To find more efficient ways to do this, you might want to research the
>>> Pixman library. It's MIT licensed.
>>>
>>> IIRC, the elementary operations there operate on pixel lines (pointer +
>>> length), not individual pixels (image, x, y). The input conversion
>>> function reads and converts a whole line at a time. The blending
>>> function blends two lines and writes the output back into one of the
>>> lines. The output conversion function takes a line and converts it into
>>> the destination buffer. That way the elementary operations do not need
>>> to take stride into account, and blending does not need to deal with
>>> varying memory alignment, FWIW. The inner loops involve far fewer
>>> function calls and less state, probably.
>>
>> I was doing some rudimentary profiling, and I noticed that the code
>> spends most of the time with the indirect system call overhead and not
>> the actual computation. This can definitely help with it.
> 
> Hi,
> 
> I suppose you mean function (pointer) call, not system call?

Yes, I misspelled it.

> 
> Really good that you have already profiled things. All optimisations
> should be guided by profiling, otherwise they are just guesses (and I
> got lucky this time I guess).
> 
>>> Pixman may not do exactly this, but it does something very similar.
>>> Pixman also has multiple different levels of optimisations, which may
>>> not be necessary for VKMS.
>>>
>>> It's a trade-off between speed and temporary memory consumed. You need
>>> temporary buffers for two lines, but not more (not a whole image in
>>> full intermediate precision). Further optimisation could determine
>>> whether to process whole image rows as lines, or split a row into
>>> multiple lines to stay within CPU cache size.
>>>    
>>
>> Sorry, I didn't understand the idea of the last sentence.
> 
> If an image is very wide, a single row could still be relatively large
> in size (bytes). If it is large enough that the working set falls out
> of a faster CPU cache into a slower CPU cache (or worse yet, into RAM
> accesses), performance may suffer and become memory bandwidth limited
> rather than ALU rate limited. Theoretically that can be worked around
> by limiting the maximum size of a line, and splitting an image row into
> multiple lines.

Got it now, thanks.

> 
> However, this is an optimisation one probably should not do until there
> is performance profiling data showing that it actually is useful. The
> optimal line size limit depends on the CPU model as well. So it's a bit
> far out, but something you could keep in mind just in case.

OK. For now I will not deal with this complexity, and if necessary I
will revisit it later.

> 
>>> Since it seems you are blending multiple planes into a single
>>> destination which is assumed to be opaque, you might also be able to do
>>> the multiple blends without convert-writing and read-converting to/from
>>> the destination between every plane. I think that might be possible to
>>> architect on top of the per-line operations still.
>>
>> I tried it. But I don't know how to do this without it looking like a mess.

I don't think it changes anything, but I forgot to mention that I tried
it with blending per pixel rather than per line.
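
Just to check that I am picturing the per-line idea correctly, here is a
rough sketch of how I imagine the helpers would look. All names below are
made up for illustration, none of this exists in VKMS yet:

/* Reads one plane row and converts it to ARGB16161616. */
typedef void (*read_line_t)(const struct vkms_composer *src, int y,
			    u64 *out_line, int length);

/* Blends a source line over a destination line, in place. */
typedef void (*blend_line_t)(const u64 *src_line, u64 *dst_line, int length);

/* Converts an ARGB16161616 line and writes it to the output buffer. */
typedef void (*write_line_t)(const u64 *src_line, struct vkms_composer *dst,
			     int y, int length);

static void blend_row(read_line_t read_src, read_line_t read_dst,
		      blend_line_t blend, write_line_t write_dst,
		      const struct vkms_composer *src,
		      struct vkms_composer *dst,
		      u64 *src_line, u64 *dst_line, int y, int length)
{
	read_src(src, y, src_line, length);
	read_dst(dst, y, dst_line, length);
	blend(src_line, dst_line, length);	/* dst_line = src over dst */
	write_dst(dst_line, dst, y, length);
}

This way the function pointers are called a handful of times per row
instead of once per pixel, and the per-pixel loops stay inside the line
helpers.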

> 
> Dedicate one of the temporary line buffers for the destination, and
> instead of writing it out and loading it back for each input plane,
> leave it in place over all planes and write it out just once at the end.
> 
> I do expect more state tracking needed. You need to iterate over the
> list of planes for each output row, extract only the used part of an
> input plane's buffer into the other temporary line buffer, and offset
> the destination line buffer and length to match when passing those into
> a blending function.

It's exactly this state tracking that turned into a mess when I was
trying to implement something similar. But this is one more thing that I
will probably have to revisit later.
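
To write down what I understood from your suggestion while it is fresh,
something along these lines; again just a sketch, with invented names and
fields, assuming the bottom plane is opaque and covers the whole output:

struct vkms_plane_composer {
	struct vkms_composer *composer;
	int dst_x, dst_y, dst_w, dst_h;	/* position in the output */
	void (*read_line)(struct vkms_plane_composer *p, int y,
			  u64 *line, int length);
};

static void compose_row(struct vkms_composer *out, int y, int out_width,
			struct vkms_plane_composer **planes, int n_planes,
			u64 *dst_line, u64 *src_line)
{
	int i;

	/* The bottom plane becomes the initial destination row. */
	planes[0]->read_line(planes[0], y, dst_line, out_width);

	for (i = 1; i < n_planes; i++) {
		struct vkms_plane_composer *p = planes[i];

		if (y < p->dst_y || y >= p->dst_y + p->dst_h)
			continue;	/* plane does not touch this row */

		/* Convert only the used part of this plane's row... */
		p->read_line(p, y - p->dst_y, src_line, p->dst_w);

		/* ...and blend it over the matching slice of dst_line,
		 * which stays in place across all the planes. */
		alpha_blend_line(src_line, dst_line + p->dst_x, p->dst_w);
	}

	/* The destination row is converted and written out only once. */
	write_output_line(out, y, dst_line, out_width);
}

alpha_blend_line() and write_output_line() would be the per-line blend and
output-conversion helpers. It still needs care with clipping planes that go
past the output edges, but maybe written per row like this it ends up less
messy than my first attempt.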

> 
> It's not an obvious win but a trade-off, so profiling is again needed.
> 
> Btw. the use of temporary line buffers should also help with
> implementing scaling. You could do the filtering during reading of the
> input buffer. If the filter is not nearest, meaning you need to access
> more than one input pixel per pixel-for-blending, there are a few ways
> to go about that. You could do the filtering during the input buffer
> reading, or you could load two input buffer rows into temporary line
> buffers and do filtering as a separate step into yet another line
> buffer. As the composition advances from top to bottom, only one of the
> input buffer rows will change at a time (during up-scaling) so you can
> avoid re-loading a row by swapping the line buffers.
> 
> This is getting ahead of things, so don't worry about scaling or
> filtering yet. The first thing is to see if you can make the line
> buffer approach give you a significant speed-up.
> 
>> Does Pixman do something similar?
> 
> No, Pixman composition operation has only three images: source,
> mask, and destination. All those it can handle simultaneously, but
> since there is no "multi-blending" API, it doesn't need to take care of
> more.
> 
> IIRC, Pixman also has a form of optimised operations that do blending
> and converting to destination in the same pass. The advantage of that
> is that blending can work with less precision when it knows what
> precision the output will be converted to and it saves some bandwidth
> by not needing to write-after-blending and read-for-conversion
> intermediate precision values. The disadvantage is that the number of
> needed specialised blending functions explodes by the number of
> possible destination formats. Pixman indeed makes those specialised
> functions optional, falling back to more generic C code. I would hope
> that VKMS does not need to go this far in optimisations though.

This should be plenty fast indeed. Maybe it is worth it for formats that
are extremely common.
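
If it ever comes to that, I imagine it would look roughly like the sketch
below for the most common case: a fused "blend ARGB16161616 over opaque
XRGB8888" line helper. The names are invented, the math is the usual
"src over dst", and the ">> 16" is only an approximation of dividing by
0xffff:

static u8 blend_channel_to_u8(u32 src16, u32 dst16, u32 alpha16)
{
	/* "src over dst" in 16 bits; ">> 16" approximates "/ 0xffff". */
	u64 mix = (u64)src16 * alpha16 + (u64)dst16 * (0xffff - alpha16);
	u32 out16 = (mix + 0x8000) >> 16;

	return DIV_ROUND_UP(out16, 257);	/* back to 8 bits */
}

static void blend_line_over_xrgb8888(const u64 *src_line, u8 *dst, int length)
{
	int i;

	for (i = 0; i < length; i++, dst += 4) {
		u64 px = src_line[i];
		u32 a = (px >> 48) & 0xffff;

		/* XRGB8888 is little-endian: B, G, R, X in memory. */
		dst[0] = blend_channel_to_u8(px & 0xffff, dst[0] * 257, a);
		dst[1] = blend_channel_to_u8((px >> 16) & 0xffff, dst[1] * 257, a);
		dst[2] = blend_channel_to_u8((px >> 32) & 0xffff, dst[2] * 257, a);
		dst[3] = 0xff;
	}
}

But as you said, hopefully VKMS never needs one of these per destination
format.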

> 
>>>    
>>>> Signed-off-by: Igor Matheus Andrade Torrente <igormtorrente at gmail.com>
>>>> ---
>>>>    drivers/gpu/drm/vkms/vkms_composer.c | 275 ++++++++++++++-------------
>>>>    drivers/gpu/drm/vkms/vkms_formats.h  | 125 ++++++++++++
>>>>    2 files changed, 271 insertions(+), 129 deletions(-)
>>>>    create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
>>>
>>> ...
>>>    
>>>> +
>>>> +u64 ARGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> +	/*
>>>> +	 * Organizes the channels in their respective positions and converts
>>>> +	 * the 8-bit channels to 16 bits.
>>>> +	 * The 257 is the "conversion ratio". This number is obtained from the
>>>> +	 * division (2^16 - 1) / (2^8 - 1), which, in this case, tries to get
>>>> +	 * the best color value in a color space with more possibilities.
>>>
>>> Pixel format, not color space.
>>>
>>>> +	 * And a similar idea applies to other RGB color conversions.
>>>> +	 */
>>>> +	return ((u64)pixel_addr[3] * 257) << 48 |
>>>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>>>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>>>> +	       ((u64)pixel_addr[0] * 257);
>>>
>>> I wonder if the compiler is smart enough to choose between "mul 257"
>>> and "(v << 8) | v" operations... but that's probably totally drowning
>>> under the overhead of per (x,y) looping.
>>
>> I disassembled the code to check it. And looks like the compiler is
>> replacing the multiplication with shifts and additions.
>>
>> ARGB8888_to_ARGB16161616:
>>      0xffffffff816ad660 <+0>:     imul   0x12c(%rdi),%edx
>>      0xffffffff816ad667 <+7>:     imul   0x130(%rdi),%esi
>>      0xffffffff816ad66e <+14>:    add    %edx,%esi
>>      0xffffffff816ad670 <+16>:    add    0x128(%rdi),%esi
>>      0xffffffff816ad676 <+22>:    movslq %esi,%rax
>>      0xffffffff816ad679 <+25>:    add    0xe8(%rdi),%rax
>>      0xffffffff816ad680 <+32>:    movzbl 0x3(%rax),%ecx
>>      0xffffffff816ad684 <+36>:    movzbl 0x2(%rax),%esi
>>      0xffffffff816ad688 <+40>:    mov    %rcx,%rdx
>>      0xffffffff816ad68b <+43>:    shl    $0x8,%rdx
>>      0xffffffff816ad68f <+47>:    add    %rcx,%rdx
>>      0xffffffff816ad692 <+50>:    mov    %rsi,%rcx
>>      0xffffffff816ad695 <+53>:    shl    $0x8,%rcx
>>      0xffffffff816ad699 <+57>:    shl    $0x30,%rdx
>>      0xffffffff816ad69d <+61>:    add    %rsi,%rcx
>>      0xffffffff816ad6a0 <+64>:    movzbl (%rax),%esi
>>      0xffffffff816ad6a3 <+67>:    shl    $0x20,%rcx
>>      0xffffffff816ad6a7 <+71>:    or     %rcx,%rdx
>>      0xffffffff816ad6aa <+74>:    mov    %rsi,%rcx
>>      0xffffffff816ad6ad <+77>:    shl    $0x8,%rcx
>>      0xffffffff816ad6b1 <+81>:    add    %rsi,%rcx
>>      0xffffffff816ad6b4 <+84>:    or     %rcx,%rdx
>>      0xffffffff816ad6b7 <+87>:    movzbl 0x1(%rax),%ecx
>>      0xffffffff816ad6bb <+91>:    mov    %rcx,%rax
>>      0xffffffff816ad6be <+94>:    shl    $0x8,%rax
>>      0xffffffff816ad6c2 <+98>:    add    %rcx,%rax
>>      0xffffffff816ad6c5 <+101>:   shl    $0x10,%rax
>>      0xffffffff816ad6c9 <+105>:   or     %rdx,%rax
>>      0xffffffff816ad6cc <+108>:   ret
> 
> Nice!
> 
>>>    
>>>> +}
>>>> +
>>>> +u64 XRGB8888_to_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> +	u8 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> +	/*
>>>> +	 * The same as the ARGB8888, but with the alpha channel set to the
>>>> +	 * maximum possible value.
>>>> +	 */
>>>> +	return 0xffffllu << 48 |
>>>> +	       ((u64)pixel_addr[2] * 257) << 32 |
>>>> +	       ((u64)pixel_addr[1] * 257) << 16 |
>>>> +	       ((u64)pixel_addr[0] * 257);
>>>> +}
>>>> +
>>>> +u64 get_ARGB16161616(struct vkms_composer *composer, int x, int y)
>>>> +{
>>>> +	__le64 *pixel_addr = packed_pixels_addr(composer, x, y);
>>>> +
>>>> +	/*
>>>> +	 * Because the format byte order is little-endian and this code
>>>> +	 * needs to run on big-endian machines too, we need to convert
>>>> +	 * the byte order from little-endian to the CPU's native byte order.
>>>> +	 */
>>>> +	return le64_to_cpu(*pixel_addr);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The following functions are used as blend operations. But unlike the
>>>> + * `alpha_blend`, these functions take an ARGB16161616 pixel from the
>>>> + * source, convert it to a specific format, and store it in the destination.
>>>> + *
>>>> + * They are used in the `compose_active_planes` and `write_wb_buffer` to
>>>> + * copy and convert one pixel from/to the output buffer to/from
>>>> + * another buffer (e.g. writeback buffer, primary plane buffer).
>>>> + */
>>>> +
>>>> +void convert_to_ARGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> +			 struct vkms_composer *dst_composer)
>>>> +{
>>>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> +	/*
>>>> +	 * This sequence below is important because the format's byte order is
>>>> +	 * in little-endian. In the case of the ARGB8888 the memory is
>>>> +	 * organized this way:
>>>> +	 *
>>>> +	 * | Addr     | = blue channel
>>>> +	 * | Addr + 1 | = green channel
>>>> +	 * | Addr + 2 | = Red channel
>>>> +	 * | Addr + 3 | = Alpha channel
>>>> +	 */
>>>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>>>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>>>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>>>> +	pixel_addr[3] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 48)) >> 48, 257);
>>>
>>> This could be potentially expensive if the compiler ends up using an
>>> actual div instruction.
>>>    
>> Yes, I'm using DIV_ROUND_UP because I couldn't figure out how the
>> "Faster div by 255" trick works in order to adapt it to 16 bits.
> 
> But does the compiler actually do a div instruction? If not, then no
> worries. If it does, then maybe something to look into, *if* this shows
> up in profiling.

GCC isn't issuing any div instruction here.

convert_to_ARGB8888:
    0xffffffff816ad770 <+0>:	imul   0x12c(%rcx),%edx
    0xffffffff816ad777 <+7>:	mov    %rcx,%rax
    0xffffffff816ad77a <+10>:	imul   0x130(%rcx),%esi
    0xffffffff816ad781 <+17>:	add    %edx,%esi
    0xffffffff816ad783 <+19>:	movzwl %di,%edx
    0xffffffff816ad786 <+22>:	add    0x128(%rcx),%esi
    0xffffffff816ad78c <+28>:	add    $0x100,%rdx
    0xffffffff816ad793 <+35>:	movslq %esi,%rcx
    0xffffffff816ad796 <+38>:	movabs $0xff00ff00ff00ff01,%rsi
    0xffffffff816ad7a0 <+48>:	add    0xe8(%rax),%rcx
    0xffffffff816ad7a7 <+55>:	mov    %rdx,%rax
    0xffffffff816ad7aa <+58>:	mul    %rsi
    0xffffffff816ad7ad <+61>:	shr    $0x8,%rdx
    0xffffffff816ad7b1 <+65>:	mov    %dl,(%rcx)
    0xffffffff816ad7b3 <+67>:	mov    %rdi,%rdx
    0xffffffff816ad7b6 <+70>:	shr    $0x10,%rdx
    0xffffffff816ad7ba <+74>:	movzwl %dx,%edx
    0xffffffff816ad7bd <+77>:	add    $0x100,%rdx
    0xffffffff816ad7c4 <+84>:	mov    %rdx,%rax
    0xffffffff816ad7c7 <+87>:	mul    %rsi
    0xffffffff816ad7ca <+90>:	shr    $0x8,%rdx
    0xffffffff816ad7ce <+94>:	mov    %dl,0x1(%rcx)
    0xffffffff816ad7d1 <+97>:	mov    %rdi,%rdx
    0xffffffff816ad7d4 <+100>:	shr    $0x30,%rdi
    0xffffffff816ad7d8 <+104>:	shr    $0x20,%rdx
    0xffffffff816ad7dc <+108>:	movzwl %dx,%edx
    0xffffffff816ad7df <+111>:	add    $0x100,%rdx
    0xffffffff816ad7e6 <+118>:	mov    %rdx,%rax
    0xffffffff816ad7e9 <+121>:	mul    %rsi
    0xffffffff816ad7ec <+124>:	shr    $0x8,%rdx
    0xffffffff816ad7f0 <+128>:	mov    %dl,0x2(%rcx)
    0xffffffff816ad7f3 <+131>:	lea    0x100(%rdi),%rdx
    0xffffffff816ad7fa <+138>:	mov    %rdx,%rax
    0xffffffff816ad7fd <+141>:	mul    %rsi
    0xffffffff816ad800 <+144>:	shr    $0x8,%rdx
    0xffffffff816ad804 <+148>:	mov    %dl,0x3(%rcx)
    0xffffffff816ad807 <+151>:	ret
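
And for reference, if we ever hit a case where the compiler does not do
this, my understanding is that the reciprocal multiplication can also be
written by hand. Something like the helper below should compute
DIV_ROUND_UP(v, 257) for any 16-bit v without a div instruction
(65281 == ceil(2^24 / 257)), though I would run an exhaustive test over
the whole 16-bit range before trusting it:

/* Sketch: DIV_ROUND_UP(v, 257) for v in [0, 0xffff], div-free. */
static inline u8 div_257_round_up(u32 v)
{
	return ((v + 256) * 65281u) >> 24;
}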

> 
> 
> Thanks,
> pq
> 
>>> Btw. this would be shorter to write as
>>>
>>> 	(argb_src1 >> 16) & 0xffff
>>>
>>> etc.
>> I will use it in the V2. Thanks.
>>
>>>
>>> Thanks,
>>> pq
>>>    
>>>> +}
>>>> +
>>>> +void convert_to_XRGB8888(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> +			 struct vkms_composer *dst_composer)
>>>> +{
>>>> +	u8 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> +	pixel_addr[0] = DIV_ROUND_UP(argb_src1 & 0xffffllu, 257);
>>>> +	pixel_addr[1] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 16)) >> 16, 257);
>>>> +	pixel_addr[2] = DIV_ROUND_UP((argb_src1 & (0xffffllu << 32)) >> 32, 257);
>>>> +	pixel_addr[3] = 0xff;
>>>> +}
>>>> +
>>>> +void convert_to_ARGB16161616(u64 argb_src1, u64 argb_src2, int x, int y,
>>>> +			     struct vkms_composer *dst_composer)
>>>> +{
>>>> +	__le64 *pixel_addr = packed_pixels_addr(dst_composer, x, y);
>>>> +
>>>> +	*pixel_addr = cpu_to_le64(argb_src1);
>>>> +}
>>>> +
>>>> +#endif /* _VKMS_FORMATS_H_ */
>>>    
> 

