[PATCH v2 6/8] drm: vkms: Refactor the plane composer to accept new formats

Fri Nov 12 12:50:23 UTC 2021

Hi Pekka,

On Thu, Nov 11, 2021 at 11:37 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:
>
> On Thu, 11 Nov 2021 11:07:21 -0300
> Igor Torrente <igormtorrente at gmail.com> wrote:
>
> > Hi Pekka,
> >
> > On Thu, Nov 11, 2021 at 6:33 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:
> > >
> > > On Wed, 10 Nov 2021 13:56:54 -0300
> > > Igor Torrente <igormtorrente at gmail.com> wrote:
> > >
> > > > On Tue, Nov 9, 2021 at 8:40 AM Pekka Paalanen <ppaalanen at gmail.com> wrote:
> > > > >
> > > > > Hi Igor,
> > > > >
> > > > > again, that is a really nice speed-up. Unfortunately, I find the code
> > > > > rather messy and hard to follow. I hope my comments below help with
> > > > > re-designing it to be easier to understand.
> > > > >
> > > > >
> > > > > On Tue, 26 Oct 2021 08:34:06 -0300
> > > > > Igor Torrente <igormtorrente at gmail.com> wrote:
> > > > >
> > > > > > Currently the blend function only accepts XRGB_8888 and ARGB_8888
> > > > > > as a color input.
> > > > > >
> > > > > > This patch refactors all the functions related to the plane composition
> > > > > > to overcome this limitation.
> > > > > >
> > > > > > Now the blend function receives a struct `vkms_pixel_composition_functions`
> > > > > > containing two handlers.
> > > > > >
> > > > > > One will generate a buffer of each line of the frame with the pixels
> > > > > > converted to ARGB16161616. And the other will take this line buffer,
> > > > > > do some computation on it, and store the pixels in the destination.
> > > > > >
> > > > > > Both the handlers have the same signature. They receive a pointer to
> > > > > > the pixels that will be processed(`pixels_addr`), the number of pixels
> > > > > > that will be treated(`length`), and the intermediate buffer of the size
> > > > > > of a frame line (`line_buffer`).
> > > > > >
> > > > > > The first function has been totally described previously.
> > > > >
> > > > > What does this sentence mean?
> > > >
> > > > In the sentence "One will generate...", I give an overview of the two types of
> > > > handlers. And the overview of the first handler describes the full behavior of
> > > > it.
> > > >
> > > > But it doesn't look clear enough, I will improve it in the future.
> > > >
> > > > >
> > > > > >
> > > > > > The second is more interesting, as it has to perform two roles depending
> > > > > > on where it is called in the code.
> > > > > >
> > > > > > The first is to convert(if necessary) the data received in the
> > > > > > `line_buffer` and write in the memory pointed by `pixels_addr`.
> > > > > >
> > > > > > The second role is to perform the `alpha_blend`. So, it takes the pixels
> > > > > > in the `line_buffer` and `pixels_addr`, executes the blend, and stores
> > > > > > the result back to the `pixels_addr`.
> > > > > >
> > > > > > The per-line implementation was chosen for performance reasons.
> > > > > > The per-pixel functions were having performance issues due to indirect
> > > > > > function call overhead.
> > > > > >
> > > > > > The per-line code trades off memory for execution time. The `line_buffer`
> > > > > > allows us to diminish the number of function calls.
> > > > > >
> > > > > > Results in the IGT test `kms_cursor_crc`:
> > > > > >
> > > > > > |                     Frametime                       |
> > > > > > |:---------------:|:---------:|:----------:|:--------:|
> > > > > > |  implmentation  |  Current  |  Per-pixel | Per-line |
> > > > > > | frametime range |  8~22 ms  |  32~56 ms  |  6~19 ms |
> > > > > > |     Average     |  10.0 ms  |   35.8 ms  |  8.6 ms  |
> > > > > >
> > > > > > Reported-by: kernel test robot <lkp at intel.com>
> > > > > > Signed-off-by: Igor Torrente <igormtorrente at gmail.com>
> > > > > > ---
> > > > > > V2: Improves the performance drastically, by perfoming the operations
> > > > > >     per-line and not per-pixel(Pekka Paalanen).
> > > > > >     Minor improvements(Pekka Paalanen).
> > > > > > ---
> > > > > >  drivers/gpu/drm/vkms/vkms_composer.c | 321 ++++++++++++++++-----------
> > > > > >  drivers/gpu/drm/vkms/vkms_formats.h  | 155 +++++++++++++
> > > > > >  2 files changed, 342 insertions(+), 134 deletions(-)
> > > > > >  create mode 100644 drivers/gpu/drm/vkms/vkms_formats.h
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > > index 383ca657ddf7..69fe3a89bdc9 100644
> > > > > > --- a/drivers/gpu/drm/vkms/vkms_composer.c
> > > > > > +++ b/drivers/gpu/drm/vkms/vkms_composer.c
>
> ...
>
> > > > > > +struct vkms_pixel_composition_functions {
> > > > > > +     void (*get_src_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > > > +     void (*set_output_line)(void *pixels_addr, int length, u64 *line_buffer);
> > > > >
> > > > > I would be a little more comfortable if instead of u64 *line_buffer you
> > > > > would have something like
> > > > >
> > > > > struct line_buffer {
> > > > >         u16 *row;
> > > > >         size_t nelem;
> > > > > }
> > > > >
> > > > > so that the functions to be plugged into these function pointers could
> > > > > assert that you do not accidentally overflow the array (which would
> > > > > imply a code bug in kernel).
> > > > >
> > > > > One could perhaps go even for:
> > > > >
> > > > > struct line_pixel {
> > > > >         u16 r, g, b, a;
> > > > > };
> > > > >
> > > > > struct line_buffer {
> > > > >         struct line_pixel *row;
> > > > >         size_t npixels;
> > > > > };
> > > >
> > > > If we decide to follow this representation, would it be possible
> > > > to calculate the crc in the similar way that is being done currently?
> > > >
> > > > Something like that:
> > > >
> > > > crc = crc32_le(crc, line_buffer.row, w * sizeof(line_pixel));
> > >
> > > Hi Igor,
> > >
> > > yes. I think the CRC calculated does not need to be reproducible in
> > > userspace, so you can very well compute it from the internal
> > > intermediate representation. It also does not need to be portable
> > > between architectures, AFAIU.
> >
> > Great! This will make things easier.
> >
> > >
> > > > I mean, If the compiler can decide to put a padding somewhere, it
> > > > would mess with the crc value. Right?
> > >
> > > Padding could mess it up, yes. However, I think in kernel it is a
> > > convention to define structs (especially UAPI structs but this is not
> > > one) such that there is no implicit padding. So there must be some
> > > recommended practises on how to achieve and ensure that.
> > >
> > > The size of struct line_pixel as defined above is 8 bytes which is a
> > > "very round" number, and every field has the same type, so there won't
> > > be gaps between fields either. So I think the struct should already be
> > > fine and have no padding, but how to make sure it is, I'm not sure what
> > > you would do in kernel land.
> > >
> > > In userspace I would put a static assert to ensure that
> > > sizeof(struct line_pixel) = 8. That would be enough, because sizeof
> > > counts not just internal implicit padding but also the needed size
> > > extension for alignment in an array of those. The accumulated size of
> > > the fields individually is 8 bytes, so if the struct size is 8, there
> > > cannot be padding.
> > >
> >
> > Apparently the kernel uses a compiler extension in a macro to do this
> > kind of struct packing.
> >
> > include/linux/compiler_attributes.h
> > 265:#define __packed                        __attribute__((__packed__))
>
> Hi Igor,
>
> we do not actually want to force packing, though.
>
> If there would be padding without packing, then packing may incur a
> noticeable speed penalty in accessing the fields. We don't want to risk
> that.

I understand...

>
> So I think it's better to just assert that no padding exists instead.
> There would be something quite strange going on if there was padding in
> this case, but better safe than sorry, because debugging that would be
> awful.
>

OK. I will do that and also test some alternatives.

>
> Thanks,
> pq

Thanks,
---
Igor M. A. Torrente