[RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color pipeline is needed

Tue Nov 7 16:52:19 UTC 2023

On 2023-11-04 19:01, Christopher Braga wrote:
> Just want to loop back to before we branched off deeper into the programming performance talk
> 
> On 10/26/2023 3:25 PM, Alex Goins wrote:
>> On Thu, 26 Oct 2023, Sebastian Wick wrote:
>>
>>> On Thu, Oct 26, 2023 at 11:57:47AM +0300, Pekka Paalanen wrote:
>>>> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT)
>>>> Alex Goins <agoins at nvidia.com> wrote:
>>>>
>>>>> Thank you Harry and all other contributors for your work on this. Responses
>>>>> inline -
>>>>>
>>>>> On Mon, 23 Oct 2023, Pekka Paalanen wrote:
>>>>>
>>>>>> On Fri, 20 Oct 2023 11:23:28 -0400
>>>>>> Harry Wentland <harry.wentland at amd.com> wrote:
>>>>>>
>>>>>>> On 2023-10-20 10:57, Pekka Paalanen wrote:
>>>>>>>> On Fri, 20 Oct 2023 16:22:56 +0200
>>>>>>>> Sebastian Wick <sebastian.wick at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for continuing to work on this!
>>>>>>>>>
>>>>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote:

snip

>>>>> Actually, the current examples in the proposal don't include a multiplier color
>>>>> op, which might be useful. For AMD as above, but also for NVIDIA as the
>>>>> following issue arises:
>>>>>
>>>>> As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed
> 
> If possible, let's declare this as two blocks. One that informatively declares the conversion is present, and another for the de-gamma. This will help with block-reuse between vendors.
> 
>>>>> point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps
>>>>> to in floating point varies depending on the source content. If it's SDR
>>>>> content, we want the max value in FP16 to be 1.0 (80 nits), subject to a
>>>>> potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ
>>>>> content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption
>>>>> is that this is also what AMD's "HDR Multiplier" stage is used for, is that
>>>>> correct?
>>>>
>>>> It would be against the UAPI design principles to tag content as HDR or
>>>> SDR. What you can do instead is to expose a colorop with a multiplier of
>>>> 1.0 or 125.0 to match your hardware behaviour, then tell your hardware
>>>> that the input is SDR or HDR to get the expected multiplier. You will
>>>> never know what the content actually is, anyway.
>>
>> Right, I didn't mean to suggest that we should tag content as HDR or SDR in the
>> UAPI, just relating to the end result in the pipe, ultimately it would be
>> determined by the multiplier color op.
>>
> 
> A multiplier could work but we would should give OEMs the option to either make it "informative" and fixed by the hardware, or fully configurable. With the Qualcomm pipeline how we absorb FP16 pixel buffers, as well as how we convert them to fixed point data actually has a dependency on the desired de-gamma and gamma processing. So for an example:
> 
> If a source pixel buffer is scRGB encoded FP16 content we would expect input pixel content to be up to 7.5, with the IGC output reaching 125 as in the NVIDIA case. Likewise gamma 2.2 encoded FP16 content would be 0-1 in and 0-1 out.
> 
> So in the Qualcomm case the expectations are fixed depending on the use case.
> 
> It is sounding to me like we would need to be able to declare three things here:
> 1. Value range expectations *into* the de-gamma block. A multiplier wouldn't work here because it would be more of a clipping operation. I guess we would have to add an explicit clamping block as well.
> 2. What the value range expectations  at the *output* of de-gamma processing block. Also covered by using another multiplier block.
> 3. Value range expectations *into* a gamma processing block. This should be covered by declaring a multiplier post-csc, but only assuming CSC output is normalized in the desired value range. A clamping block would be preferable because it describes what happens when it isn't.
> 

What about adding informational input and output range properties
to colorops? I think Intel's PWL definitions had something like
that, but I'd have to take a look at that again. While I'm not
in favor of defining segmented LUTs at the uAPI the input/output
ranges seem to be something of value.

> All this is do-able, but it seems like it would require the definition of multiple color pipelines to expose the different limitations for color block configuration combinations. Additionally, would it be easy for user space to find the right pipeline?
> 

I'm also a little concerned that some of these proposals mean we'd
have to expose an inordinate number of color pipelines and color
pipeline selection becomes difficult and error prone.

snip

>>>> Given that elements like various kinds of look-up tables inherently
>>>> assume that the domain is [0.0, 1.0] (because the it is a table that
>>>> has a beginning and an end, and the usual convention is that the
>>>> beginning is zero and the end is one), I think it is best to stick to
>>>> the [0.0, 1.0] range where possible. If we go out of that range, then
>>>> we have to define how a LUT would apply in a sensible way.
>>
>> In my last reply I mentioned a static (but actually programmable) LUT that is
>> typically used to convert FP16 linear pixels to fixed point PQ before handing
>> them to the scaler and tone mapping operator. You're actually right that it
>> indexes in the fixed point [0.0, 1.0] range for the reasons you describe, but
>> because the input pixels are expected to be FP16 in the [0.0, 125.0] range, it
>> applies a non-programmable 1/125.0 normalization factor first.
>>
>> In this case, you could think of the LUT as indexing on [0.0, 125.0], but as you
>> point out there would need to be some way to describe that. Maybe we actually
>> need a fractional multiplier / divider color op. NVIDIA pipes that include this
>> LUT would need to include a mandatory 1/125.0 factor immediately prior to the
>> LUT, then LUT can continue assuming a range of [0.0, 1.0].
>>
>> Assuming you are using the hardware in a conventional way, specifying a
>> multiplier of 1.0 after the "degamma" LUT would then map to the 80-nit PQ range
>> after the static (but actually programmable) PQ LUT, whereas specifying a
>> multiplier of 125.0 would map to the 10,000-nit PQ range, which is what we want.
>> I guess it's kind of messy, but the effect would be that color ops other than
>> multipliers/dividers would still be in the [0.0, 1.0] domain, and any multiplier
>> that exceeds that range would have to be normalized by a divider before any
>> other color op.
>>
> 
> Hmm. A multiplier would resolve issues when input linear FP16 data that has different ideas on what 1.0 means in regards to nits values (think of Apple's EDR as an example). For a client to go from their definition to hardware definition of 1.0 = x nits, we would need to expose what the pipeline sees as 1.0 though. So in this case the multiplier would be programmable, but the divisor is informational? It seems like the later would have an influence on how the former is programmed.
> 

A programmable multiplier would either need to be backed by a HW block
to perform the operation or require a driver to scale the LUT or matrix
values of an adjacent LUT or matrix block.

snip

>>>>>>
>>>>>> Yeah, this is why we need a definition. I understand "informational" to
>>>>>> not change pixel values in any way. Previously I had some weird idea
>>>>>> that scaling doesn't alter color, but of course it may.
>>>>>
>>>>> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do
>>>>> implicit fixed-point to FP16 conversions, and vice versa.
>>>>
>>>> Above, I claimed that the UAPI should be defined in nominal
>>>> floating-point values, but I wonder, would that work? Would we need to
>>>> have explicit colorops for converting from raw pixel data values into
>>>> nominal floating-point in the UAPI?
>>
>> Yeah, I think something like that is needed, or another solution as discussed
>> below. Even if we define the UAPI in terms of floating point, the actual
>> underlying pixel format needs to match the expectations of each stage as it
>> flows through the pipe.
>>
> 
> Strongly agree on this. Pixel format and block relationships definitely exist.
> 

Interesting to see this isn't just an AMD thing. :)

snip

>>>>
>>>> Both blending and scaling are fundamentally the same operation: you
>>>> have two or more source colors (pixels), and you want to compute a
>>>> weighted average of them following what happens in nature, that is,
>>>> physics, as that is what humans are used to.
>>>>
>>>> Both blending and scaling will suffer from the same problems if the
>>>> operation is performed on not light-linear values. The result of the
>>>> weighted average does not correspond to physics.
>>>>
>>>> The problem may be hard to observe with natural imagery, but Josh's
>>>> example shows it very clearly. Maybe that effect is sometimes useful
>>>> for some imagery in some use cases, but it is still an accidental
>>>> side-effect. You might get even better results if you don't rely on
>>>> accidental side-effects but design a separate operation for the exact
>>>> goal you have.
>>>>
>>>> Mind, by scaling we mean changing image size. Not scaling color values.
>>>>
>>
>> Fair enough, but it might not always be a choice given the hardware.
>>
> 
> Agreeing with Alex here. I get there is some debate over the best way to do this, but I think it is best to leave it up to the driver to declare how that is done.

Same.

snip

>>>>
>>>> What I was left puzzled about after the XDC workshop is that is it
>>>> possible to pre-load configurations in the background (slow), and then
>>>> quickly switch between them? Hardware-wise I mean.
>>
>> This works fine for our "fast" LUTs, you just point them to a surface in video
>> memory and they flip to it. You could keep multiple surfaces around and flip
>> between them without having to reprogram them in software. We can easily do that
>> with enumerated curves, populating them when the driver initializes instead of
>> waiting for the client to request them. You can even point multiple hardware
>> LUTs to the same video memory surface, if they need the same curve.
>>
>>>
>>> We could define that pipelines with a lower ID are to be preferred over
>>> higher IDs.
>>
>> Sure, but this isn't just an issue with a pipeline as a whole, but the
>> individual elements within it and how to use them in a given context.
>>
>>>
>>> The issue is that if programming a pipeline becomes too slow to be
>>> useful it probably should just not be made available to user space.
>>
>> It's not that programming the pipeline is overall too slow. The LUTs we have
>> that are relatively slow to program are meant to be set infrequently, or even
>> just once, to allow the scaler and tone mapping operator to operate in fixed
>> point PQ space. You might still want the tone mapper, so you would choose a
>> pipeline that includes them, but when it comes to e.g. animating a night light,
>> you would want to choose a different LUT for that purpose.
>>
>>>
>>> The prepare-commit idea for blob properties would help to make the
>>> pipelines usable again, but until then it's probably a good idea to just
>>> not expose those pipelines.
>>
>> The prepare-commit idea actually wouldn't work for these LUTs, because they are
>> programmed using methods instead of pointing them to a surface. I'm actually not
>> sure how slow it actually is, would need to benchmark it. I think not exposing
>> them at all would be overkill, since it would mean you can't use the preblending
>> scaler or tonemapper, and animation isn't necessary for that.
>>
>> The AMD 3DLUT is another example of a LUT that is slow to update, and it would
>> obviously be a major loss if that wasn't exposed. There just needs to be some
>> way for clients to know if they are going to kill performance by trying to
>> change it every frame.
>>
>> Thanks,
>> Alex
>>
> 
> To clarify, what are we defining as slow to update here? Something we aren't able to update within a frame (let's say at a low frame rate such as 30 fps for discussion's sake)? A block that requires a programming sequence of disable + program + enable to update? Defining performance seems like it can get murky if we start to consider frame concurrent updates among multiple color blocks as well.
> 

I think any definition for slow would need to be imprecise on some level.
In the AMD 3DLUT case we can take around 8 ms. Some compositors need the
programming time to be well under 1 ms, even for low frame rates. Those
compositors might want to know if an operation might be undesirable if
they care about latency. I'm not sure we could reliably indicate more.

Harry

> Thanks,
> Christopher
> 
>>>
>>>>
>>>>
>>>> Thanks,
>>>> pq
>>>
>>>