[RFC PATCH v2 06/17] drm/doc/rfc: Describe why prescriptive color pipeline is needed

Harry Wentland harry.wentland at amd.com
Tue Nov 7 16:52:06 UTC 2023



On 2023-10-26 04:57, Pekka Paalanen wrote:
> On Wed, 25 Oct 2023 15:16:08 -0500 (CDT)
> Alex Goins <agoins at nvidia.com> wrote:
> 
>> Thank you Harry and all other contributors for your work on this. Responses
>> inline -
>>
>> On Mon, 23 Oct 2023, Pekka Paalanen wrote:
>>
>>> On Fri, 20 Oct 2023 11:23:28 -0400
>>> Harry Wentland <harry.wentland at amd.com> wrote:
>>>   
>>>> On 2023-10-20 10:57, Pekka Paalanen wrote:  
>>>>> On Fri, 20 Oct 2023 16:22:56 +0200
>>>>> Sebastian Wick <sebastian.wick at redhat.com> wrote:
>>>>>     
>>>>>> Thanks for continuing to work on this!
>>>>>>
>>>>>> On Thu, Oct 19, 2023 at 05:21:22PM -0400, Harry Wentland wrote:    
>>>>>>> v2:
>>>>>>>  - Update colorop visualizations to match reality (Sebastian, Alex Hung)
>>>>>>>  - Updated wording (Pekka)
>>>>>>>  - Change BYPASS wording to make it non-mandatory (Sebastian)
>>>>>>>  - Drop cover-letter-like paragraph from COLOR_PIPELINE Plane Property
>>>>>>>    section (Pekka)
>>>>>>>  - Use PQ EOTF instead of its inverse in Pipeline Programming example (Melissa)
>>>>>>>  - Add "Driver Implementer's Guide" section (Pekka)
>>>>>>>  - Add "Driver Forward/Backward Compatibility" section (Sebastian, Pekka)  
>>>>>
>>>>> ...
>>>>>  
>>>>>>> +An example of a drm_colorop object might look like one of these::
>>>>>>> +
>>>>>>> +    /* 1D enumerated curve */
>>>>>>> +    Color operation 42
>>>>>>> +    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 1D enumerated curve
>>>>>>> +    ├─ "BYPASS": bool {true, false}
>>>>>>> +    ├─ "CURVE_1D_TYPE": enum {sRGB EOTF, sRGB inverse EOTF, PQ EOTF, PQ inverse EOTF, …}
>>>>>>> +    └─ "NEXT": immutable color operation ID = 43  
>>
>> I know these are just examples, but I would also like to suggest the possibility
>> of an "identity" CURVE_1D_TYPE. BYPASS = true might get different results
>> compared to setting an identity in some cases depending on the hardware. See
>> below for more on this, RE: implicit format conversions.
>>
>> Although NVIDIA hardware doesn't use a ROM for enumerated curves, it came up in
>> offline discussions that it would nonetheless be helpful to expose enumerated
>> curves in order to hide the vendor-specific complexities of programming
>> segmented LUTs from clients. In that case, we would simply refer to the
>> enumerated curve when calculating/choosing segmented LUT entries.
> 
> That's a good idea.
> 
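
Agreed. To illustrate with a purely hypothetical, driver-internal sketch
(function names and segmentation logic made up, not part of the proposal):
a driver could sample the analytic definition of an enumerated curve to
fill its vendor-specific segmented LUT.

    #include <math.h>

    /* sRGB EOTF per IEC 61966-2-1; input and output in [0.0, 1.0] */
    static double srgb_eotf(double e)
    {
            return e <= 0.04045 ? e / 12.92 : pow((e + 0.055) / 1.055, 2.4);
    }

    /* fill one LUT segment spanning [start, end] with n (>= 2) entries */
    static void fill_segment(double *lut, int n, double start, double end)
    {
            int i;

            for (i = 0; i < n; i++)
                    lut[i] = srgb_eotf(start + (end - start) * i / (n - 1));
    }
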
>> Another thing that came up in offline discussions is that we could use multiple
>> color operations to program a single operation in hardware. As I understand it,
>> AMD has a ROM-defined LUT, followed by a custom 4K entry LUT, followed by an
>> "HDR Multiplier". On NVIDIA we don't have these as separate hardware stages, but
>> we could combine them into a singular LUT in software, such that you can combine
>> e.g. segmented PQ EOTF with night light. One caveat is that you will lose
>> precision from the custom LUT where it overlaps with the linear section of the
>> enumerated curve, but that is unavoidable and shouldn't be an issue in most
>> use-cases.
> 
> Indeed.
> 
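
A sketch of the kind of folding Alex describes, hypothetical
driver-internal code only (nearest-entry sampling for brevity, and
assuming the curve output is normalized to [0.0, 1.0]; real code would
interpolate and handle extended ranges):

    /* sample a custom 1D LUT with n entries at x in [0.0, 1.0] */
    static double lut_sample(const double *lut, int n, double x)
    {
            int i = (int)(x * (n - 1) + 0.5);

            if (i < 0)
                    i = 0;
            if (i > n - 1)
                    i = n - 1;
            return lut[i];
    }

    /*
     * Fold "enumerated curve followed by custom 1D LUT" into the one
     * hardware LUT: out[i] = custom(curve(x)), x evenly spaced in [0, 1].
     */
    static void fold_ops(double *out, int n_out, double (*curve)(double),
                         const double *custom, int n_custom)
    {
            int i;

            for (i = 0; i < n_out; i++)
                    out[i] = lut_sample(custom, n_custom,
                                        curve((double)i / (n_out - 1)));
    }
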
>> Actually, the current examples in the proposal don't include a multiplier color
>> op, which might be useful. For AMD as above, but also for NVIDIA as the
>> following issue arises:
>>
>> As discussed further below, the NVIDIA "degamma" LUT performs an implicit fixed
>> point to FP16 conversion. In that conversion, what fixed point 0xFFFFFFFF maps
>> to in floating point varies depending on the source content. If it's SDR
>> content, we want the max value in FP16 to be 1.0 (80 nits), subject to a
>> potential boost multiplier if we want SDR content to be brighter. If it's HDR PQ
>> content, we want the max value in FP16 to be 125.0 (10,000 nits). My assumption
>> is that this is also what AMD's "HDR Multiplier" stage is used for, is that
>> correct?
> 
> It would be against the UAPI design principles to tag content as HDR or
> SDR. What you can do instead is to expose a colorop with a multiplier of
> 1.0 or 125.0 to match your hardware behaviour, then tell your hardware
> that the input is SDR or HDR to get the expected multiplier. You will
> never know what the content actually is, anyway.
> 
> Of course, if we want to have an arbitrary multiplier colorop that is
> somewhat standard, as in, exposed by many drivers to ease userspace
> development, you can certainly use any combination of your hardware
> features you need to realize the UAPI prescribed mathematical operation.
> 
> Since we are talking about floating-point in hardware, a multiplier
> does not significantly affect precision.
> 
> In order to mathematically define all colorops, I believe it is
> necessary to define all colorops in terms of floating-point values (as
> in math), even if they operate on fixed-point or integer. By this I
> mean that if the input is 8 bpc unsigned integer pixel format for
> instance, 0 raw pixel channel value is mapped to 0.0 and 255 is mapped
> to 1.0, and the color pipeline starts with [0.0, 1.0], not [0, 255]
> domain. We have to agree on this mapping for all channels on all pixel
> formats. However, there is a "but" further below.
> 
> I also propose that quantization range is NOT considered in the raw
> value mapping, so that we can handle quantization range in colorops
> explicitly, allowing us to e.g. handle sub-blacks and super-whites when
> necessary. (These are currently impossible to represent in the legacy
> color properties, because everything is converted to full range and
> clipped before any color operations.)
> 

I pretty much agree with anything you say up to here. :)
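
In code form, the mapping I'm agreeing to might look like this (8 bpc
example, function names made up): raw code values map linearly onto
[0.0, 1.0] regardless of quantization range, and limited-range expansion
is then an explicit colorop, so sub-blacks and super-whites survive into
the pipeline.

    /* raw 8 bpc code value -> nominal floating point, for any content */
    static double nominal_from_u8(unsigned char raw)
    {
            return raw / 255.0;
    }

    /*
     * Hypothetical explicit colorop: expand limited-range video levels.
     * Values below 0.0 (sub-blacks) and above 1.0 (super-whites) are
     * intentionally not clipped.
     */
    static double full_from_limited(double v)
    {
            return (v * 255.0 - 16.0) / (235.0 - 16.0);
    }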

>> From the given enumerated curves, it's not clear how they would map to the
>> above. Should sRGB EOTF have a max FP16 value of 1.0, and the PQ EOTF a max FP16
>> value of 125.0? That may work, but it tends towards the "descriptive" notion of
>> assuming the source content, which may not be accurate in all cases. This is
>> also an issue for the custom 1D LUT, as the blob will need to be converted to
>> FP16 in order to populate our "degamma" LUT. What should the resulting max FP16
>> value be, given that we no longer have any hint as to the source content?
> 
> In my opinion, all finite non-negative transfer functions should
> operate with [0.0, 1.0] domain and [0.0, 1.0] range, and that includes
> all sRGB, power 2.2, and PQ curves.
> 

That wouldn't work with AMD HW, which encodes a PQ transfer function
with an output range of [0.0, 125.0]. I suggest making the range a part
of the named TF definition.
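
A rough sketch of what I mean, with made-up enum names: the nominal
output range becomes part of each named TF, and something like our HDR
multiplier is then just how the driver realizes the 125.0 variant
internally.

    enum drm_colorop_curve_1d_type {
            /* maps [0.0, 1.0] to [0.0, 1.0] */
            DRM_COLOROP_1D_CURVE_SRGB_EOTF,
            /* maps [0.0, 1.0] to [0.0, 125.0];
             * 1.0 out = 80 cd/m², 125.0 out = 10,000 cd/m² */
            DRM_COLOROP_1D_CURVE_PQ_125_EOTF,
    };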

> If we look at BT.2100, there is no such encoding even mentioned where
> 125.0 would correspond to 10k cd/m². That 125.0 convention already has
> a built-in assumption what the color spaces are and what the conversion
> is aiming to do. IOW, I would say that choice is opinionated from the
> start. The multiplier in BT.2100 is always 10000.
> 

Sure, the choice is opinionated, but a certain large OS vendor has had
a large influence on how HW vendors designed their color pipelines.

snip

>> On recent hardware, the NVIDIA pre-blending pipeline includes LUTs that do
>> implicit fixed-point to FP16 conversions, and vice versa.
> 
> Above, I claimed that the UAPI should be defined in nominal
> floating-point values, but I wonder, would that work? Would we need to
> have explicit colorops for converting from raw pixel data values into
> nominal floating-point in the UAPI?
> 

I think it's important that we keep a level of abstraction at the driver
level. I'm not sure it would serve anyone if we defined this.

snip

>>>>> I think we also need a definition of "informational".
>>>>>
>>>>> Counter-example 1: a colorop that represents a non-configurable    
>>>>
>>>> Not sure what's "counter" about these examples?
>>>>   
>>>>> YUV<->RGB conversion. Maybe it determines its operation from FB pixel
>>>>> format. It cannot be set to bypass, it cannot be configured, and it
>>>>> will alter color values.  
>>
>> Would it be reasonable to expose this as a 3x4 matrix with a read-only blob and
>> no BYPASS property? I already brought up a similar idea at the XDC HDR Workshop
>> based on the principle that read-only blobs could be used to express some static
>> pipeline elements without the need to define a new type, but got mixed opinions.
>> I think this demonstrates the principle further, as clients could detect this
>> programmatically instead of having to special-case the informational element.
> 
> If the blob depends on the pixel format (i.e. the driver automatically
> chooses a different blob per pixel format), then I think we would need
> to expose all the blobs and how they correspond to pixel formats.
> Otherwise ok, I guess.
> 
> However, do we want or need to make a color pipeline or colorop
> conditional on pixel formats? For example, if you use a YUV 4:2:0 type
> of pixel format, then you must use this pipeline and not any other. Or
> floating-point type of pixel format. I did not anticipate this before,
> I assumed that all color pipelines and colorops are independent of the
> framebuffer pixel format. A specific colorop might have a property that
> needs to agree with the framebuffer pixel format, but I didn't expect
> further limitations.
> 

Yes, I think we'll want that.
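
In the same notation as the examples in the doc, such an informational
element might look like this (illustrative only), with a read-only blob
and no BYPASS property:

    /* fixed, non-bypassable YUV->RGB conversion */
    Color operation 44
    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 3x4 matrix
    ├─ "MATRIX_3_4": immutable blob
    └─ "NEXT": immutable color operation ID = 45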

> "Without the need to define a new type" is something I think we need to
> consider case by case. I have a hard time giving a general opinion.
> 
>>>>>
>>>>> Counter-example 2: image size scaling colorop. It might not be
>>>>> configurable, it is controlled by the plane CRTC_* and SRC_*
>>>>> properties. You still need to understand what it does, so you can
>>>>> arrange the scaling to work correctly. (Do not want to scale an image
>>>>> with PQ-encoded values as Josh demonstrated in XDC.)
>>>>>     
>>>>
>>>> IMO the position of the scaling operation is the thing that's important
>>>> here as the color pipeline won't define scaling properties.  
>>
>> I agree that blending should ideally be done in linear space, and I remember
>> that from Josh's presentation at XDC, but I don't recall the same being said for
>> scaling. In fact, the NVIDIA pre-blending scaler exists in a stage of the
>> pipeline that is meant to be in PQ space (more on this below), and that was
>> found to achieve better results at HDR/SDR boundaries. Of course, this only
>> bolsters the argument that it would be helpful to have an informational "scaler"
>> element to understand at which stage scaling takes place.
> 
> Both blending and scaling are fundamentally the same operation: you
> have two or more source colors (pixels), and you want to compute a
> weighted average of them following what happens in nature, that is,
> physics, as that is what humans are used to.
> 
> Both blending and scaling will suffer from the same problems if the
> operation is performed on not light-linear values. The result of the
> weighted average does not correspond to physics.
> 
> The problem may be hard to observe with natural imagery, but Josh's
> example shows it very clearly. Maybe that effect is sometimes useful
> for some imagery in some use cases, but it is still an accidental
> side-effect. You might get even better results if you don't rely on
> accidental side-effects but design a separate operation for the exact
> goal you have.
> 

Many people have looked at this problem inside AMD, and probably at
other companies, and not all of them arrived at the same conclusion. The
type of image will also greatly affect what one considers better.

But it sounds like we'll need an informational scaling element at least
for compositors that care. Do we need that as a first iteration of a
working DRM/KMS solution, though? So far other OSes have not cared and
people have (probably) not complained about it.
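
If we do add one, it could be minimal, an element that carries no
configuration and only conveys its position in the pipeline (again in
the doc's notation, illustrative only):

    /* informational scaler; configured via plane SRC_* and CRTC_* */
    Color operation 46
    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, scaler, etc.} = scaler
    └─ "NEXT": immutable color operation ID = 47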

snip

>> Despite being programmable, the LUTs are updated in a manner that is less
>> efficient than, e.g., the non-static "degamma" LUT. Would it be helpful
>> if there was some way to tag operations according to their performance,
>> for example so that clients can prefer a high performance one when they
>> intend to do an animated transition? I recall from the XDC HDR workshop
>> that this is also an issue with AMD's 3DLUT, where updates can be too
>> slow to animate.
> 
> I can certainly see such information being useful, but then we need to
> somehow quantize the performance.
> 
> What I was left puzzled about after the XDC workshop is whether it is
> possible to pre-load configurations in the background (slow), and then
> quickly switch between them. Hardware-wise, I mean.
> 

On AMD HW, yes. How to fit that into the atomic API is a separate
question. :D
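
Purely speculative, but one conceivable shape (property names made up):
expose pre-loadable banks as separate blob properties that can be
programmed ahead of time, plus a cheap selector property that an atomic
commit flips:

    Color operation 48
    ├─ "TYPE": immutable enum {1D enumerated curve, 1D LUT, 3x3 matrix, 3x4 matrix, 3D LUT, etc.} = 3D LUT
    ├─ "BYPASS": bool {true, false}
    ├─ "3DLUT_BANK_0": blob
    ├─ "3DLUT_BANK_1": blob
    ├─ "ACTIVE_BANK": enum {0, 1}
    └─ "NEXT": immutable color operation ID = 49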

Harry

> 
> Thanks,
> pq


