[RFC] drm/amdgpu: Add macros and documentation for format modifiers.

Tue Sep 4 19:28:27 UTC 2018

On Tue, Sep 4, 2018 at 8:31 PM, Bas Nieuwenhuizen
<bas at basnieuwenhuizen.nl> wrote:
> On Tue, Sep 4, 2018 at 8:27 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>>
>> On Tue, Sep 4, 2018 at 7:57 PM, Bas Nieuwenhuizen
>> <bas at basnieuwenhuizen.nl> wrote:
>> > On Tue, Sep 4, 2018 at 7:48 PM Christian König
>> > <ckoenig.leichtzumerken at gmail.com> wrote:
>> >>
>> >> Am 04.09.2018 um 18:37 schrieb Daniel Vetter:
>> >> > On Tue, Sep 4, 2018 at 5:52 PM, Bas Nieuwenhuizen
>> >> > <bas at basnieuwenhuizen.nl> wrote:
>> >> >> On Tue, Sep 4, 2018 at 4:43 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>> >> >>> On Tue, Sep 4, 2018 at 3:33 PM, Bas Nieuwenhuizen
>> >> >>> <bas at basnieuwenhuizen.nl> wrote:
>> >> >>>> On Tue, Sep 4, 2018 at 3:04 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>> >> >>>>> On Tue, Sep 04, 2018 at 02:33:02PM +0200, Bas Nieuwenhuizen wrote:
>> >> >>>>>> On Tue, Sep 4, 2018 at 2:26 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>> >> >>>>>>> On Tue, Sep 04, 2018 at 12:44:19PM +0200, Christian König wrote:
>> >> >>>>>>>> Am 04.09.2018 um 12:15 schrieb Daniel Stone:
>> >> >>>>>>>>> Hi,
>> >> >>>>>>>>>
>> >> >>>>>>>>> On Tue, 4 Sep 2018 at 11:05, Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
>> >> >>>>>>>>>> On Tue, Sep 4, 2018 at 3:00 AM, Bas Nieuwenhuizen <bas at basnieuwenhuizen.nl> wrote:
>> >> >>>>>>>>>>> +/* The chip this is compatible with.
>> >> >>>>>>>>>>> + *
>> >> >>>>>>>>>>> + * If compression is disabled, use
>> >> >>>>>>>>>>> + *   - AMDGPU_CHIP_TAHITI for GFX6-GFX8
>> >> >>>>>>>>>>> + *   - AMDGPU_CHIP_VEGA10 for GFX9+
>> >> >>>>>>>>>>> + *
>> >> >>>>>>>>>>> + * With compression enabled please use the exact chip.
>> >> >>>>>>>>>>> + *
>> >> >>>>>>>>>>> + * TODO: Do some generations share DCC format?
>> >> >>>>>>>>>>> + */
>> >> >>>>>>>>>>> +#define AMDGPU_MODIFIER_CHIP_GEN_SHIFT                 40
>> >> >>>>>>>>>>> +#define AMDGPU_MODIFIER_CHIP_GEN_MASK                  0xff
>> >> >>>>>>>>>> Do you really need all the combinations here of DCC + gpu gen + tiling
>> >> >>>>>>>>>> details? When we had the entire discussion with nvidia folks they
>> >> >>>>>>>>>> eventually agreed that they don't need the massive pile with every
>> >> >>>>>>>>>> possible combination. Do you really plan to share all these different
>> >> >>>>>>>>>> things?
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> Note that e.g. on i915 we spec some of the tiling depending upon
>> >> >>>>>>>>>> buffer size and buffer format (because that's how the hw works), not
>> >> >>>>>>>>>> using explicit modifier flags for everything.
>> >> >>>>>>>>> Right. The conclusion, after people went through and started sorting
>> >> >>>>>>>>> out the kinds of formats for which they would _actually_ export real
>> >> >>>>>>>>> colour buffers for, that most vendors definitely have fewer than
>> >> >>>>>>>>> 115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
>> >> >>>>>>>>> possible formats to represent, very likely fewer than
>> >> >>>>>>>>> 340,282,366,920,938,463,463,374,607,431,768,211,456 formats, probably
>> >> >>>>>>>>> fewer than 72,057,594,037,927,936 formats, and even still generally
>> >> >>>>>>>>> fewer than 281,474,976,710,656 if you want to be generous and leave 8
>> >> >>>>>>>>> bits of the 56 available.
>> >> >>>>>>>> The problem here is that at least for some parameters we actually don't know
>> >> >>>>>>>> which formats are actually used.
>> >> >>>>>>>>
>> >> >>>>>>>> The following are not real world examples, but just to explain the general
>> >> >>>>>>>> problem.
>> >> >>>>>>>>
>> >> >>>>>>>> The memory configuration for example can be not ASIC specific, but rather
>> >> >>>>>>>> determined by whoever took the ASIC and glued it together with VRAM on a
>> >> >>>>>>>> board. It is not likely that somebody puts all the VRAM chips on one
>> >> >>>>>>>> channel, but it is still perfectly possible.
>> >> >>>>>>>>
>> >> >>>>>>>> Same is true for things like harvesting, e.g. of 16 channels halve of them
>> >> >>>>>>>> could be bad and we need to know which to actually use.
>> >> >>>>>>> For my understanding: This leaks outside the chip when sharing buffers?
>> >> >>>>>>> All the information you only need locally to a given amdgpu instance
>> >> >>>>>>> don't really need to be encoded in modifiers.
>> >> >>>>>>>
>> >> >>>>>>> Pointers to code where this is all decided (kernel and radeonsi would be
>> >> >>>>>>> good starters I guess) would be really good here.
>> >> >>>>>> I extracted the information on which bits are relevant mostly from the
>> >> >>>>>> AddrFromCoord functions in addrlib in mesa:
>> >> >>>>>>
>> >> >>>>>> for macro-tiles:
>> >> >>>>>> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/addrlib/r800/egbaddrlib.cpp#L1587
>> >> >>>>>>
>> >> >>>>>> for micro-tiles (or the micro-tiles in macro-tiles):
>> >> >>>>>>
>> >> >>>>>> https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/addrlib/core/addrlib1.cpp#L3016
>> >> >>>>> So this is the decoding thing. How many of these actually exist, even when
>> >> >>>>> taking all the other information into account?
>> >> >>>>>
>> >> >>>>> E.g. given a platform + memory config (seems needed) + drm_fourcc + stride
>> >> >>>>> + height + width, how much of all these bits do you actually still freely
>> >> >>>>> pick?
>> >> >>>> Basically you pick ARRAY_MODE (linear, micro-tile, macro-tile, sparse,
>> >> >>>> thick variants of macro-tile), MICRO_TILE_MODE(display, non-display,
>> >> >>>> depth, display-rotated) + whether to use compression, everything else
>> >> >>>> is fixed given those option, the properties of the chip and the
>> >> >>>> format.
>> >> >>>>
>> >> >>>>
>> >> >>>>> It might be that all the things you need to know from the memory config
>> >> >>>>> don't encode smaller than the macro/micro/whatever else stuff. But that's
>> >> >>>>> kinda the angle that we looked at this for everyone else.
>> >> >>>>>
>> >> >>>>> E.g. for multi-plane stuff, if everyone picks the same config for the
>> >> >>>>> 2nd/3rd plane, then you don't actually need to encode that. It just
>> >> >>>>> becomes part of the implied stuff in the modifier.
>> >> >>>> The problem is some GPUs are compatible for say 8-bpp images, but not
>> >> >>>> for 32-bpp surfaces. e.g. lets look at the following table showing the
>> >> >>>> current configuration for all GFX6-GFX8 GPU:
>> >> >>>>
>> >> >>>> format: (bank width, bank height, macro tile aspect, num banks) for
>> >> >>>> 8-bpp, 16-bpp and 32 bpp single-sample followed by the PIPE_CONFIG
>> >> >>>>
>> >> >>>> verde: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P4_8x16
>> >> >>>> oland: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P4_8x16
>> >> >>>> hainan: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 2, 16) ADDR_SURF_P2
>> >> >>>> tahiti/pitcairn: (1, 4, 1, 16) (1, 2, 1, 16) (1, 1, 1, 16)
>> >> >>>> ADDR_SURF_P8_32x32_8x16
>> >> >>>> bonaire: (1,4,4,16) (1, 2, 4, 16) (1, 1, 2, 16) ADDR_SURF_P4_16x16
>> >> >>>> hawaii: (1, 4, 2, 16) (1, 2, 2, 16) (1, 1, 1, 16) ADDR_SURF_P16_32x32_16x16
>> >> >>>> CIK APUs: (1,4,4, 8), (1,2,4,8),  (1, 2, 2, 8) ADDR_SURF_P2
>> >> >>>> topaz: (4, 4, 2, 8) (4, 4, 2, 8) (2, 4, 2, 8) ADDR_SURF_P2
>> >> >>>> fiji: (1, 4, 2, 8) (1, 4, 2, 8) (1, 4, 2, 8) ADDR_SURF_P16_32x32_16x16
>> >> >>>> tonga:: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P8_32x32_16x16
>> >> >>>> polaris11/12: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P4_16x16
>> >> >>>> polaris10: (1, 4, 4, 16), (1, 4, 4, 16) (1, 4, 4, 16) ADDR_SURF_P8_32x32_16x16
>> >> >>>> stoney,carrizo: (1, 4, 4, 8) (1, 2, 4, 8), (1, 1, 2, 8) ADDR_SURF_P2
>> >> >>>>
>> >> >>>> We see here that e.g. Stoney and the CIK APUs are compatible for 8-bpp
>> >> >>>> and 16-bpp, but not 32-bpp. or bonaire and polaris11/12 are only
>> >> >>>> compatible for 8-bpp.
>> >> >>>>
>> >> >>>> So we can't just assume that if the first plane properties match that
>> >> >>>> they do so for the second plane, because we don't know what GPU it is
>> >> >>>> coming from.
>> >> >>>>
>> >> >>>> And we can make a canonical table from this but then if we change the
>> >> >>>> above tables in the kernel that runs into compatibility issues.
>> >> >>>
>> >> >>> I think that's roughly what nvidia ended up doing (not sure they
>> >> >>> published anything, haven't seen any patches on dri-devel). Two parts:
>> >> >>> - Bunch of flags/bits for the major mode, like {display, non-display,
>> >> >>> DCC} and so on.
>> >> >>> - For each of the major modes a list of enumerated modes actually used
>> >> >>> in reality. For DCC this would be the generation, for more plain
>> >> >>> formats it would be the above table. Table entries would be per-bpp
>> >> >>> (in case that matters, or whatever matters for amd gpus).
>> >> >>>
>> >> >>> Imo would be totally ok to put the relevant lookup tables into
>> >> >>> drm_fourcc.h directly. Or any other suitable place.
>> >> >> Right, the issue would be that the tables are configured int he kernel
>> >> >>
>> >> >> https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/gfx_v6_0.c?h=amd-staging-drm-next#n417
>> >> >>
>> >> >> and having a shared hardcoded table means that if the kernel tables
>> >> >> change we break userspace. That would be a new commitment on the side
>> >> >> of the kernel driver.
>> >> >>
>> >> >> (If I can get an okay for that commitment from Alex/Christian I'd be
>> >> >> totally happy to switch it over)
>> >> > Yeah it'd make that table uapi
>> >>
>> >> That won't work. Those lists are not necessary stable.
>> >>
>> >> What we could do is to use PCI-ID plus tile_mode/macro_tile_mode index
>> >> plus a 8bit version number.
>> >
>> > But if we use PCI-ID, then we cannot really share between different
>> > models of GPU ...
>> >
>> > So we have seen 3 options on design level:
>> > 1) Use PCI-ID/chip.
>> >   Advantage:
>> >     -simple
>> >     - Few bits
>> >   Disadvantage:
>> >     - No sharing between different GPU models
>> > 2) Use a fixed lookup/canonicalization table.
>> >   Advantage:
>> >     -Few bits
>> >     - Sharing between models
>> >   Disadvantage:
>> >     - Cannot change the kernel tiling tables.
>>
>> Why is changing the kernel table impossible? Just because it's uapi
>> doesn't mean we can never extend it. Just add a new entry at the
>> bottom if you need more combinations ...
>>
>> You don't even need to require the updated kernel amdgpu.ko, just the
>> updated drm_fourcc.h header (if it's one of these things you can
>> configure through cs packets entirely from userspace).
>
> So we have two tables we are talking about:
>  a) The kernel initialized system wide table of tiling configurations
> that already exists
>  b)The uapi table
>
> Currently mesa selects by index, i.e. if we want a 32-bit surface with
> macro-tiling that is not displayable we should select index 14 in the
> system wide tiling table. Now if we change that table entry 14 that
> means that we change which things it is compatible with, so we need a
> new entry in the uapi table. However, an old mesa does not know about
> the new entry in the uapi table, so loses the ability to select a
> modifier for the preferred tiling mode, e.g. we pretty much regress
> old userspace.

That sounds like this table already _is_ uapi. So you can't change it
... why would you want to change it then?

Note also that the fourcc table doesn't really have to match this
by-index tiling mode uapi table, you can freely add more at the end or
have any kind of fancy lookup table (one for everyone, or
per-generation, or per-whatever, doesn't matter). Just a tradeoff
between compressing the set of actually used layouts into as few
modifier bits as possible against code maintenance headaches of having
to deal with too many lookup tables for all the different things.
-Daniel

>
>>
>> Or why does this not work?
>> -Daniel
>>
>> > 3) Encode tiling mode properties in modifier.
>> >   Advantage:
>> >     - Can share between models.
>> >     - Can change the kernel tiling tables.
>> >   Disadvantage:
>> >     - Use a lot of bits.
>> >
>> > Overall the number of exposed modifiers should be pretty much the
>> > same, so encoding size that we have to deal with all possible bit
>> > patterns.
>> >
>> > I've implemented option 3 in the patch here because I think it is the
>> > best tradeoff. As far as I can see it fits, and especially if we do no
>> > multisample support and remove the tile split bytes, and remove the
>> > 3rd set of plane macro-tile options because no format has 3 different
>> > bitdepths for the 3 planes, we are left with using 37 bits, leaving 19
>> > for future extensions and a version number. It is not great but pretty
>> > doable?
>> >
>> >>
>> >> Shouldn't be more than 32 bits in total.
>> >>
>> >> Regards,
>> >> Christian.
>> >>
>> >> >   - well the compressed from with
>> >> > duplicates removed, if you really want to make it tiny. But you can
>> >> > just add new entries if the heuristics for a given platform changes,
>> >> > as long as you keep the old entries working (if you shipped them ever
>> >> > ofc, otherwise totally ok to change as you see fit).
>> >> > -Daniel
>> >> >
>> >> >>>  From a quick look about 16 bits total. Gives you 40bits of second
>> >> >>> chances and "oops totally sorry".
>> >> >> I purposefully left 8 bits as a version field for those oopses.
>> >> >>
>> >> >>> -Daniel
>> >> >>>
>> >> >>>>> -Daniel
>> >> >>>>>
>> >> >>>>>>> -Daniel
>> >> >>>>>>>
>> >> >>>>>>>> Regards,
>> >> >>>>>>>> Christian.
>> >> >>>>>>>>
>> >> >>>>>>>>> If you do use 256 bits in order to represent
>> >> >>>>>>>>> 115,792,089,237,316,195,423,570,985,008,687,907,853,269,984,665,640,564,039,457,584,007,913,129,639,936
>> >> >>>>>>>>> modifiers per format, userspace would start hitting OOM pretty quickly
>> >> >>>>>>>>> as it attempted to enumerate and negotiate acceptable modifiers.
>> >> >>>>>>>>> Either that or we need to replace the fixed 64-bit modifier tokens
>> >> >>>>>>>>> with some kind of eBPF script.
>> >> >>>>>>>>>
>> >> >>>>>>>>> Cheers,
>> >> >>>>>>>>> Daniel
>> >> >>>>>>>>> _______________________________________________
>> >> >>>>>>>>> dri-devel mailing list
>> >> >>>>>>>>> dri-devel at lists.freedesktop.org
>> >> >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>> >> >>>>>>> --
>> >> >>>>>>> Daniel Vetter
>> >> >>>>>>> Software Engineer, Intel Corporation
>> >> >>>>>>> http://blog.ffwll.ch
>> >> >>>>>>> _______________________________________________
>> >> >>>>>>> dri-devel mailing list
>> >> >>>>>>> dri-devel at lists.freedesktop.org
>> >> >>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>> >> >>>>> --
>> >> >>>>> Daniel Vetter
>> >> >>>>> Software Engineer, Intel Corporation
>> >> >>>>> http://blog.ffwll.ch
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Daniel Vetter
>> >> >>> Software Engineer, Intel Corporation
>> >> >>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>> >> >
>> >> >
>> >>
>>
>>
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch