Mesa (main): docs/isl: Add detailed documentation about CCS compression

Fri Jun 18 13:29:58 UTC 2021

Module: Mesa
Branch: main
Commit: b97dedd365fbd8c2e62e0fecc89d01cfc38eb0e6
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=b97dedd365fbd8c2e62e0fecc89d01cfc38eb0e6

Author: Jason Ekstrand <jason at jlekstrand.net>
Date:   Tue Jun 15 16:57:25 2021 -0500

docs/isl: Add detailed documentation about CCS compression

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11366>

---

 docs/isl/ccs.rst   | 171 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 docs/isl/index.rst |   1 +
 2 files changed, 172 insertions(+)

diff --git a/docs/isl/ccs.rst b/docs/isl/ccs.rst
new file mode 100644
index 00000000000..37797705cc9
--- /dev/null
+++ b/docs/isl/ccs.rst
@@ -0,0 +1,171 @@
+Single-sampled Color Compression
+================================
+
+Starting with Ivy Bridge, Intel graphics hardware provides a form of color
+compression for single-sampled surfaces.  In its initial form, this provided an
+acceleration of render target clear operations that, in the common case, allows
+you to avoid almost all of the bandwidth of a full-surface clear operation.  On
+Sky Lake, single-sampled color compression was extended to allow for the
+compression of color values from actual rendering and not just the initial
+clear.  From here on, the older Ivy Bridge form of color compression will be
+called "fast-clears" and the term "color compression" will be reserved for the
+more powerful Sky Lake form.
+
+The documentation for Ivy Bridge through Broadwell overloads the term MCS for
+referring both to the *multisample control surface* used for multisample
+compression and the control surface used for fast-clears. In ISL, the
+:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_MCS` enum always refers to
+multisample color compression while the
+:cpp:enumerator:`isl_aux_usage::ISL_AUX_USAGE_CCS_` enums always refer to
+single-sampled color compression. Throughout this chapter and the rest of the
+ISL documentation, we will use the term "color control surface", abbreviated
+CCS, to denote the control surface used for both fast-clears and color
+compression.  While this is still an overloaded term, Ivy Bridge fast-clears
+are much closer to Sky Lake color compression than they are to multisample
+compression.
+
+CCS data
+--------
+
+Fast clears and CCS are possibly the single most poorly documented aspect of
+surface layout/setup for Intel graphics hardware (with HiZ coming in a close
+second). All the documentation really says is that you can use an MCS buffer on
+single-sampled surfaces (we will call it the CCS in this case). It also
+provides some documentation on how to program the hardware to perform clear
+operations, but that's it.  How big is this buffer?  What does it contain?
+Those questions are left as exercises for the reader. Almost everything we know
+about the contents of the CCS is gleaned from reverse-engineering of the
+hardware.  The best bit of documentation we have ever had comes from the
+display section of the Sky Lake PRM Vol 12 section on planes (p. 159):
+
+    The Color Control Surface (CCS) contains the compression status of the
+    cache-line pairs. The compression state of the cache-line pair is
+    specified by 2 bits in the CCS.  Each CCS cache-line represents an area
+    on the main surface of 16x16 sets of 128 byte Y-tiled cache-line-pairs.
+    CCS is always Y tiled.
+
+While this is technically for color compression and not fast-clears, it
+provides a good bit of insight into how color compression and fast-clears
+operate.  Each cache-line pair in the main surface corresponds to 1 or 2 bits
+in the CCS.  The primary difference, as far as the current discussion is
+concerned, is that fast-clears use only 1 bit per cache-line pair whereas color
+compression uses 2 bits.
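+
+As a rough sanity check of these numbers, the following Python sketch computes
+how many bytes of CCS data are needed to cover a given amount of main-surface
+data at 1 or 2 bits per 128-byte cache-line pair.  It is not part of ISL: the
+helper name ``ccs_payload_bytes`` is made up for illustration, and the sketch
+deliberately ignores tiling and alignment padding, so a real CCS surface will
+generally be larger.
+

```python
PAIR_BYTES = 128  # one cache-line pair = two 64-byte cache lines


def ccs_payload_bytes(main_bytes: int, bits_per_pair: int) -> int:
    """Bytes of CCS data needed to cover main_bytes of main-surface data.

    Hypothetical helper: it counts only the raw status bits and ignores the
    tiling and alignment padding that a real CCS surface carries.
    """
    if bits_per_pair not in (1, 2):
        raise ValueError("fast-clears use 1 bit, color compression uses 2")
    pairs = -(-main_bytes // PAIR_BYTES)      # round up to whole pairs
    return -(-pairs * bits_per_pair // 8)     # round up to whole bytes


# The PRM quote: one CCS cache line covers 16x16 cache-line pairs at 2 bits.
pairs_per_ccs_line = 16 * 16
print(ccs_payload_bytes(pairs_per_ccs_line * PAIR_BYTES, 2))  # 64
```

+
+With 2 bits per pair, the 16x16 pairs from the PRM quote fill exactly one
+64-byte cache line, which is consistent with the quoted description.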
+
+What is a cache-line pair?  Both the X and Y tiling formats are arranged as an
+8x8 grid of cache lines.  (See the :doc:`chapter on tiling <tiling>` for more
+details.)  In either case, a cache-line pair is a pair of cache lines whose
+starting addresses differ by 512 bytes or 8 cache lines.  This results in the
+two cache lines being vertically adjacent when the main surface is X-tiled and
+horizontally adjacent when the main surface is Y-tiled.  For an X-tiled surface
+this forms an area of 64B x 2rows and for a Y-tiled surface this forms an area
+of 32B x 4rows.  In either case, it is guaranteed that, regardless of surface
+format, each 2x2 subspan coming out of a shader will land entirely within one
+cache-line pair.
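+
+The two footprints quoted above can be reproduced with a small Python sketch.
+It is only an illustration and makes assumptions that are not stated in this
+chapter: an X tile is taken to be 512 bytes wide by 8 rows stored row by row,
+and a Y tile is taken to be eight 16-byte-wide columns of 32 rows each.  All
+function names are hypothetical.
+

```python
CACHE_LINE = 64    # bytes
PAIR_STRIDE = 512  # byte distance between the two cache lines of a pair


def x_tile_pos(offset):
    """(byte column, row) of a byte offset in a 512 B x 8 row X tile."""
    row, col = divmod(offset, 512)
    return col, row


def y_tile_pos(offset):
    """(byte column, row) of a byte offset in a 128 B x 32 row Y tile.

    Assumes eight 16-byte-wide columns, each stored as 32 consecutive
    16-byte rows (512 bytes per column).
    """
    column, within = divmod(offset, 512)
    row, byte = divmod(within, 16)
    return column * 16 + byte, row


def pair_extent(pos_fn, first_line):
    """Width in bytes and height in rows covered by one cache-line pair."""
    offsets = [base + i
               for base in (first_line, first_line + PAIR_STRIDE)
               for i in range(CACHE_LINE)]
    xs, ys = zip(*(pos_fn(o) for o in offsets))
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1


print(pair_extent(x_tile_pos, 0))  # (64, 2)
print(pair_extent(y_tile_pos, 0))  # (32, 4)
```
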
+
+What is the correspondence between bits and cache-line pairs?  The best model I
+(Jason) know of is to consider the CCS as having a 1-bit color format for
+fast-clears and a 2-bit format for color compression and a special tiling
+format.  The CCS tiling formats operate on a 1 or 2-bit granularity rather than
+the byte granularity of most tiling formats.
+
+The following table represents the bit-layouts that yield the CCS tiling format
+on different hardware generations.  Bits 0-11 correspond to the regular swizzle
+of bytes within a 4KB page whereas the negative bits represent the address of
+the particular 1 or 2-bit portion of a byte. (Note: The Haswell data was
+gathered on a dual-channel system so bit-6 swizzling was enabled.  It's unclear
+how this affects the CCS layout.)
+
+============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
+ Generation   Tiling       11          10               9                 8           7           6           5           4           3           2           1           0          -1          -2          -3
+============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
+ Ivy Bridge   X or Y  :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_3` :math:`u_2` :math:`u_1` :math:`u_0`
+ Haswell        X     :math:`u_6` :math:`u_5` :math:`v_3 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
+ Haswell        Y     :math:`u_6` :math:`u_5` :math:`v_2 \oplus u_1` :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`v_1` :math:`v_0` :math:`u_4` :math:`u_3` :math:`u_2` :math:`u_0`
+ Broadwell      X     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`u_3` :math:`v_3` :math:`u_2` :math:`u_1` :math:`u_0` :math:`v_2` :math:`v_1` :math:`v_0`
+ Broadwell      Y     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_7` :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_2` :math:`v_3` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_1` :math:`v_0` :math:`u_0`
+ Sky Lake       Y     :math:`u_6` :math:`u_5`      :math:`u_4`       :math:`v_6` :math:`v_5` :math:`v_4` :math:`v_3` :math:`v_2` :math:`v_1` :math:`u_3` :math:`u_2` :math:`u_1` :math:`v_0` :math:`u_0`
+============ ======== =========== =========== ====================== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== =========== ===========
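+
+One way to read a row of this table is as a recipe for building an address one
+bit at a time.  The Python sketch below does this for the Sky Lake row only.
+It is an illustration rather than ISL code: the function name is hypothetical,
+and the way the two negative bits are turned into an element index within the
+byte is an assumption.
+

```python
# Sky Lake row of the table, from address bit 11 down to bit -2.
SKL_CCS_BITS = [
    ("u", 6), ("u", 5), ("u", 4), ("v", 6), ("v", 5), ("v", 4), ("v", 3),
    ("v", 2), ("v", 1), ("u", 3), ("u", 2), ("u", 1), ("v", 0), ("u", 0),
]
SUB_BYTE_BITS = 2  # bits -1 and -2 select one 2-bit element inside a byte


def skl_ccs_address(u, v):
    """Map element coordinates (u, v) inside one CCS tile to
    (byte offset within the 4 KiB tile, element index within that byte)."""
    if not (0 <= u < 128 and 0 <= v < 128):
        raise ValueError("u and v must each fit in 7 bits")
    coords = {"u": u, "v": v}
    address = 0
    for axis, bit in SKL_CCS_BITS:          # most significant bit first
        address = (address << 1) | ((coords[axis] >> bit) & 1)
    byte_offset = address >> SUB_BYTE_BITS
    element = address & ((1 << SUB_BYTE_BITS) - 1)
    return byte_offset, element


print(skl_ccs_address(1, 0))  # (0, 1): u_0 is the lowest address bit
print(skl_ccs_address(2, 0))  # (1, 0): u_1 is bit 0 of the byte address
```

+
+Under this reading, the low eight address bits hold :math:`u_0` through
+:math:`u_3` and :math:`v_0` through :math:`v_3`, so a 16x16 block of elements
+lands in a single 64-byte cache line, matching the PRM quote above.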
+
+CCS surface layout
+------------------
+
+Starting with Broadwell, fast-clears and color compression can be used on
+mipmapped and array surfaces.  When considered from a higher level, the CCS is
+laid out like any other surface.  The Broadwell and Sky Lake PRMs describe
+this as follows:
+
+Broadwell PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 676):
+
+    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
+    these alignments in the RT space: Horizontal Alignment = 256 and Vertical
+    Alignment = 128.
+
+Broadwell PRM Vol 2d, "RENDER_SURFACE_STATE" (p. 279):
+
+    For non-multisampled render target's auxiliary surface, MCS, QPitch must be
+    computed with Horizontal Alignment = 256 and Surface Vertical Alignment =
+    128. These alignments are only for MCS buffer and not for associated render
+    target.
+
+Sky Lake PRM Vol 7, "MCS Buffer for Render Target(s)" (p. 632):
+
+    Mip-mapped and arrayed surfaces are supported with MCS buffer layout with
+    these alignments in the RT space: Horizontal Alignment = 128 and Vertical
+    Alignment = 64.
+
+Sky Lake PRM Vol. 2d, "RENDER_SURFACE_STATE" (p. 435):
+
+    For non-multisampled render target's CCS auxiliary surface, QPitch must be
+    computed with Horizontal Alignment = 128 and Surface Vertical Alignment
+    = 256. These alignments are only for CCS buffer and not for associated
+    render target.
+
+Empirical evidence seems to confirm this.  On Sky Lake, the vertical alignment
+is always one cache line.  The horizontal alignment, however, varies by main
+surface format: 1 cache line for 32bpp, 2 for 64bpp and 4 cache lines for
+128bpp formats.  This nicely corresponds to the alignment of 128x64 pixels in
+the primary color surface.  The second PRM citation about Sky Lake CCS above
+gives a vertical alignment of 256 rather than 64.  With a little
+experimentation, this additional alignment appears to only apply to QPitch and
+not to the miplevels within a slice.
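+
+The Sky Lake arithmetic can be checked with a short Python sketch.  It assumes
+a Y-tiled main surface (so a cache-line pair is 32 bytes by 4 rows), an
+uncompressed format where one row is one row of pixels, and that one CCS cache
+line covers 16x16 cache-line pairs as in the PRM quote earlier.  The names are
+hypothetical and this is not how ISL itself computes alignments.
+

```python
PAIR_WIDTH_BYTES = 32    # Y-tiled cache-line pair: 32 B x 4 rows
PAIR_HEIGHT_ROWS = 4
PAIRS_PER_CCS_LINE = 16  # one CCS cache line covers 16x16 pairs

# Horizontal alignment in CCS cache lines, as observed on Sky Lake.
H_ALIGN_CCS_LINES = {32: 1, 64: 2, 128: 4}


def skl_ccs_alignment_px(bpp):
    """Main-surface alignment (width, height) in pixels for a given bpp."""
    pair_width_px = PAIR_WIDTH_BYTES * 8 // bpp
    line_width_px = PAIRS_PER_CCS_LINE * pair_width_px
    line_height_px = PAIRS_PER_CCS_LINE * PAIR_HEIGHT_ROWS
    return H_ALIGN_CCS_LINES[bpp] * line_width_px, line_height_px


for bpp in (32, 64, 128):
    print(bpp, skl_ccs_alignment_px(bpp))  # (128, 64) for every bpp
```

+
+The wider the pixel, the fewer pixels one CCS cache line spans horizontally,
+and the larger horizontal alignment compensates exactly, giving 128x64 pixels
+in all three cases.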
+
+On Broadwell, each miplevel in the CCS is aligned to a cache-line pair
+boundary: horizontal when the primary surface is X-tiled and vertical when
+Y-tiled. For a 32bpp format, this works out to an alignment of 256x128 main
+surface pixels regardless of X or Y tiling.  On Sky Lake, the alignment is
+a single cache line which works out to an alignment of 128x64 main surface
+pixels.
+
+TODO: More than just 32bpp formats on Broadwell!
+
+Once armed with the above alignment information, we can lay out the CCS surface
+itself.  The way ISL does CCS layout calculations is by a very careful and
+subtle application of its normal surface layout code.
+
+Above, we described the CCS data layout as a mapping of address bits. In
+ISL, this is represented by :cpp:enumerator:`isl_tiling::ISL_TILING_CCS`.  The
+logical and physical tile dimensions of this tiling correspond to the above
+mapping.
+
+We also have special :cpp:enum:`isl_format` enums for CCS.  These formats are 1
+bit-per-pixel on Ivy Bridge through Broadwell and 2 bits-per-pixel on Sky Lake
+and above to correspond to the 1 and 2-bit values represented in the CCS data.
+They have a block size (similar to a block compressed format such as BC or
+ASTC) which says what area (in surface elements) in the main surface is covered
+by a single CCS element (1 or 2-bit).  Because this depends on the main surface
+tiling and format, we have several different CCS formats.
+
+Once the appropriate :cpp:enum:`isl_format` has been selected, computing the
+size and layout of a CCS surface is as simple as passing the same surface
+creation parameters to :cpp:func:`isl_surf_init_s` as were used to create the
+primary surface, but with :cpp:enumerator:`isl_tiling::ISL_TILING_CCS` and the
+correct CCS format.  This not only results in a correctly sized surface but
+most other ISL helpers for things such as computing offsets into surfaces work
+correctly as well.
+
+CCS on Tiger Lake and above
+---------------------------
+
+Starting with Tiger Lake, CCS is no longer done via a surface and, instead, the
+term CCS gets overloaded once again (gotta love it!) to now refer to a form of
+universal compression which can be applied to almost any surface.  Nothing in
+this chapter applies to any hardware with a graphics IP version 12 or above.
diff --git a/docs/isl/index.rst b/docs/isl/index.rst
index 2d1714a5259..d91508d6689 100644
--- a/docs/isl/index.rst
+++ b/docs/isl/index.rst
@@ -12,6 +12,7 @@ Chery.
    units
    formats
    tiling
+   ccs
 
 The core representation of a surface in ISL is :cpp:struct:`isl_surf`.