[PATCH v13 1/2] drm/tegra: dc: Support memory bandwidth management

Wed Mar 3 23:08:27 UTC 2021

On Tue, Mar 02, 2021 at 03:44:44PM +0300, Dmitry Osipenko wrote:
> Display controller (DC) performs isochronous memory transfers, and thus,
> has a requirement for a minimum memory bandwidth that shall be fulfilled,
> otherwise framebuffer data can't be fetched fast enough and this results
> in a DC's data-FIFO underflow that follows by a visual corruption.
> 
> The Memory Controller drivers provide facility for memory bandwidth
> management via interconnect API. Let's wire up the interconnect API
> support to the DC driver in order to fix the distorted display output
> on T30 Ouya, T124 TK1 and other Tegra devices.

I did a read on the code. I have put some thoughts and nits inbetween the
diff, but here are more general questions about the patch:

Is there a reason why the bandwidth is allocated per plane instead of just
using one peak and average for the whole configuration? Or is there a need
to expose all the planes as interconnect channels and allocate their
bandwidth individually?

>From algorithmic part I see that the plane overlaps are calculated twice
(A vs B and later B vs A). The cursor plane is ignored, but nevertheless
its overlap mask is calculated before being thrown away. The bandwidths
are also calculated twice: once for pre-commit and then again for
post-commit.  Is setting bandwidth for an interconnect expensive when
re-setting a value that was already set? The code seems to avoid this
case, but maybe unnecessarily?

[...cut the big and interesting part...]

[...]
> @@ -65,7 +75,9 @@ struct tegra_dc_soc_info {
>  	unsigned int num_overlay_formats;
>  	const u64 *modifiers;
>  	bool has_win_a_without_filters;
> +	bool has_win_b_vfilter_mem_client;
>  	bool has_win_c_without_vert_filter;
> +	unsigned int plane_tiled_memory_bandwidth_x2;

This looks more like bool in the code using it.

[...]
> --- a/drivers/gpu/drm/tegra/plane.c
> +++ b/drivers/gpu/drm/tegra/plane.c
[...]
> +static int tegra_plane_check_memory_bandwidth(struct drm_plane_state *state)

The function's body looks more like 'update' or 'recalculate' rather
than 'check' the memory bandwidth.

> +	struct tegra_plane_state *tegra_state = to_tegra_plane_state(state);
> +	unsigned int i, bpp, bpp_plane, dst_w, src_w, src_h, mul;
> +	const struct tegra_dc_soc_info *soc;
> +	const struct drm_format_info *fmt;
> +	struct drm_crtc_state *crtc_state;
> +	u32 avg_bandwidth, peak_bandwidth;
> +
> +	if (!state->visible)
> +		return 0;
> +
> +	crtc_state = drm_atomic_get_new_crtc_state(state->state, state->crtc);
> +	if (!crtc_state)
> +		return -EINVAL;
> +
> +	src_w = drm_rect_width(&state->src) >> 16;
> +	src_h = drm_rect_height(&state->src) >> 16;
> +	dst_w = drm_rect_width(&state->dst);
> +
> +	fmt = state->fb->format;
> +	soc = to_tegra_dc(state->crtc)->soc;
> +
> +	/*
> +	 * Note that real memory bandwidth vary depending on format and
> +	 * memory layout, we are not taking that into account because small
> +	 * estimation error isn't important since bandwidth is rounded up
> +	 * anyway.
> +	 */
> +	for (i = 0, bpp = 0; i < fmt->num_planes; i++) {
> +		bpp_plane = fmt->cpp[i] * 8;

Nit: you could declare the bpp_plane here.

> +		/*
> +		 * Sub-sampling is relevant for chroma planes only and vertical
> +		 * readouts are not cached, hence only horizontal sub-sampling
> +		 * matters.
> +		 */
> +		if (i > 0)
> +			bpp_plane /= fmt->hsub;
> +
> +		bpp += bpp_plane;
> +	}
> +
> +	/*
> +	 * Horizontal downscale takes extra bandwidth which roughly depends
> +	 * on the scaled width.
> +	 */
> +	if (src_w > dst_w)
> +		mul = (src_w - dst_w) * bpp / 2048 + 1;
> +	else
> +		mul = 1;

Does it really need more bandwidth to scale down? Does it read the same
data multiple times just to throw it away?

> +	/* average bandwidth in bytes/s */
> +	avg_bandwidth  = src_w * src_h * bpp / 8 * mul;
> +	avg_bandwidth *= drm_mode_vrefresh(&crtc_state->mode);
> +
> +	/* mode.clock in kHz, peak bandwidth in kbit/s */
> +	peak_bandwidth = crtc_state->mode.clock * bpp * mul;

I would guess that (src_w * bpp / 8) needs rounding up unless the plane
is continuous. Or you could just add the max rounding error here and
get a safe overestimated value. Maybe this would be better done when
summing per-plane widths.

> +	/* ICC bandwidth in kbyte/s */
> +	peak_bandwidth = kbps_to_icc(peak_bandwidth);
> +	avg_bandwidth  = Bps_to_icc(avg_bandwidth);

This could be merged with the assignments after the following 'if' block.

> +	/*
> +	 * Tegra30/114 Memory Controller can't interleave DC memory requests
> +	 * and DC uses 16-bytes atom for the tiled windows, while DDR3 uses 32
> +	 * bytes atom. Hence there is x2 memory overfetch for tiled framebuffer
> +	 * and DDR3 on older SoCs.
> +	 */
> +	if (soc->plane_tiled_memory_bandwidth_x2 &&
> +	    tegra_state->tiling.mode == TEGRA_BO_TILING_MODE_TILED) {
> +		peak_bandwidth *= 2;
> +		avg_bandwidth *= 2;
> +	}
> +
> +	tegra_state->peak_memory_bandwidth = peak_bandwidth;
> +	tegra_state->avg_memory_bandwidth = avg_bandwidth;
> +
> +	return 0;
> +}

[...]
> +static const char * const tegra_plane_icc_names[] = {
> +	"wina", "winb", "winc", "", "", "", "cursor",
> +};
> +
> +int tegra_plane_interconnect_init(struct tegra_plane *plane)
> +{
> +	const char *icc_name = tegra_plane_icc_names[plane->index];

Is plane->index guaranteed to be <= 6? I would guess so, but maybe
BUILD_BUG_ON(sizeof(icc_names)==TEGRA_DC_LEGACY_PLANES_NUM) or some
other check could document this?

[...]

Best Regards
Michał Mirosław