[Mesa-dev] [PATCH v3 10/25] panfrost: Make sure the BO is 'ready' when picked from the cache

Thu Sep 5 20:43:23 UTC 2019

> +bool
> +panfrost_bo_wait(struct panfrost_bo *bo, int64_t timeout_ns)
> +{
> +        struct drm_panfrost_wait_bo req = {
> +                .handle = bo->gem_handle,
> +		.timeout_ns = timeout_ns,
> +        };
> +        int ret;
> +
> +        ret = drmIoctl(bo->screen->fd, DRM_IOCTL_PANFROST_WAIT_BO, &req);
> +        if (ret != -1)
> +                return true;
> +
> +        assert(errno == ETIMEDOUT || errno == EBUSY);
> +        return false;
> +}

I would appreciate a comment explaining what the return value of this
ioctl is. `ret != -1` and asserting an errno is... suspicious? Not
wrong, to my knowledge, but hard to decipher without context.

> +        /* Before creating a BO, we first want to check the cache but without
> +         * waiting for BO readiness (BOs in the cache can still be referenced
> +         * by jobs that are not finished yet).
> +         * If the cached allocation fails we fall back on fresh BO allocation,
> +         * and if that fails too, we try one more time to allocate from the
> +         * cache, but this time we accept to wait.
>           */

Conceptually:

We first try a ready BO from the cache. OK.

If that fails, there is no BO in the cache that is currently ready for
use; by definition of BO readiness, this is because another concurrent
job is using it. We then try to create a new BO. Suppose a given job
uses an average of `b` BOs. Then for `j` concurrent jobs, assuming all
of these allocations succeed, we have `j * b` BOs in the cache. This is
an unfortunate bump in memory usage but necessary for pipelining.

If that allocation fails, by definition of memory allocation failures,
we ran out of memory and cannot proceed with the allocation. Either:

 - The BO cache is responsible for this. In this case, continuing to use
   the BO cache (even with the waits) will just dig us deeper into the
   hole. Perhaps we should call bo_evict_all from userspace to handle
   the memory pressure? Or does madvise render this irrelevant?

 - The BO cache is not responsible for this. In this case, we could
   continue to use the BO cache, but then either:

   	- There is a BO we can wait for. Then waiting is okay.
	- There is not. Then that cache fetch fails and we kerplutz.
	  What now? If we need an allocation, cache or no cache, if the
	  kernel says no, no means no. What then?

In short, I'm not convinced this algorithm (specifically the last step)
is ideal.

If there is no memory left for us, is it responsible to continue at all?
Should we just fail the allocation after step 2, and if the caller has a
problem with that, it's their issue? Or we abort here after step 2? I
don't like the robustness implications but low memory behaviour is a
risky subject as it is; I don't want to add more unknowns into it --
aborting it with an assert(0) is something we can recognize immediately.
Strange crashes in random places with no explanation, less so.

CC'ing Rob to see if he has any advise re Panfrost madvise interactions
as well as general kernel OOM policy.