[Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30

Mon Jan 20 02:53:03 UTC 2020

13.12.2019 18:35, Dmitry Osipenko пишет:
> 13.12.2019 18:10, Thierry Reding пишет:
>> On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
>>> Hello Thierry,
>>>
>>> Commit [1] introduced a severe GPU performance regression on Tegra20 and
>>> Tegra30 using.
>>>
>>> [1]
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
>>>
>>> Interestingly the performance is okay on Tegra30 if
>>> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make difference for
>>> Tegra20.
>>>
>>> I was telling you about this problem on the #tegra IRC sometime ago and
>>> you asked to report it in a trackable form, so finally here it is.
>>>
>>> You could reproduce the problem by running [2] like this
>>> `grate/texture-filter -f -s` which should produce over 100 FPS for 720p
>>> display resolution and currently it's ~11 FPS.
>>>
>>> [2]
>>> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
>>>
>>> Previously I was seeing some memory errors coming from Host1x DMA, but
>>> don't see any errors at all right now.
>>>
>>> I don't see anything done horribly wrong in the offending commit.
>>>
>>> Unfortunately I couldn't dedicate enough time to sit down and debug the
>>> problem thoroughly yet. Please let me know if you'll find a solution,
>>> I'll be happy to test it. Thanks in advance!
>>
>> I suspect that the problem here is that we're now using the DMA API,
>> which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
>> that that code doesn't coalesce entries in the SG table, so we may end
>> up calling iommu_map() a lot of times, and miss out on much of the
>> advantages that the ->iotlb_sync_map() gives us on Tegra20.
>>
>> At the same time dma_map_sg() will flush caches, which we didn't do
>> before. This we should be able to improve by passing the attribute
>> DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
>> maintenance isn't needed.
>>
>> And while thinking about it, one other difference is that with the DMA
>> API we actually map/unmap the buffers for every submission. This is
>> because the DMA API semantics require that buffers be mapped/unmapped
>> every time you use them. Previously we would basically only map each
>> buffer once (at allocation time) and only have to deal with cache
>> maintenance, so the overhead per submission was drastically lower.
>>
>> If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
>> may want to restore explicit IOMMU usage, at least on anything prior to
>> Tegra124 where we're unlikely to ever use different IOMMU domains anyway
>> (because they are such a scarce resource).
> 
> Tegra20 doesn't use IOMMU in a vanilla upstream kernel (yet), so I don't
> think that it's the root of the problem. Disabling IOMMU for Tegra30
> also didn't help (IIRC).
> 
> The offending patch shouldn't change anything in regards to the DMA API,
> if I'm not missing something. Strange..
> 
> Please keep me up-to-date!
> 

Hello Thierry,

I took another look at the problem and here what was found:

1) The "Optionally attach clients to the IOMMU" patch is wrong because:

    1. host1x_drm_probe() is invoked *before* any of the
       host1x_client_iommu_attach() happens, so there is no way
       on earth the 'use_explicit_iommu' could ever be true.

    2. Not attaching DRM clients to IOMMU if HOST1x isn't
       attached is wrong because it never attached in the case
       of CONFIG_TEGRA_HOST1X_FIREWALL=y [1] and this also
       makes no sense for T20/30 that do not support LPAE.

[1]
https://elixir.bootlin.com/linux/v5.5-rc6/source/drivers/gpu/host1x/dev.c#L205

2) Because of the above problems, the DRM clients are erroneously not
getting attached to IOMMU at all and thus CMA is getting used for the BO
allocations. Here comes the problems introduced by the "gpu: host1x:
Support DMA mapping of buffers" patch, which makes DMA API to perform
CPU cache maintenance on each job submission and apparently this is
super bad for performance. This also makes no sense in comparison to the
case of enabled IOMMU, where cache maintenance isn't performed at all
(like it should be).

Please let me know if you're going to fix the problems or if you'd
prefer me to create the patches.

Here is a draft of the fix for #2, it doesn't cover case of imported
buffers (which should be statically mapped, IIUC):

@@ -38,7 +38,7 @@ static struct sg_table *tegra_bo_pin(struct device
*dev, struct host1x_bo *bo,
         * If we've manually mapped the buffer object through the IOMMU,
make
         * sure to return the IOVA address of our mapping.
         */
-       if (phys && obj->mm) {
+       if (phys && (obj->mm || obj->vaddr)) {
                *phys = obj->iova;
                return NULL;
        }

diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
index 25ca54de8fc5..69adfd66196b 100644
--- a/drivers/gpu/host1x/job.c
+++ b/drivers/gpu/host1x/job.c
@@ -108,7 +108,7 @@ static unsigned int pin_job(struct host1x *host,
struct host1x_job *job)

        for (i = 0; i < job->num_relocs; i++) {
                struct host1x_reloc *reloc = &job->relocs[i];
-               dma_addr_t phys_addr, *phys;
+               dma_addr_t phys_addr;
                struct sg_table *sgt;

                reloc->target.bo = host1x_bo_get(reloc->target.bo);
@@ -117,12 +117,7 @@ static unsigned int pin_job(struct host1x *host,
struct host1x_job *job)
                        goto unpin;
                }

-               if (client->group)
-                       phys = &phys_addr;
-               else
-                       phys = NULL;
-
-               sgt = host1x_bo_pin(dev, reloc->target.bo, phys);
+               sgt = host1x_bo_pin(dev, reloc->target.bo, &phys_addr);
                if (IS_ERR(sgt)) {
                        err = PTR_ERR(sgt);
                        goto unpin;
@@ -184,7 +179,7 @@ static unsigned int pin_job(struct host1x *host,
struct host1x_job *job)
                        goto unpin;
                }

-               sgt = host1x_bo_pin(host->dev, g->bo, NULL);
+               sgt = host1x_bo_pin(host->dev, g->bo, &phys_addr);
                if (IS_ERR(sgt)) {
                        err = PTR_ERR(sgt);
                        goto unpin;
@@ -214,7 +209,7 @@ static unsigned int pin_job(struct host1x *host,
struct host1x_job *job)

                        job->unpins[job->num_unpins].size = gather_size;
                        phys_addr = iova_dma_addr(&host->iova, alloc);
-               } else {
+               } else if (sgt) {
                        err = dma_map_sg(host->dev, sgt->sgl, sgt->nents,
                                         DMA_TO_DEVICE);
                        if (!err) {