[PATCH] drm/amdgpu: only use kernel zone if need_dma32 is not required
Felix.Kuehling at amd.com
Thu Jun 13 16:37:34 UTC 2019
On 2019-06-13 8:59, Yang, Philip wrote:
> On 2019-06-13 4:54 a.m., Koenig, Christian wrote:
>> Am 12.06.19 um 23:13 schrieb Yang, Philip:
>>> On 2019-06-12 3:28 p.m., Christian König wrote:
>>>> Am 12.06.19 um 17:13 schrieb Yang, Philip:
>>>>> TTM create two zones, kernel zone and dma32 zone for system memory. If
>>>>> system memory address allocated is below 4GB, this account to dma32 zone
>>>>> and will exhaust dma32 zone and trigger unnesssary TTM eviction.
>>>>> Patch "drm/ttm: Account for kernel allocations in kernel zone only" only
>>>>> handle the allocation for acc_size, the system memory page allocation is
>>>>> through ttm_mem_global_alloc_page which still account to dma32 zone if
>>>>> page is below 4GB.
>>>> NAK, as the name says the mem_glob is global for all devices in the system.
>>>> So this will break if you mix DMA32 and non DMA32 in the same system
>>>> which is exactly the configuration my laptop here has :(
>>> I didn't find path use dma32 zone, but I may missed something.
>> Well the point is there is non in our driver, but many drivers in the
>> system still need DMA32 memory.
>>> There is
>>> an issue found by KFDTest.BigBufStressTest, it allocates buffers up to
>>> 3/8 of total 256GB system memory, each buffer size is 128MB, then use
>>> queue to write to the buffers. If ttm_mem_global_alloc_page get page pfn
>>> is below 4GB, it account to dma32 zone and will exhaust 2GB limit, then
>>> ttm_check_swapping will schedule ttm_shrink_work to start eviction. It
>>> takes minutes to finish restore (retry many times if busy), the test
>>> failed because queue timeout. This eviction is unnecessary because we
>>> still have enough free system memory.
>> No that is definitely necessary. For example on my Laptop it's the sound
>> system which needs DMA32.
>> Without this when an application uses a lot of memory we run into the
>> OOM killer as soon as some audio starts playing.
> I did not realize TTM is used by other drivers. I agree we cannot remove
> dma32 zone, this will break other device drivers which depends on dma32
If I understand Christian correctly, the point is not that other drivers
use TTM, but other drivers need dma32 memory (memory with 32-bit
physical addresses). If TTM used up all that memory, it would break
those other drivers. As a good steward of memory, TTM limits its usage
of dma32 memory in order to allow other drivers to have access to it.
>>> It's random case, happens about 1/5. I can change test to increase the
>>> timeout value to workaround this, but this seems TTM bug. This will slow
>>> application performance a lot if this random issue happens.
>> One thing we could try is to dig into why the kernel gives us DMA32
>> pages when there are still other pages available. Please take a look at
>> /proc/buddyinfo on that box for this.
> Thanks for the advise, from buddyinfo, the machine has 4 nodes, each
> node has 64GB memory, and dma32 zone is on node 0. BigBufStressTest
> allocate 96GB. If I force the test on node 1, "numactl --cpunodebind=1
> ./kfdtest", no eviction at all. If I force the test on node 0, it always
> has eviction and restore because it used up all memory including dma32
> zone from node 0. I will change test app to avoid running on node 0 to
> workaround this.
>> The next thing coming to mind is that we can most likely completely skip
>> this if IOMMU is active.
>> The last thing of hand coming to my mind is to improve TTM to actually
>> only evict BOs which use DMA32 memory, cause currently we just evict
>> stuff randomly and not check if its DMA32 or other memory.
>>>>> Change-Id: I289b85d891b8f64a1422c42b1eab398098ab7ef7
>>>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 ++++
>>>>> 1 file changed, 4 insertions(+)
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index 2778ff63d97d..79bb9dfe617b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -1686,6 +1686,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>>> adev->mman.initialized = true;
>>>>> + /* Only kernel zone (no dma32 zone) if device does not require
>>>>> dma32 */
>>>>> + if (!adev->need_dma32)
>>>>> + adev->mman.bdev.glob->mem_glob->num_zones = 1;
>>>>> /* We opt to avoid OOM on system pages allocations */
>>>>> adev->mman.bdev.no_retry = true;
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
More information about the amd-gfx