<html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body text="#000000" bgcolor="#FFFFFF"> <p><br> </p> <br> <div class="moz-cite-prefix">On 12/29/2017 03:19 PM, Koenig, Christian wrote:<br> </div> <blockquote type="cite" cite="mid:f817cb0f-9c6f-472b-8bdc-025b34da595a@email.android.com"> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <div dir="auto">The difference is that the OOM killer doesn't know about the pages an application allocates through the driver. <div dir="auto"><br> </div> <div dir="auto">This leads to a bad decision about which process to kill.</div> <div dir="auto"><br> </div> <div dir="auto">I posted patches to fix this on the list a long time ago, but never found time to clean them up and push them upstream.</div> <div dir="auto"><br> </div> <div dir="auto">Andrey is now working on this, but I don't know the status offhand.</div> <div dir="auto"><br> </div> <div dir="auto">Christian.</div> </div> </blockquote> <br> I don't have any updates since the last status I sent on this (I'm on vacation currently), so I'll just reiterate that status here:<br> <br> "Worked a bit more on this. I did find a function to properly get the free swap space, but this solution does not work when evicting to swap a BO from the LRU list that is larger than the available RAM (as you predicted); it causes an OOM in the swapper work thread. As you can see from the attachment, shmem_read_mapping_page uses the default allocation policy for swap pages, while I think we should have used __GFP_RETRY_MAYFAIL here. The way to do that is to use shmem_read_mapping_page_gfp, which allows setting the GFP flags. 
<br> <br> In general I think that the approach you suggested (and the one I was advised on in the #mm channel) is the right one: to avoid the OOM killer we should not try to estimate free RAM or swap, since it is not reliable to assume that things will not change by the time we allocate. What we should do is allocate pages without triggering the OOM killer. I think I should try again to set all the system page allocation code paths we use to __GFP_RETRY_MAYFAIL and debug why it didn't work last time. One reason could be that I missed the swap page allocation; another is a possible memory leak when an allocation fails and all previously allocated pages for the BO are rolled back, which leads to OOM anyway."<br> <br> Thanks,<br> Andrey<br> <br> <blockquote type="cite" cite="mid:f817cb0f-9c6f-472b-8bdc-025b34da595a@email.android.com"> <div dir="auto"> </div> <div class="gmail_extra"><br> <div class="gmail_quote">On 29.12.2017 20:36, "Kuehling, Felix" <a class="moz-txt-link-rfc2396E" href="mailto:Felix.Kuehling@amd.com"><Felix.Kuehling@amd.com></a> wrote:<br type="attribution"> <blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <div><font size="2"><span style="font-size:11pt"> <div>Is it possible that the test is broken? A test that allocates memory to<br> exhaustion may well trigger the OOM killer. A test can do that by using<br> malloc. Why not by using the graphics driver? The OOM killer does what<br> it's supposed to do, and kills the broken application.<br> <br> As I understand it, this change adds artificial limitations to<br> work around a bug in a user mode test. However, it ends up limiting the<br> memory available for well-behaved applications more than necessary.<br> <br> For compute applications that work with huge data sets, we want to be<br> able to allocate lots of system memory. 
Tying available system memory to<br> the VRAM size makes no sense for compute applications that want to work<br> with such huge data sets.<br> <br> Regards,<br> Felix<br> <br> <br> On 2017-12-15 02:09 PM, Andrey Grodzovsky wrote:<br> > This reverts commit ba851eed895c76be0eb4260bdbeb7e26f9ccfaa2.<br> > With that change piglit max size tests (running with -t max.*size) are causing<br> > OOM and hard hang on my CZ with 1GB RAM.<br> ><br> > Signed-off-by: Andrey Grodzovsky <a class="moz-txt-link-rfc2396E" href="mailto:andrey.grodzovsky@amd.com"><andrey.grodzovsky@amd.com></a><br> > ---<br> > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 8 +++++---<br> > 1 file changed, 5 insertions(+), 3 deletions(-)<br> ><br> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c<br> > index c307a7d..814a9c1 100644<br> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c<br> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c<br> > @@ -1329,9 +1329,11 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)<br> > struct sysinfo si;<br> > <br> > si_meminfo(&si);<br> > - gtt_size = max(AMDGPU_DEFAULT_GTT_SIZE_MB << 20,<br> > - (uint64_t)si.totalram * si.mem_unit * 3/4);<br> > - } else<br> > + gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),<br> > + adev->mc.mc_vram_size),<br> > + ((uint64_t)si.totalram * si.mem_unit * 3/4));<br> > + }<br> > + else<br> > gtt_size = (uint64_t)amdgpu_gtt_size << 20;<br> > r = ttm_bo_init_mm(&adev->mman.bdev, TTM_PL_TT, gtt_size >> PAGE_SHIFT);<br> > if (r) {<br> <br> </div> </span></font></div> </blockquote> </div> <br> </div> </blockquote> <br> </body> </html>