[Mesa-dev] [PATCH] anv: don't crash when creating a huge image

Fri Nov 10 05:59:03 UTC 2017

On Thu, 2017-11-09 at 16:34 -0800, Jason Ekstrand wrote:
> On Thu, Nov 9, 2017 at 4:23 PM, Chad Versace <chadversary at chromium.or
> g>
> wrote:
> 
> > On Wed 08 Nov 2017, Jason Ekstrand wrote:
> > > On Wed, Nov 8, 2017 at 1:34 AM, Samuel Iglesias Gonsálvez <[1]
> > > siglesias at igalia.com> wrote:
> > > 
> > >     The HW has some limits but, according to the spec, we can
> > > create
> > >     the image as it has not yet any memory backing it. When we
> > > allocate
> > >     that memory, then we fail following what Vulkan spec, "10.2.
> > > Device
> > >     Memory" says when talking about vkAllocateMemory():
> > > 
> > >     "Some platforms may have a limit on the maximum size of a
> > > single
> > >      allocation. For example, certain systems may fail to create
> > >      allocations with a size greater than or equal to 4GB. Such a
> > > limit
> > 
> > is
> > >      implementation-dependent, and if such a failure occurs then
> > > the
> > 
> > error
> > >      VK_ERROR_OUT_OF_DEVICE_MEMORY must be returned."
> > > 
> > >     Fixes the crashes on BDW for the following tests:
> > > 
> > >     dEQP-VK.pipeline.render_to_image.core.2d_array.huge.*
> > >     dEQP-VK.pipeline.render_to_image.core.cube_array.huge.*
> > > 
> > >     Signed-off-by: Samuel Iglesias Gonsálvez <[2]siglesias at igalia
> > > .com>
> > >     ---
> > > 
> > >     Jason, I was tempted to move this piece of code to
> > 
> > anv_AllocateMemory()
> > >     but then I found the kernel relocation limitation of 32-
> > > bit... Is
> > 
> > that
> > >     limitation still applicable? Or was it from the BDW age and
> > > we forgot
> > >     to update that limitation for gen9+?
> > > 
> > > 
> > > We're still using relocations on all hardware so it applies to
> > > everything
> > > today.  One of my 2018 projects is to fix that and get rid of
> > 
> > relocations on
> > > gen8+.
> > > 
> > > 
> > >     Sam
> > > 
> > >      src/intel/isl/isl.c | 22 ----------------------
> > >      1 file changed, 22 deletions(-)
> > > 
> > >     diff --git a/src/intel/isl/isl.c b/src/intel/isl/isl.c
> > >     index 59f512fc050..aaadcbaf991 100644
> > >     --- a/src/intel/isl/isl.c
> > >     +++ b/src/intel/isl/isl.c
> > >     @@ -1472,28 +1472,6 @@ isl_surf_init_s(const struct
> > > isl_device *dev,
> > >            base_alignment = MAX(info->min_alignment, tile_size);
> > >         }
> > > 
> > >     -   if (ISL_DEV_GEN(dev) < 9) {
> > >     -      /* From the Broadwell PRM Vol 5, Surface Layout:
> > >     -       *
> > >     -       *    "In addition to restrictions on maximum height,
> > > width,
> > 
> > and
> > >     depth,
> > >     -       *     surfaces are also restricted to a maximum size
> > > in
> > 
> > bytes. This
> > >     -       *     maximum is 2 GB for all products and all
> > > surface
> > 
> > types."
> > >     -       *
> > >     -       * This comment is applicable to all Pre-gen9
> > > platforms.
> > >     -       */
> > >     -      if (size > (uint64_t) 1 << 31)
> > >     -         return false;
> > >     -   } else {
> > >     -      /* From the Skylake PRM Vol 5, Maximum Surface Size in
> > > Bytes:
> > >     -       *    "In addition to restrictions on maximum height,
> > > width,
> > 
> > and
> > >     depth,
> > >     -       *     surfaces are also restricted to a maximum size
> > > of 2^38
> > 
> > bytes.
> > >     -       *     All pixels within the surface must be contained
> > > within
> > 
> > 2^38
> > >     bytes
> > >     -       *     of the base address."
> > >     -       */
> > >     -      if (size > (uint64_t) 1 << 38)
> > >     -         return false;
> > >     -   }
> > 
> > I think it very unwise to delete code that enforces requirements
> > defined
> > by the hardware spec. Deleting the code doesn't make the hardware
> > requirements go away :)
> > 

The idea was to move that code to another place, hence my question out
of the commit log message :-)

> > > I'm not sure how I feel about removing this from ISL.  There are
> > > really
> > 
> > two
> > > limitations going on here.  One is a limitation imposed by
> > > relocations,
> > 
> > and the
> > > other is some sort of fundamental hardware surface size
> > > limitation.  Most
> > > likely, the surface size limitation has to do with how many bits
> > > they
> > 
> > use for
> > > image address computations in the hardware.  Most likely, on
> > > gen8, they
> > 
> > do all
> > > of the internal calculations in 32 bits and only convert to 48 at
> > > the
> > 
> > end when
> > > they need to add it to Surface Base Address.
> > > 
> > > If my understanding is correct then we will still have this
> > > limitation
> > 
> > on gen8
> > > even after we get rid of relocations and remove the BO size
> > > limitation.
> > 
> > I see
> > > a couple of options, neither of which I like very much:
> > > 
> > >  1) Take something like this patch and then keep the BO size
> > > limitation
> > 
> > on BDW
> > > to 2GiB when we get rid of relocations even though it's
> > > artificial.
> > >  2) "Gracefully" handle isl_surf_init failure by doing a
> > > debug_log and
> > > succeeding but setting the image size (that will be returned by
> > > vkGetImageMemoryRequirements) to UINT64_MAX so that the client
> > > won't
> > 
> > ever be
> > > able to find memory for it.
> > > 
> > > My feeling is that 1) above is probably the better of the two as
> > > 2)
> > 
> > seems to be
> > > a twisting of the spec.  That said, I would like to keep the
> > > restriction
> > 
> > in ISL
> > > somehow and we need to make sure it still gets applied in GL.
> > 
> > I dislike both. I originally designed isl to mimic the VkImage API,
> > so
> > let's continue that trend.
> > 
> >   Option 3) Change isl_surf_init() to return a meaningful result
> > code:
> > 
> >             ISL_SUCCESS = 0
> >             ISL_ERROR_SOMETHING_SOMETHING_THE_USUAL_FAILURES = -1
> >             ISL_ERROR_SURFACE_SIZE_TOO_LARGE = -2
> > 
> > I like option 3 because it avoids secret implicit contracts between
> > isl
> > and anvil, and thus avoids hidden hacks.
> > 
> 
> I mostly agree but that still doesn't answer the question of what do
> we do
> with that return code?  We can't fail the vkCreateImage because our
> only
> options there are the two OOM errors both of which are lies.  The
> only real
> option we have is to go ahead and create the VkImage but give it a
> size
> that doesn't fit in any allocatable memory object.  So, we're back to
> options 1 and 2...

I'm going to implement option 1) first and then we will decide if we
like it or we go to option 2).

Sam