[Mesa-dev] [PATCH] anv: implement pipeline statistics queries

Robert Bragg robert at sixbynine.org
Tue Jan 24 22:27:46 UTC 2017


Sorry for the delay responding here; some comments below...


On Tue, Jan 24, 2017 at 11:48 AM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
> 2-month ping. [ok, it hasn't been 2 months on the dot, but ... close.]
>
> On Tue, Jan 10, 2017 at 5:49 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>> ping.
>>
>> On Thu, Dec 22, 2016 at 11:14 AM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>> Ping? Any further comments/feedback/reviews?
>>>
>>>
>>> On Dec 5, 2016 11:22 AM, "Ilia Mirkin" <imirkin at alum.mit.edu> wrote:
>>>
>>> On Mon, Dec 5, 2016 at 11:11 AM, Robert Bragg <robert at sixbynine.org> wrote:
>>>>
>>>>
>>>> On Sun, Nov 27, 2016 at 7:23 PM, Ilia Mirkin <imirkin at alum.mit.edu> wrote:
>>>>>
>>>>> The strategy is to just keep n anv_query_pool_slot entries per query
>>>>> instead of one. The available bit is only valid in the last one.
>>>>>
>>>>> Signed-off-by: Ilia Mirkin <imirkin at alum.mit.edu>
>>>>> ---
>>>>>
>>>>> I think this is in a pretty good state now. I've tested both the direct
>>>>> and buffer paths with a hacked-up cube application, and I'm seeing
>>>>> non-ridiculous values for the various counters, although I haven't 100%
>>>>> verified them for accuracy.
>>>>>
>>>>> This also implements the hsw/bdw workaround for dividing frag invocations
>>>>> by 4, copied from hsw_queryobj. I tested this on SKL and it seems to
>>>>> divide the values as expected.
>>>>>
>>>>> The cube patch I've been testing with is at
>>>>> http://paste.debian.net/899374/
>>>>> You can flip between copying to a buffer and explicit retrieval by
>>>>> commenting out the relevant function calls.
>>>>>
>>>>>  src/intel/vulkan/anv_device.c      |   2 +-
>>>>>  src/intel/vulkan/anv_private.h     |   4 +
>>>>>  src/intel/vulkan/anv_query.c       |  99 ++++++++++----
>>>>>  src/intel/vulkan/genX_cmd_buffer.c | 260 ++++++++++++++++++++++++++++++++-----
>>>>>  4 files changed, 308 insertions(+), 57 deletions(-)
>>>>>
>>>>>
>>>>> diff --git a/src/intel/vulkan/anv_device.c b/src/intel/vulkan/anv_device.c
>>>>> index 99eb73c..7ad1970 100644
>>>>> --- a/src/intel/vulkan/anv_device.c
>>>>> +++ b/src/intel/vulkan/anv_device.c
>>>>> @@ -427,7 +427,7 @@ void anv_GetPhysicalDeviceFeatures(
>>>>>        .textureCompressionASTC_LDR               = pdevice->info.gen >= 9, /* FINISHME CHV */
>>>>>        .textureCompressionBC                     = true,
>>>>>        .occlusionQueryPrecise                    = true,
>>>>> -      .pipelineStatisticsQuery                  = false,
>>>>> +      .pipelineStatisticsQuery                  = true,
>>>>>        .fragmentStoresAndAtomics                 = true,
>>>>>        .shaderTessellationAndGeometryPointSize   = true,
>>>>>        .shaderImageGatherExtended                = false,
>>>>> diff --git a/src/intel/vulkan/anv_private.h b/src/intel/vulkan/anv_private.h
>>>>> index 2fc543d..7271609 100644
>>>>> --- a/src/intel/vulkan/anv_private.h
>>>>> +++ b/src/intel/vulkan/anv_private.h
>>>>> @@ -1763,6 +1763,8 @@ struct anv_render_pass {
>>>>>     struct anv_subpass                           subpasses[0];
>>>>>  };
>>>>>
>>>>> +#define ANV_PIPELINE_STATISTICS_COUNT 11
>>>>> +
>>>>>  struct anv_query_pool_slot {
>>>>>     uint64_t begin;
>>>>>     uint64_t end;
>>>>> @@ -1772,6 +1774,8 @@ struct anv_query_pool_slot {
>>>>>  struct anv_query_pool {
>>>>>     VkQueryType                                  type;
>>>>>     uint32_t                                     slots;
>>>>> +   uint32_t                                     pipeline_statistics;
>>>>> +   uint32_t                                     slot_stride;
>>>>>     struct anv_bo                                bo;
>>>>>  };
>>>>>
>>>>> diff --git a/src/intel/vulkan/anv_query.c b/src/intel/vulkan/anv_query.c
>>>>> index 293257b..dc00859 100644
>>>>> --- a/src/intel/vulkan/anv_query.c
>>>>> +++ b/src/intel/vulkan/anv_query.c
>>>>> @@ -38,8 +38,10 @@ VkResult anv_CreateQueryPool(
>>>>>     ANV_FROM_HANDLE(anv_device, device, _device);
>>>>>     struct anv_query_pool *pool;
>>>>>     VkResult result;
>>>>> -   uint32_t slot_size;
>>>>> -   uint64_t size;
>>>>> +   uint32_t slot_size = sizeof(struct anv_query_pool_slot);
>>>>> +   uint32_t slot_stride = 1;
>>>>> +   uint64_t size = pCreateInfo->queryCount * slot_size;
>>>>> +   uint32_t pipeline_statistics = 0;
>>>>>
>>>>>     assert(pCreateInfo->sType == VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO);
>>>>>
>>>>> @@ -48,12 +50,16 @@ VkResult anv_CreateQueryPool(
>>>>>     case VK_QUERY_TYPE_TIMESTAMP:
>>>>>        break;
>>>>>     case VK_QUERY_TYPE_PIPELINE_STATISTICS:
>>>>> -      return VK_ERROR_INCOMPATIBLE_DRIVER;
>>>>> +      pipeline_statistics = pCreateInfo->pipelineStatistics &
>>>>> +         ((1 << ANV_PIPELINE_STATISTICS_COUNT) - 1);
>>>>> +      slot_stride = _mesa_bitcount(pipeline_statistics);
>>>>> +      size *= slot_stride;
>>>>> +      break;
>>>>>     default:
>>>>>        assert(!"Invalid query type");
>>>>> +      return VK_ERROR_INCOMPATIBLE_DRIVER;
>>>>>     }
>>>>>
>>>>> -   slot_size = sizeof(struct anv_query_pool_slot);
>>>>>     pool = vk_alloc2(&device->alloc, pAllocator, sizeof(*pool), 8,
>>>>>                       VK_SYSTEM_ALLOCATION_SCOPE_OBJECT);
>>>>>     if (pool == NULL)
>>>>> @@ -61,8 +67,9 @@ VkResult anv_CreateQueryPool(
>>>>>
>>>>>     pool->type = pCreateInfo->queryType;
>>>>>     pool->slots = pCreateInfo->queryCount;
>>>>> +   pool->pipeline_statistics = pipeline_statistics;
>>>>> +   pool->slot_stride = slot_stride;
>>>>>
>>>>> -   size = pCreateInfo->queryCount * slot_size;
>>>>>     result = anv_bo_init_new(&pool->bo, device, size);
>>>>>     if (result != VK_SUCCESS)
>>>>>        goto fail;
>>>>> @@ -95,6 +102,27 @@ void anv_DestroyQueryPool(
>>>>>     vk_free2(&device->alloc, pAllocator, pool);
>>>>>  }
>>>>>
>>>>> +static void *
>>>>> +store_query_result(void *pData, VkQueryResultFlags flags,
>>>>> +                   uint64_t result, uint64_t available)
>>>>> +{
>>>>> +   if (flags & VK_QUERY_RESULT_64_BIT) {
>>>>> +      uint64_t *dst = pData;
>>>>> +      *dst++ = result;
>>>>> +      if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
>>>>> +         *dst++ = available;
>>>>> +      return dst;
>>>>> +   } else {
>>>>> +      uint32_t *dst = pData;
>>>>> +      if (result > UINT32_MAX)
>>>>> +         result = UINT32_MAX;
>>>>> +      *dst++ = result;
>>>>> +      if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
>>>>> +         *dst++ = available;
>>>>> +      return dst;
>>>>> +   }
>>>>> +}
>>>>> +
>>>>>  VkResult anv_GetQueryPoolResults(
>>>>>      VkDevice                                    _device,
>>>>>      VkQueryPool                                 queryPool,
>>>>> @@ -112,6 +140,7 @@ VkResult anv_GetQueryPoolResults(
>>>>>     int ret;
>>>>>
>>>>>     assert(pool->type == VK_QUERY_TYPE_OCCLUSION ||
>>>>> +          pool->type == VK_QUERY_TYPE_PIPELINE_STATISTICS ||
>>>>>            pool->type == VK_QUERY_TYPE_TIMESTAMP);
>>>>>
>>>>>     if (pData == NULL)
>>>>> @@ -129,14 +158,42 @@ VkResult anv_GetQueryPoolResults(
>>>>>     void *data_end = pData + dataSize;
>>>>>     struct anv_query_pool_slot *slot = pool->bo.map;
>>>>>
>>>>> -   for (uint32_t i = 0; i < queryCount; i++) {
>>>>> +   for (uint32_t i = 0; i < queryCount && pData < data_end;
>>>>> +        i++, pData += stride) {
>>>>> +      if (pool->type == VK_QUERY_TYPE_PIPELINE_STATISTICS) {
>>>>> +         VkQueryResultFlags f = flags & ~VK_QUERY_RESULT_WITH_AVAILABILITY_BIT;
>>>>> +         void *pos = pData;
>>>>> +         uint32_t pipeline_statistics = pool->pipeline_statistics;
>>>>> +         struct anv_query_pool_slot *base =
>>>>> +            &slot[(firstQuery + i) * pool->slot_stride];
>>>>> +
>>>>> +         while (pipeline_statistics) {
>>>>> +            uint32_t stat = u_bit_scan(&pipeline_statistics);
>>>>> +            uint64_t result = base->end - base->begin;
>>>>> +
>>>>> +            /* WaDividePSInvocationCountBy4:HSW,BDW */
>>>>> +            if ((device->info.gen == 8 || device->info.is_haswell) &&
>>>>> +                (1 << stat) == VK_QUERY_PIPELINE_STATISTIC_FRAGMENT_SHADER_INVOCATIONS_BIT)
>>>>> +               result >>= 2;
>>>>> +
>>>>> +            pos = store_query_result(pos, f, result, 0);
>>>>> +            base++;
>>>>> +         }
>>>>> +         if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT) {
>>>>> +            base--;
>>>>> +            if (flags & VK_QUERY_RESULT_64_BIT)
>>>>> +               *(uint64_t *)pos = base->available;
>>>>> +            else
>>>>> +               *(uint32_t *)pos = base->available;
>>>>> +         }
>>>>> +         continue;
>>>>> +      }
>>>>> +
>>>>>        switch (pool->type) {
>>>>>        case VK_QUERY_TYPE_OCCLUSION: {
>>>>>           result = slot[firstQuery + i].end - slot[firstQuery + i].begin;
>>>>>           break;
>>>>>        }
>>>>> -      case VK_QUERY_TYPE_PIPELINE_STATISTICS:
>>>>> -         unreachable("pipeline stats not supported");
>>>>>        case VK_QUERY_TYPE_TIMESTAMP: {
>>>>>           result = slot[firstQuery + i].begin;
>>>>>           break;
>>>>> @@ -145,23 +202,7 @@ VkResult anv_GetQueryPoolResults(
>>>>>           unreachable("invalid pool type");
>>>>>        }
>>>>>
>>>>> -      if (flags & VK_QUERY_RESULT_64_BIT) {
>>>>> -         uint64_t *dst = pData;
>>>>> -         dst[0] = result;
>>>>> -         if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
>>>>> -            dst[1] = slot[firstQuery + i].available;
>>>>> -      } else {
>>>>> -         uint32_t *dst = pData;
>>>>> -         if (result > UINT32_MAX)
>>>>> -            result = UINT32_MAX;
>>>>> -         dst[0] = result;
>>>>> -         if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT)
>>>>> -            dst[1] = slot[firstQuery + i].available;
>>>>> -      }
>>>>> -
>>>>> -      pData += stride;
>>>>> -      if (pData >= data_end)
>>>>> -         break;
>>>>> +      store_query_result(pData, flags, result, slot[firstQuery + i].available);
>>>>>     }
>>>>>
>>>>>     return VK_SUCCESS;
>>>>> @@ -183,6 +224,14 @@ void anv_CmdResetQueryPool(
>>>>>           slot[firstQuery + i].available = 0;
>>>>>           break;
>>>>>        }
>>>>> +      case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
>>>>> +         struct anv_query_pool_slot *slot = pool->bo.map;
>>>>> +
>>>>> +         slot = &slot[(firstQuery + i) * pool->slot_stride];
>>>>> +         for (uint32_t j = 0; j < pool->slot_stride; j++)
>>>>> +            slot[j].available = 0;
>>>>> +         break;
>>>>> +      }
>>>>>        default:
>>>>>           assert(!"Invalid query type");
>>>>>        }
>>>>> diff --git a/src/intel/vulkan/genX_cmd_buffer.c b/src/intel/vulkan/genX_cmd_buffer.c
>>>>> index a965cd6..1369ac2 100644
>>>>> --- a/src/intel/vulkan/genX_cmd_buffer.c
>>>>> +++ b/src/intel/vulkan/genX_cmd_buffer.c
>>>>> @@ -2272,6 +2272,50 @@ emit_query_availability(struct anv_cmd_buffer *cmd_buffer,
>>>>>     }
>>>>>  }
>>>>>
>>>>> +#define IA_VERTICES_COUNT               0x2310
>>>>> +#define IA_PRIMITIVES_COUNT             0x2318
>>>>> +#define VS_INVOCATION_COUNT             0x2320
>>>>> +#define HS_INVOCATION_COUNT             0x2300
>>>>> +#define DS_INVOCATION_COUNT             0x2308
>>>>> +#define GS_INVOCATION_COUNT             0x2328
>>>>> +#define GS_PRIMITIVES_COUNT             0x2330
>>>>> +#define CL_INVOCATION_COUNT             0x2338
>>>>> +#define CL_PRIMITIVES_COUNT             0x2340
>>>>> +#define PS_INVOCATION_COUNT             0x2348
>>>>> +#define CS_INVOCATION_COUNT             0x2290
>>>>> +
>>>>> +static const uint32_t PIPELINE_STAT_TO_REG[] = {
>>>>> +   IA_VERTICES_COUNT,
>>>>> +   IA_PRIMITIVES_COUNT,
>>>>> +   VS_INVOCATION_COUNT,
>>>>> +   GS_INVOCATION_COUNT,
>>>>> +   GS_PRIMITIVES_COUNT,
>>>>> +   CL_INVOCATION_COUNT,
>>>>> +   CL_PRIMITIVES_COUNT,
>>>>> +   PS_INVOCATION_COUNT,
>>>>> +   HS_INVOCATION_COUNT,
>>>>> +   DS_INVOCATION_COUNT,
>>>>> +   CS_INVOCATION_COUNT
>>>>> +};
>>>>> +
>>>>> +static void
>>>>> +emit_pipeline_stat(struct anv_cmd_buffer *cmd_buffer, uint32_t stat,
>>>>> +                   struct anv_bo *bo, uint32_t offset) {
>>>>> +   STATIC_ASSERT(ARRAY_SIZE(PIPELINE_STAT_TO_REG) ==
>>>>> +                 ANV_PIPELINE_STATISTICS_COUNT);
>>>>> +
>>>>> +   uint32_t reg = PIPELINE_STAT_TO_REG[stat];
>>>>> +
>>>>> +   anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_REGISTER_MEM), lrm) {
>>>>> +      lrm.RegisterAddress  = reg;
>>>>> +      lrm.MemoryAddress    = (struct anv_address) { bo, offset };
>>>>> +   }
>>>>> +   anv_batch_emit(&cmd_buffer->batch, GENX(MI_STORE_REGISTER_MEM), lrm) {
>>>>> +      lrm.RegisterAddress  = reg + 4;
>>>>> +      lrm.MemoryAddress    = (struct anv_address) { bo, offset + 4 };
>>>>> +   }
>>>>> +}
>>>>> +
>>>>>  void genX(CmdBeginQuery)(
>>>>>      VkCommandBuffer                             commandBuffer,
>>>>>      VkQueryPool                                 queryPool,
>>>>> @@ -2301,7 +2345,25 @@ void genX(CmdBeginQuery)(
>>>>>                            query * sizeof(struct anv_query_pool_slot));
>>>>>        break;
>>>>>
>>>>> -   case VK_QUERY_TYPE_PIPELINE_STATISTICS:
>>>>> +   case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
>>>>> +      uint32_t pipeline_statistics = pool->pipeline_statistics;
>>>>> +      uint32_t slot_offset = query * pool->slot_stride *
>>>>> +         sizeof(struct anv_query_pool_slot);
>>>>> +
>>>>> +      /* TODO: This might only be necessary for certain stats */
>>>>> +      anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {
>>>>> +         pc.CommandStreamerStallEnable = true;
>>>>> +         pc.StallAtPixelScoreboard = true;
>>>>> +      }
>>>>
>>>>
>>>> I couldn't find a reference for needing this for workarounds, though I saw
>>>> the workaround Ben implemented for Gen9 + GT4 using a CS stall when
>>>> querying timestamps, which is handled separately.
>>>
>>> Well, experimentally, I need it for the clipper and frag shader
>>> invocation counts to show up - otherwise they're always 0. This is on
>>> SKL GT2 with the vk-loader cube patched up as described. Perhaps the
>>> timing works out s.t. the counters all update outside of the begin/end
>>> "critical section" otherwise, so even though they're moving, the
>>> difference remains 0.
>>>
>>>>
>>>> Depending on how strictly we consider that the queries should only measure
>>>> the commands they bracket then I think some stalling will be necessary to
>>>> serialize the work associated with a query and defer reading the end state
>>>> until after the relevant stages have completed their work.
>>>>
>>>> We aren't very precise about this in GL currently, but in Begin maybe we
>>>> should stall until everything >= the statistic-stage is idle and in End
>>>> stall until everything <= the statistic-stage is idle before reading
>>>> (where 'statistic-stage' here is the pipeline stage associated with the
>>>> pipeline statistic being queried (or respectively the min/max stage for
>>>> a set)).
>>>>
>>>> For reference in my implementation of INTEL_performance_query facing this
>>>> same question, I'm currently just stalling before and after queries:
>>>>
>>>>
>>>> https://github.com/rib/mesa/blob/wip/rib/oa-next/src/mesa/drivers/dri/i965/brw_performance_query.c#L994
>>>>
>>>> https://github.com/rib/mesa/blob/wip/rib/oa-next/src/mesa/drivers/dri/i965/brw_performance_query.c#L1136
>>>
>>> So that's essentially what I'm doing here, I think. (And what the GL
>>> driver does.)

Yup, the upshot might just be a comment explaining the need for a
stall. I think we probably need a stall in CmdEndQuery too, otherwise
the command streamer may read the end counter before the work has
finished.
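
Something like this before the end-of-query reads (just a sketch reusing
the PIPE_CONTROL fields the patch already emits, plus the comment I have
in mind):

   /* Stall the command streamer so the work bracketed by the query
    * retires before MI_STORE_REGISTER_MEM snapshots the end counters;
    * without a stall the begin/end deltas can read back as 0.
    */
   anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {
      pc.CommandStreamerStallEnable = true;
      pc.StallAtPixelScoreboard = true;
   }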

>>> brw_emit_mi_flush actually emits a lot more flushes than
>>> this, but I poked around and the main thing is to have a command
>>> streamer stall, which on earlier gens requires at least one other bit
>>> set.

Yeah, and there's a fun workaround on Ivy Bridge apparently requiring a
CS stall for every fourth PIPE_CONTROL, which brw_emit_mi_flush helps
with. I chose to use that as the simplest hammer that hopefully handled
the workarounds. It doesn't look like PIPE_CONTROL emission is wrapped
within the Vulkan driver atm, so I don't know if/how some of the same
workarounds are handled here yet. It seems reasonable to emit the
PIPE_CONTROL unwrapped here (i.e. it shouldn't be your concern whether a
wrapper would be good for handling workarounds at some point).

>>>
>>> [Please note that I have no real familiarity with the hardware - I'm
>>> mostly just looking at what the GL driver does and copying it here
>>> with minor adjustments.]
>>>
>>>>
>>>>> +
>>>>> +      while (pipeline_statistics) {
>>>>> +         uint32_t stat = u_bit_scan(&pipeline_statistics);
>>>>> +
>>>>> +         emit_pipeline_stat(cmd_buffer, stat, &pool->bo, slot_offset);
>>>>> +         slot_offset += sizeof(struct anv_query_pool_slot);
>>>>> +      }
>>>>> +      break;
>>>>> +   }
>>>>>     default:
>>>>>        unreachable("");
>>>>>     }
>>>>> @@ -2314,17 +2376,35 @@ void genX(CmdEndQuery)(
>>>>>  {
>>>>>     ANV_FROM_HANDLE(anv_cmd_buffer, cmd_buffer, commandBuffer);
>>>>>     ANV_FROM_HANDLE(anv_query_pool, pool, queryPool);
>>>>> +   uint32_t slot_offset = query * pool->slot_stride *
>>>>> +      sizeof(struct anv_query_pool_slot);
>>>>>
>>>>>     switch (pool->type) {
>>>>>     case VK_QUERY_TYPE_OCCLUSION:
>>>>> -      emit_ps_depth_count(cmd_buffer, &pool->bo,
>>>>> -                          query * sizeof(struct anv_query_pool_slot) + 8);
>>>>> +      emit_ps_depth_count(cmd_buffer, &pool->bo, slot_offset + 8);
>>>>> +      emit_query_availability(cmd_buffer, &pool->bo, slot_offset + 16);
>>>>> +      break;
>>>>> +
>>>>> +   case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
>>>>> +      uint32_t pipeline_statistics = pool->pipeline_statistics;
>>>>> +      /* TODO: This might only be necessary for certain stats */
>>>>> +      anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) {
>>>>> +         pc.CommandStreamerStallEnable = true;
>>>>> +         pc.StallAtPixelScoreboard = true;
>>>>> +      }
>>>>> +
>>>>> +      while (pipeline_statistics) {
>>>>> +         uint32_t stat = u_bit_scan(&pipeline_statistics);
>>>>>
>>>>> -      emit_query_availability(cmd_buffer, &pool->bo,
>>>>> -                              query * sizeof(struct anv_query_pool_slot) + 16);
>>>>> +         emit_pipeline_stat(cmd_buffer, stat, &pool->bo, slot_offset + 8);
>>>>> +         slot_offset += sizeof(struct anv_query_pool_slot);
>>>>> +      }
>>>>> +
>>>>> +      slot_offset -= sizeof(struct anv_query_pool_slot);
>>>>> +      emit_query_availability(cmd_buffer, &pool->bo, slot_offset + 16);
>>>>>        break;
>>>>> +   }
>>>>>
>>>>> -   case VK_QUERY_TYPE_PIPELINE_STATISTICS:
>>>>>     default:
>>>>>        unreachable("");
>>>>>     }
>>>>> @@ -2421,6 +2501,31 @@ emit_load_alu_reg_u64(struct anv_batch *batch, uint32_t reg,
>>>>>  }
>>>>>
>>>>>  static void
>>>>> +emit_load_alu_reg_imm32(struct anv_batch *batch, uint32_t reg, uint32_t imm)
>>>>> +{
>>>>> +   anv_batch_emit(batch, GENX(MI_LOAD_REGISTER_IMM), lri) {
>>>>> +      lri.RegisterOffset   = reg;
>>>>> +      lri.DataDWord        = imm;
>>>>> +   }
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +emit_load_alu_reg_imm64(struct anv_batch *batch, uint32_t reg, uint64_t imm)
>>>>> +{
>>>>> +   emit_load_alu_reg_imm32(batch, reg, (uint32_t)imm);
>>>>> +   emit_load_alu_reg_imm32(batch, reg + 4, (uint32_t)(imm >> 32));
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +emit_load_alu_reg_reg32(struct anv_batch *batch, uint32_t src, uint32_t dst)
>>>>> +{
>>>>> +   anv_batch_emit(batch, GENX(MI_LOAD_REGISTER_REG), lrr) {
>>>>> +      lrr.SourceRegisterAddress      = src;
>>>>> +      lrr.DestinationRegisterAddress = dst;
>>>>> +   }
>>>>> +}
>>>>> +
>>>>> +static uint32_t
>>>>>  store_query_result(struct anv_batch *batch, uint32_t reg,
>>>>>                     struct anv_bo *bo, uint32_t offset, VkQueryResultFlags flags)
>>>>>  {
>>>>> @@ -2434,9 +2539,88 @@ store_query_result(struct anv_batch, uint32_t reg,
>>>>>           srm.RegisterAddress  = reg + 4;
>>>>>           srm.MemoryAddress    = (struct anv_address) { bo, offset + 4 };
>>>>>        }
>>>>> +
>>>>> +      return offset + 8;
>>>>> +   }
>>>>> +
>>>>> +   return offset + 4;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * GPR0 = GPR0 & ((1ull << n) - 1);
>>>>> + */
>>>>> +static void
>>>>> +keep_gpr0_lower_n_bits(struct anv_batch *batch, uint32_t n)
>>>>> +{
>>>>> +   assert(n < 64);
>>>>> +   emit_load_alu_reg_imm64(batch, CS_GPR(1), (1ull << n) - 1);
>>>>> +
>>>>> +   uint32_t *dw = anv_batch_emitn(batch, 5, GENX(MI_MATH));
>>>>> +   dw[1] = alu(OPCODE_LOAD, OPERAND_SRCA, OPERAND_R0);
>>>>> +   dw[2] = alu(OPCODE_LOAD, OPERAND_SRCB, OPERAND_R1);
>>>>> +   dw[3] = alu(OPCODE_AND, 0, 0);
>>>>> +   dw[4] = alu(OPCODE_STORE, OPERAND_R0, OPERAND_ACCU);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * GPR0 = GPR0 << 30;
>>>>> + */
>>>>> +static void
>>>>> +shl_gpr0_by_30_bits(struct anv_batch *batch)
>>>>> +{
>>>>> +   /* First we mask 34 bits of GPR0 to prevent overflow */
>>>>> +   keep_gpr0_lower_n_bits(batch, 34);
>>>>> +
>>>>> +   const uint32_t outer_count = 5;
>>>>> +   const uint32_t inner_count = 6;
>>>>> +   STATIC_ASSERT(outer_count * inner_count == 30);
>>>>> +   const uint32_t cmd_len = 1 + inner_count * 4;
>>>>> +
>>>>> +   /* We'll emit 5 commands, each shifting GPR0 left by 6 bits, for a total of
>>>>> +    * 30 left shifts.
>>>>> +    */
>>>>> +   for (int o = 0; o < outer_count; o++) {
>>>>> +      /* Submit one MI_MATH to shift left by 6 bits */
>>>>> +      uint32_t *dw = anv_batch_emitn(batch, cmd_len, GENX(MI_MATH));
>>>>> +      dw++;
>>>>> +      for (int i = 0; i < inner_count; i++, dw += 4) {
>>>>> +         dw[0] = alu(OPCODE_LOAD, OPERAND_SRCA, OPERAND_R0);
>>>>> +         dw[1] = alu(OPCODE_LOAD, OPERAND_SRCB, OPERAND_R0);
>>>>> +         dw[2] = alu(OPCODE_ADD, 0, 0);
>>>>> +         dw[3] = alu(OPCODE_STORE, OPERAND_R0, OPERAND_ACCU);
>>>>> +      }
>>>>>     }
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * GPR0 = GPR0 >> 2;
>>>>> + *
>>>>> + * Note that the upper 30 bits of GPR are lost!
>>>>> + */
>>>>> +static void
>>>>> +shr_gpr0_by_2_bits(struct anv_batch *batch)
>>>>> +{
>>>>> +   shl_gpr0_by_30_bits(batch);
>>>>> +   emit_load_alu_reg_reg32(batch, CS_GPR(0) + 4, CS_GPR(0));
>>>>> +   emit_load_alu_reg_imm32(batch, CS_GPR(0) + 4, 0);


I recently noticed from inspecting the original hsw_queryobj.c code that
this looks suspicious.

Conceptually it makes sense to implement a right shift as a left shift
by (32 - n) and then keeping the upper 32 bits, but the emit_load_
functions take a destination followed by a source, and so it looks like
after the left shift it's copying the least significant 32 bits of R0
over the most significant and then setting the most significant 32 bits
of R0 to zero. It looks like the first load_alu is redundant if the
second one just writes zero to the same location.

Maybe I'm misreading something here though; this comment is just based
on inspection.
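
To spell out the arithmetic I'd expect the sequence to implement, here's
a plain-C model (assuming the register copy really does move the upper
dword of GPR0 down into the lower dword - which is exactly the part that
looks questionable above):

   /* Model of shr_gpr0_by_2_bits(): a right shift by 2 built from a
    * left shift by 30 followed by keeping the upper 32 bits.
    */
   static uint64_t
   model_shr_gpr0_by_2_bits(uint64_t gpr0)
   {
      gpr0 &= (1ull << 34) - 1;   /* keep_gpr0_lower_n_bits(34)       */
      gpr0 <<= 30;                /* shl_gpr0_by_30_bits(): gpr0 >> 2
                                   * now sits in the upper 32 bits    */
      return gpr0 >> 32;          /* copy the upper dword down and
                                   * zero the top, i.e. gpr0 >> 2     */
   }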

>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +compute_query_result(struct anv_batch *batch, struct anv_bo *bo,
>>>>> +                     uint32_t dst_reg, uint32_t offset)
>>>>> +{
>>>>> +   emit_load_alu_reg_u64(batch, CS_GPR(0), bo, offset);
>>>>> +   emit_load_alu_reg_u64(batch, CS_GPR(1), bo, offset + 8);
>>>>> +
>>>>> +   /* FIXME: We need to clamp the result for 32 bit. */
>>>>> +
>>>>> +   uint32_t *dw = anv_batch_emitn(batch, 5, GENX(MI_MATH));
>>>>> +   dw[1] = alu(OPCODE_LOAD, OPERAND_SRCA, OPERAND_R1);
>>>>> +   dw[2] = alu(OPCODE_LOAD, OPERAND_SRCB, OPERAND_R0);
>>>>> +   dw[3] = alu(OPCODE_SUB, 0, 0);
>>>>> +   dw[4] = alu(OPCODE_STORE, dst_reg, OPERAND_ACCU);
>>>>> +}
>>>>> +
>>>>>  void genX(CmdCopyQueryPoolResults)(
>>>>>      VkCommandBuffer                             commandBuffer,
>>>>>      VkQueryPool                                 queryPool,
>>>>> @@ -2459,50 +2643,64 @@ void genX(CmdCopyQueryPoolResults)(
>>>>>        }
>>>>>     }
>>>>>
>>>>> -   dst_offset = buffer->offset + destOffset;
>>>>>     for (uint32_t i = 0; i < queryCount; i++) {
>>>>> -
>>>>> +      dst_offset = buffer->offset + destOffset + destStride * i;
>>>>>        slot_offset = (firstQuery + i) * sizeof(struct anv_query_pool_slot);
>>>>>        switch (pool->type) {
>>>>>        case VK_QUERY_TYPE_OCCLUSION:
>>>>> -         emit_load_alu_reg_u64(&cmd_buffer->batch,
>>>>> -                               CS_GPR(0), &pool->bo, slot_offset);
>>>>> -         emit_load_alu_reg_u64(&cmd_buffer->batch,
>>>>> -                               CS_GPR(1), &pool->bo, slot_offset + 8);
>>>>> -
>>>>> -         /* FIXME: We need to clamp the result for 32 bit. */
>>>>> -
>>>>> -         uint32_t *dw = anv_batch_emitn(&cmd_buffer->batch, 5, GENX(MI_MATH));
>>>>> -         dw[1] = alu(OPCODE_LOAD, OPERAND_SRCA, OPERAND_R1);
>>>>> -         dw[2] = alu(OPCODE_LOAD, OPERAND_SRCB, OPERAND_R0);
>>>>> -         dw[3] = alu(OPCODE_SUB, 0, 0);
>>>>> -         dw[4] = alu(OPCODE_STORE, OPERAND_R2, OPERAND_ACCU);
>>>>> +         compute_query_result(&cmd_buffer->batch, &pool->bo, OPERAND_R2,
>>>>> +                              slot_offset);
>>>>> +         dst_offset = store_query_result(
>>>>> +               &cmd_buffer->batch,
>>>>> +               CS_GPR(2), buffer->bo, dst_offset, flags);
>>>>>           break;
>>>>>
>>>>>        case VK_QUERY_TYPE_TIMESTAMP:
>>>>>           emit_load_alu_reg_u64(&cmd_buffer->batch,
>>>>>                                 CS_GPR(2), &pool->bo, slot_offset);
>>>>> +         dst_offset = store_query_result(
>>>>> +               &cmd_buffer->batch,
>>>>> +               CS_GPR(2), buffer->bo, dst_offset, flags);
>>>>>           break;
>>>>>
>>>>> +      case VK_QUERY_TYPE_PIPELINE_STATISTICS: {
>>>>> +         uint32_t pipeline_statistics = pool->pipeline_statistics;
>>>>> +
>>>>> +         slot_offset *= pool->slot_stride;
>>>>> +         while (pipeline_statistics) {
>>>>> +            uint32_t stat = u_bit_scan(&pipeline_statistics);
>>>>> +
>>>>> +            compute_query_result(&cmd_buffer->batch, &pool->bo, OPERAND_R0,
>>>>> +                                 slot_offset);
>>>>> +
>>>>> +            /* WaDividePSInvocationCountBy4:HSW,BDW */
>>>>> +            if ((cmd_buffer->device->info.gen == 8 ||
>>>>> +                 cmd_buffer->device->info.is_haswell) &&
>>>>> +                (1 << stat) == VK_QUERY_PIPELINE_STATISTIC_FRAGMENT_SHADER_INVOCATIONS_BIT) {
>>>>> +               shr_gpr0_by_2_bits(&cmd_buffer->batch);
>>>>> +            }
>>>>> +            dst_offset = store_query_result(
>>>>> +                  &cmd_buffer->batch,
>>>>> +                  CS_GPR(0), buffer->bo, dst_offset, flags);
>>>>> +            slot_offset += sizeof(struct anv_query_pool_slot);
>>>>> +         }
>>>>> +
>>>>> +         /* Get the slot offset to where it's supposed to be for the
>>>>> +          * availability bit.
>>>>> +          */
>>>>> +         slot_offset -= sizeof(struct anv_query_pool_slot);
>>>>> +         break;
>>>>> +      }
>>>>>        default:
>>>>>           unreachable("unhandled query type");
>>>>>        }
>>>>>
>>>>> -      store_query_result(&cmd_buffer->batch,
>>>>> -                         CS_GPR(2), buffer->bo, dst_offset, flags);
>>>>> -
>>>>>        if (flags & VK_QUERY_RESULT_WITH_AVAILABILITY_BIT) {
>>>>>           emit_load_alu_reg_u64(&cmd_buffer->batch, CS_GPR(0),
>>>>>                                 &pool->bo, slot_offset + 16);
>>>>> -         if (flags & VK_QUERY_RESULT_64_BIT)
>>>>> -            store_query_result(&cmd_buffer->batch,
>>>>> -                               CS_GPR(0), buffer->bo, dst_offset + 8, flags);
>>>>> -         else
>>>>> -            store_query_result(&cmd_buffer->batch,
>>>>> -                               CS_GPR(0), buffer->bo, dst_offset + 4, flags);
>>>>> +         store_query_result(&cmd_buffer->batch,
>>>>> +                            CS_GPR(0), buffer->bo, dst_offset, flags);
>>>>>        }
>>>>> -
>>>>> -      dst_offset += destStride;
>>>>>     }
>>>>>  }
>>>>
>>>>
>>>> I was a bit surprised to see us needing to go the same route as GL query
>>>> objects with using the command streamer's very limited ALU for fixing up
>>>> metrics on the gpu instead of doing that on the CPU if possible.
>>>>
>>>> My understanding had been that multiple HW vendors pushed back on the
>>>> design of these query apis to ensure they could deal with counter
>>>> scaling and normalization on the CPU without needing to support direct
>>>> feedback into the gpu. Reading the spec though it does unfortunately
>>>> look like vkCmdCopyQueryPoolResults implies scaling on the gpu :-/
>>>
>>> Yeah, that looks identical to ARB_query_buffer_object functionality,
>>> with very minor differences.
>>>
>>>>
>>>> One thorny problem we have today at least with GL is that we don't scale
>>>> Skylake timestamps properly when using the ALU considering the extra
>>>> annoyance of scaling by a floating point factor of 83.3333. In the future
>>>> this scale factor might ideally not even be a constant if we can take
>>>> advantage of training done in the kernel to determine a more accurate
>>>> frequency (ref:
>>>> https://lists.freedesktop.org/archives/intel-gfx/2016-June/097272.html)
>>>> Also for reference here, there's this patch that looks at fixing the
>>>> timestamp scaling done on the cpu for SKL:
>>>> https://patchwork.freedesktop.org/patch/118855/ I understand you aren't
>>>> dealing with timestamp scaling here, but it seems like an example of
>>>> where it's getting a bit out-of-hand using the ALU for this type of
>>>> purpose.
>>>
>>> I haven't checked, but I hope that the timestamp units can be in
>>> whatever you want. So you can hopefully get away without scaling.
>>>

You were right; there's a float
VkPhysicalDeviceLimits::timestampPeriod property.

That was really good to find out.
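
For reference, that means the raw 64-bit ticks can be scaled entirely on
the CPU after readback, along these lines (a sketch; the variable names
are mine):

   VkPhysicalDeviceProperties props;
   vkGetPhysicalDeviceProperties(physical_device, &props);

   /* timestampPeriod is the number of nanoseconds per timestamp tick */
   uint64_t ticks = end_timestamp - begin_timestamp;
   double elapsed_ns = (double)ticks * props.limits.timestampPeriod;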

>>>>
>>>> Can we maybe at least consider the complexity trade-off from using a
>>>> shader instead of ALU - or consider if there's a legitimate way of
>>>> using the cpu here that we're missing (though I suppose not).
>>>
>>> Go ahead :) I was just copying the GL approach. Happy to rip out the
>>> MI_MATH stuff and make the feature IVB + SKL+ only for now if you all
>>> agree that the MI_MATH stuff is bad. IMHO the complexity of the
>>> MI_MATH logic is quite small for the / 4 case, but I don't feel like
>>> I'm particularly part of the decision committee.

Just for reference, I spent a bit of time late last year prototyping
support for multiplication with our ALU:

http://pastebin.com/dHmzbGQt

(I was wondering about the practicality of handling scaling on SKL and
BXT in GL.)

Anyway, since we have the timestampPeriod constant with Vulkan, my main
concern here is gone.

If you double-check my comment above about the shifting looking suspect,
maybe nab the generalized shift I wrote in that pastebin if it's of any
use.
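
The shape of it is roughly this (an untested sketch built from the same
add-to-self doubling trick as shl_gpr0_by_30_bits above, emitting one
MI_MATH per doubling for simplicity):

   /* GPR0 = GPR0 << n, as n successive R0 = R0 + R0 doubling steps */
   static void
   shl_gpr0_by_n_bits(struct anv_batch *batch, uint32_t n)
   {
      assert(n < 64);
      for (uint32_t i = 0; i < n; i++) {
         uint32_t *dw = anv_batch_emitn(batch, 5, GENX(MI_MATH));
         dw[1] = alu(OPCODE_LOAD, OPERAND_SRCA, OPERAND_R0);
         dw[2] = alu(OPCODE_LOAD, OPERAND_SRCB, OPERAND_R0);
         dw[3] = alu(OPCODE_ADD, 0, 0);
         dw[4] = alu(OPCODE_STORE, OPERAND_R0, OPERAND_ACCU);
      }
   }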


Thanks for looking at this; on the whole it looks good to me.

Br,
- Robert

>>>
>>> Cheers,
>>>
>>>   -ilia
>>>
>>>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
