[Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.
Song, Ruiling
ruiling.song at intel.com
Tue Nov 10 18:26:36 PST 2015
I carefully read the implementation, and it looks good.
But according to Zhigang's experiments, some cases suffer a performance regression.
He thought the main problem still lies in the inaccurate latency model.
Also, pre-scheduling may leave the post-scheduler unable to re-schedule instructions because of physical register dependencies.
We need to tune the instruction scheduling further to achieve a good balance between ILP and register pressure.
I think we can merge the patch but keep pre-scheduling disabled, and tune the instruction scheduling later.
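To be concrete, keeping it merged but disabled could just mean gating the pass on the
existing OCL_PRE_ALLOC_INSN_SCHEDULE knob, with the knob defaulting to off. A rough
sketch of what I mean (the exact call into preScheduleDAG() below is my assumption,
not copied from the patch):

    // Sketch only: run the new bottom-up pre-scheduler only when the knob is
    // explicitly enabled, so merging the patch changes nothing by default.
    void schedulePreRegAllocation(GenContext &ctx, Selection &selection) {
      if (!OCL_PRE_ALLOC_INSN_SCHEDULE)   // default: pre-scheduling stays disabled
        return;
      SelectionScheduler scheduler(ctx, selection, PRE_ALLOC);
      for (auto &bb : *selection.blockList) {
        const int32_t insnNum = scheduler.buildDAG(bb);
        bb.insnList.clear();
        scheduler.preScheduleDAG(bb, insnNum);  // assumed call site
      }
    }
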
Thanks!
Ruiling
> -----Original Message-----
> From: Beignet [mailto:beignet-bounces at lists.freedesktop.org] On Behalf Of
> Zhigang Gong
> Sent: Wednesday, September 16, 2015 10:46 AM
> To: beignet at lists.freedesktop.org
> Cc: Gong, Zhigang
> Subject: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction
> scheduling.
>
> Finding an instruction scheduling policy that achieves the theoretical minimum
> number of registers required in a basic block is an NP-hard problem. We have to
> use heuristics to simplify the algorithm. Much research indicates that bottom-up
> list scheduling is much better than the top-down method in terms of register
> pressure. I chose one such research paper as our reference. The paper is as below:
>
> "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
> It uses bottom-up list scheduling with a Sethi-Ullman label as the heuristic
> number. As we do cycle-aware scheduling after register allocation, we don't need
> to bother with a cycle-related heuristic here, so I just skipped the EST
> computation and its usage in the algorithm.
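>
> For illustration only (my own sketch, not code from the patch): with a node's
> children sorted by increasing label l0 <= ... <= l(n-1), the node's label is
> max over i of (l_i + n - i). A standalone C++ sketch of that computation, using
> hypothetical Node/computeLabel names:
>
>   #include <algorithm>
>   #include <cstdint>
>   #include <cstdio>
>   #include <vector>
>
>   struct Node { std::vector<Node*> children; uint32_t label = 0; };
>
>   // Sethi-Ullman-style label, mirroring the heuristic above: leaves get 0,
>   // inner nodes take max(l_i + n - i) over children sorted by ascending label.
>   // (The patch memoizes this per node via regPressureMap; omitted here.)
>   static uint32_t computeLabel(Node *n) {
>     if (n->children.empty())
>       return n->label = 0;
>     std::vector<uint32_t> ls;
>     for (Node *c : n->children)
>       ls.push_back(computeLabel(c));
>     std::sort(ls.begin(), ls.end());
>     uint32_t best = 0;
>     for (size_t i = 0; i < ls.size(); ++i)
>       best = std::max(best, (uint32_t)(ls[i] + ls.size() - i));
>     return n->label = best;
>   }
>
>   int main() {
>     // Tiny DAG: two leaves feeding one node, which feeds the root.
>     Node a, b, mul, root;
>     mul.children = {&a, &b};
>     root.children = {&mul};
>     printf("mul=%u root=%u\n", computeLabel(&mul), computeLabel(&root)); // mul=2 root=3
>     return 0;
>   }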
>
> It turns out this algorithm works well: it reduces register spilling in
> clBlas's sgemmBlock kernel from 83+ spills to only 20.
>
> Although this scheduling method seems to lower ILP (instruction-level parallelism),
> it is not a big issue: we will allocate as many different registers as possible
> in the following register allocation stage, and we will do a post-allocation
> instruction scheduling pass which tries to recover as much ILP as possible.
>
> Signed-off-by: Zhigang Gong <zhigang.gong at intel.com>
> ---
> backend/src/backend/gen_insn_scheduling.cpp | 137 +++++++++++++++++++++++-----
> 1 file changed, 116 insertions(+), 21 deletions(-)
>
> diff --git a/backend/src/backend/gen_insn_scheduling.cpp b/backend/src/backend/gen_insn_scheduling.cpp
> index 358a2ce..f4f1e70 100644
> --- a/backend/src/backend/gen_insn_scheduling.cpp
> +++ b/backend/src/backend/gen_insn_scheduling.cpp
> @@ -41,26 +41,29 @@
> * ==============================
> *
> * We try to limit the register pressure.
> - * Well, this is a hard problem and we have a decent strategy now that we called
> - * "zero cycled LIFO scheduling".
> - * We use a local forward list scheduling and we schedule the instructions in a
> - * LIFO order i.e. as a stack. Basically, we take the most recent instruction
> - * and schedule it right away. Obviously we ignore completely the real latencies
> - * and throuputs and just simulate instructions that are issued and completed in
> - * zero cycle. For the complex kernels we already have (like menger sponge),
> - * this provides a pretty good strategy enabling SIMD16 code generation where
> - * when scheduling is deactivated, even SIMD8 fails
> *
> - * One may argue that this strategy is bad, latency wise. This is not true since
> - * the register allocator will anyway try to burn as many registers as possible.
> - * So, there is still opportunities to schedule after register allocation.
> + * Finding an instruction scheduling policy that achieves the theoretical minimum
> + * number of registers required in a basic block is an NP-hard problem. We have
> + * to use heuristics to simplify the algorithm. Much research indicates that
> + * bottom-up list scheduling is much better than the top-down method in terms of
> + * register pressure. I chose one such research paper as our reference. The paper
> + * is as below:
> *
> - * Our idea seems to work decently. There is however a strong research article
> - * that is able to near-optimally reschudle the instructions to minimize
> - * register use. This is:
> + * "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
> + * It uses bottom-up list scheduling with a Sethi-Ullman label as the heuristic
> + * number. As we do cycle-aware scheduling after register allocation, we don't
> + * need to bother with a cycle-related heuristic here, so I just skipped the EST
> + * computation and its usage in the algorithm.
> *
> - * "Minimum Register Instruction Sequence Problem: Revisiting Optimal Code
> - * Generation for DAGs"
> + * It turns out this algorithm works well: it reduces register spilling in
> + * clBlas's sgemmBlock kernel from 83+ spills to only 20.
> + *
> + * Although this scheduling method seems to lower ILP (instruction-level parallelism),
> + * it is not a big issue: we will allocate as many different registers as possible
> + * in the following register allocation stage, and we will do a post-allocation
> + * instruction scheduling pass which tries to recover as much ILP as possible.
> + *
> + * FIXME: we only need to do this scheduling when a BB is really under high register pressure.
> *
> * After the register allocation
> * ==============================
> @@ -114,7 +117,7 @@ namespace gbe
> struct ScheduleDAGNode
> {
> INLINE ScheduleDAGNode(SelectionInstruction &insn) :
> - insn(insn), refNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fffffff) {}
> + insn(insn), refNum(0), depNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fffffff) {}
> bool dependsOn(ScheduleDAGNode *node) const {
> GBE_ASSERT(node != NULL);
> for (auto child : node->children)
> @@ -128,6 +131,10 @@ namespace gbe
> SelectionInstruction &insn;
> /*! Number of nodes that point to us (i.e. nodes we depend on) */
> uint32_t refNum;
> + /*! Number of nodes that we depend on. */
> + uint32_t depNum;
> + /*! Register pressure. */
> + uint32_t regNum;
> /*! Cycle when the instruction is retired */
> uint32_t retiredCycle;
> bool preRetired;
> @@ -218,6 +225,8 @@ namespace gbe
> /*! Schedule the DAG, pre register allocation and post register allocation. */
> void preScheduleDAG(SelectionBlock &bb, int32_t insnNum);
> void postScheduleDAG(SelectionBlock &bb, int32_t insnNum);
> +
> + void computeRegPressure(ScheduleDAGNode *node, map<ScheduleDAGNode *, int32_t> &regPressureMap);
> /*! To limit register pressure or limit insn latency problems */
> SchedulePolicy policy;
> /*! Make ScheduleListNode allocation faster */
> @@ -277,6 +286,7 @@ namespace gbe
> ScheduleListNode *dep = scheduler.newScheduleListNode(node0, depMode);
> node0->refNum++;
> node1->children.push_back(dep);
> + node1->depNum++;
> auto it = deps.find(node0);
> if (it != deps.end()) {
> it->second.push_back(node1);
> @@ -605,8 +615,95 @@ namespace gbe
> return insnNum;
> }
>
> + /*! Will sort children by register pressure in increasing order */
> + inline bool cmp(const ScheduleDAGNode *v0, const ScheduleDAGNode *v1) {
> + return v0->regNum < v1->regNum;
> + }
> +
> + /* Recursively compute heuristic Sethi-Ullman number for each node. */
> + void SelectionScheduler::computeRegPressure(ScheduleDAGNode *node,
> + map<ScheduleDAGNode *, int32_t> &regPressureMap) {
> + if (regPressureMap.find(node) != regPressureMap.end()) {
> + GBE_ASSERT(node->regNum == (uint32_t)regPressureMap.find(node)->second);
> + return;
> + }
> + if (node->refNum == 0) {
> + node->regNum = 0;
> + regPressureMap.insert(std::make_pair(node, 0));
> + return;
> + }
> + auto &children = tracker.deps.find(node)->second;
> + for (auto child : children) {
> + computeRegPressure(child, regPressureMap);
> + }
> + std::sort(children.begin(), children.end(), cmp);
> + uint32_t maxRegNum = 0;
> + int32_t i = 0;
> + for (auto &child : children) {
> + if (child->regNum + children.size() - i > maxRegNum)
> + maxRegNum = child->regNum + children.size() - i;
> + ++i;
> + }
> + node->regNum = maxRegNum;
> + regPressureMap.insert(std::make_pair(node, maxRegNum));
> + return;
> + }
> +
> void SelectionScheduler::preScheduleDAG(SelectionBlock &bb, int32_t insnNum) {
> - printf("Not implemented yet. \n");
> + set<ScheduleDAGNode *> rootNodes;
> + for (int32_t i = 0; i < insnNum; i++) {
> + ScheduleDAGNode *node = tracker.insnNodes[i];
> + if (node->depNum == 0)
> + rootNodes.insert(node);
> + }
> + map<ScheduleDAGNode *, int32_t> regPressureMap;
> + map<ScheduleDAGNode *, int32_t> parentIndexMap;
> + for (auto node : rootNodes) {
> + computeRegPressure(node, regPressureMap);
> + parentIndexMap.insert(std::make_pair(node, INT_MAX));
> + }
> + set<ScheduleDAGNode *> readySet(rootNodes.begin(), rootNodes.end());
> + set<ScheduleDAGNode *> scheduledSet;
> + int32_t j = insnNum;
> +
> + // Now, start the scheduling.
> + // Each time, find the smallest pair (parentIndex[node], regPressure[node])
> + // as the best node to schedule.
> + while(readySet.size()) {
> + ScheduleDAGNode * bestNode = NULL;
> + int32_t minRegNum = INT_MAX;
> + int32_t minParentIndex = INT_MAX;
> + for(auto node : readySet) {
> + GBE_ASSERT(scheduledSet.contains(node) == false);
> + if (parentIndexMap.find(node)->second < minParentIndex) {
> + bestNode = node;
> + minParentIndex = parentIndexMap.find(node)->second;
> + minRegNum = regPressureMap.find(node)->second;
> + }
> + else if (parentIndexMap.find(node)->second == minParentIndex) {
> + if (regPressureMap.find(node)->second < minRegNum) {
> + bestNode = node;
> + minRegNum = regPressureMap.find(node)->second;
> + }
> + }
> + }
> + for( auto node : tracker.deps.find(bestNode)->second ) {
> + if (node == NULL)
> + continue;
> + node->depNum--;
> + if (parentIndexMap.find(node) != parentIndexMap.end())
> + parentIndexMap.find(node)->second = j;
> + else
> + parentIndexMap.insert(std::make_pair(node, j));
> + if (node->depNum == 0 && scheduledSet.contains(node) == false)
> + readySet.insert(node);
> + }
> + bb.prepend(&bestNode->insn);
> + readySet.erase(bestNode);
> + scheduledSet.insert(bestNode);
> + --j;
> + }
> + GBE_ASSERT(insnNum == (int32_t)bb.insnList.size());
> }
>
> void SelectionScheduler::postScheduleDAG(SelectionBlock &bb, int32_t insnNum) {
> @@ -714,8 +811,6 @@ namespace gbe
> void schedulePreRegAllocation(GenContext &ctx, Selection &selection) {
> if (OCL_PRE_ALLOC_INSN_SCHEDULE) {
> SelectionScheduler scheduler(ctx, selection, PRE_ALLOC);
> - // FIXME, need to implement proper pre reg allocation scheduling algorithm.
> - return;
> for (auto &bb : *selection.blockList) {
> const int32_t insnNum = scheduler.buildDAG(bb);
> bb.insnList.clear();
> --
> 1.9.1
>
> _______________________________________________
> Beignet mailing list
> Beignet at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet