[Mesa-dev] [PATCH] intel/fs: Optimize and simplify the copy propagation dataflow logic.

Wed Dec 20 14:47:41 UTC 2017

Hi,

On 20.12.2017 16:29, Eero Tamminen wrote:
> I got unexpected results with this, when testing it on BXT & SKL GT2.
> 
> While performance in GpuTest Volplosion and GfxBench v4 Tessellation 
> test improved slightly, performance in SynMark v7 CSDof and GpuTest v0.7 
> Piano dropped clearly.
> 
> Piano dropped only on SKL, but there the drop was >10%.

When looking at the generated shader assembly before and after, there's 
one huge shader for which Mesa is able to generated only SIMD8 assembly.

Before the patch:

SIMD8 shader: 2727 instructions. 5 loops. 470840 cycles. 0:0 
spills:fills. Promoted 58 constants. Compacted 43632 to 32560 bytes (25%)

After the patch:

SIMD8 shader: 4784 instructions. 5 loops. 808006 cycles. 212:367 
spills:fills. Promoted 9 constants. Compacted 76544 to 46192 bytes (40%)

Because shader spills i.e. there are no free regs, there are also more 
register bank conflicts.

	- Eero

 > There was also
 > a peculiar change in its CPU vs. GPU utilization at the start of the
 > test.  With this patch, there's 5-10s of 100% CPU usage before the test
 > starts, which extra CPU usage happens only with this patch.
 >
 > Attached is Callgrind output of where the extra time at startup goes.
 >
> (I didn't see that issue with the other tests which perf changed.)
> 
> 
>      - Eero
> 
> On 20.12.2017 04:19, Francisco Jerez wrote:
>> Previously the dataflow propagation algorithm would calculate the ACP
>> live-in and -out sets in a two-pass fixed-point algorithm.  The first
>> pass would update the live-out sets of all basic blocks of the program
>> based on their live-in sets, while the second pass would update the
>> live-in sets based on the live-out sets.  This is incredibly
>> inefficient in the typical case where the CFG of the program is
>> approximately acyclic, because it can take up to 2*n passes for an ACP
>> entry introduced at the top of the program to reach the bottom (where
>> n is the number of basic blocks in the program), until which point the
>> algorithm won't be able to reach a fixed point.
>>
>> The same effect can be achieved in a single pass by computing the
>> live-in and -out sets in lock-step, because that makes sure that
>> processing of any basic block will pick up the updated live-out sets
>> of the lexically preceding blocks.  This gives the dataflow
>> propagation algorithm effectively O(n) run-time instead of O(n^2) in
>> the acyclic case.
>>
>> The time spent in dataflow propagation is reduced by 30x in the
>> GLES31.functional.ssbo.layout.random.all_shared_buffer.5 dEQP
>> test-case on my CHV system (the improvement is likely to be of the
>> same order of magnitude on other platforms).  This more than reverses
>> an apparent run-time regression in this test-case from my previous
>> copy-propagation undefined-value handling patch, which was ultimately
>> caused by the additional work introduced in that commit to account for
>> undefined values being multiplied by a huge quadratic factor.
>>
>> According to Chad this test was failing on CHV due to a 30s time-out
>> imposed by the Android CTS (this was the case regardless of my
>> undefined-value handling patch, even though my patch substantially
>> exacerbated the issue).  On my CHV system this patch reduces the
>> overall run-time of the test by approximately 12x, getting us to
>> around 13s, well below the time-out.
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=104271
>> Reported-by: Chad Versace <chadversary at chromium.org>
>> ---
>>   src/intel/compiler/brw_fs_copy_propagation.cpp | 30 
>> ++++++++------------------
>>   1 file changed, 9 insertions(+), 21 deletions(-)
>>
>> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp 
>> b/src/intel/compiler/brw_fs_copy_propagation.cpp
>> index af5635eacef..8d6ab18cd13 100644
>> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp
>> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
>> @@ -228,34 +228,17 @@ fs_copy_prop_dataflow::run()
>>      do {
>>         progress = false;
>> -      /* Update liveout for all blocks. */
>>         foreach_block (block, cfg) {
>>            if (block->parents.is_empty())
>>               continue;
>>            for (int i = 0; i < bitset_words; i++) {
>>               const BITSET_WORD old_liveout = bd[block->num].liveout[i];
>> -
>> -            bd[block->num].liveout[i] =
>> -               bd[block->num].copy[i] | (bd[block->num].livein[i] &
>> -                                         ~bd[block->num].kill[i]);
>> -
>> -            if (old_liveout != bd[block->num].liveout[i])
>> -               progress = true;
>> -         }
>> -      }
>> -
>> -      /* Update livein for all blocks.  If a copy is live out of all 
>> parent
>> -       * blocks, it's live coming in to this block.
>> -       */
>> -      foreach_block (block, cfg) {
>> -         if (block->parents.is_empty())
>> -            continue;
>> -
>> -         for (int i = 0; i < bitset_words; i++) {
>> -            const BITSET_WORD old_livein = bd[block->num].livein[i];
>>               BITSET_WORD livein_from_any_block = 0;
>> +            /* Update livein for this block.  If a copy is live out 
>> of all
>> +             * parent blocks, it's live coming in to this block.
>> +             */
>>               bd[block->num].livein[i] = ~0u;
>>               foreach_list_typed(bblock_link, parent_link, link, 
>> &block->parents) {
>>                  bblock_t *parent = parent_link->block;
>> @@ -278,7 +261,12 @@ fs_copy_prop_dataflow::run()
>>                */
>>               bd[block->num].livein[i] &= livein_from_any_block;
>> -            if (old_livein != bd[block->num].livein[i])
>> +            /* Update liveout for this block. */
>> +            bd[block->num].liveout[i] =
>> +               bd[block->num].copy[i] | (bd[block->num].livein[i] &
>> +                                         ~bd[block->num].kill[i]);
>> +
>> +            if (old_liveout != bd[block->num].liveout[i])
>>                  progress = true;
>>            }
>>         }
>>
>