Thanks for the reply!!<br><br> <div class="gmail_quote">2011/5/23 Siarhei Siamashka <span dir="ltr"><<a href="mailto:siarhei.siamashka@gmail.com">siarhei.siamashka@gmail.com</a>></span><br> <blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">On Mon, May 16, 2011 at 3:27 PM, Taekyun Kim <<a href="mailto:podain77@gmail.com">podain77@gmail.com</a>> wrote:<br> > Patch 1<br><br>Would it make sense to handle REPEAT_NORMAL optimizations separately<br>for nearest and bilinear scaling? I mean, you posted the benchmark<br>numbers for bilinear scaling, but nearest scaling does not seem to be<br> covered and it would be interesting to know whether it has actually<br>improved and how much. Currently nearest scaling is particularly<br>interesting on ARM devices without NEON instructions support because<br>it is likely that they can't afford the overhead of bilinear scaling<br> and have to fallback to nearest. Also there are some coding style<br>issues in the patches which are better to be addressed.<br></blockquote> <div> </div> <div> </div> <div>For sse2 and NEON fast path, nearest was also quite faster than general path. I will post some performance data for various nearest/bilinear cases later with your comments are applied.</div> <div> </div> <div>As you mentioned, C fast path is slower than before due to function call overhead and division inside the loop.</div> <div> </div> <div>To achieve REPEAT_NORMAL support, we can make scanline functions to support it(Plan A) or we can choose break-down-to-non-repeat approach(Plan B). Obviously the plan B causes function call overhead, so basically I think the plan A could be an optimal solution. But it requires a lot of changes in current fast path implementations.</div> <div> </div> <div>As an alternative, we can first take plan B (because we can benifit from already implemented sse2 and NEON fast paths) and later move to plan A one by one if it is requied. Following is my suggestion for doing that.</div> <div> </div> <div>Basically nearest/bilinear fast path templates take scanline_func as an argument. So we handle vertical repeat in macro templates.</div> <div>scanline_func can handle repeat inside of it or not. So we give following flags to macro templates which means whether the argument scanline_func support specific repeat mode or not.</div> <div> </div> <div> FAST_PATH_SCANLINE_SUPPORT_REPEAT_NONE</div> <div> FAST_PATH_SCANLINE_SUPPORT_REPEAT_PAD</div> <div> FAST_PATH_SCANLINE_SUPPORT_REPEAT_NORMAL</div> <div> FAST_PATH_SCANLINE_SUPPORT_REPEAT_REFLECT</div> <div> </div> <div>If all these flags are not turned on, scanline_func can process only SAMPLES_COVER_CLIP cases. And as you suggested, mask related boolean values can also be replaced with flags like</div> <div> </div> <div> FAST_PATH_HAVE_MASK</div> <div> FAST_PATH_MASK_IS_SOLID</div> <div> </div> <div>And additionally following flags can be useful for REPEAT_REFLECT and horizontal flipping.</div> <div> </div> <div> <div> FAST_PATH_SCANLINE_SUPPORT_NEGATIVE_X_UNIT</div> <div> </div> <div>So the FAST_NEAREST_MAINLOOP_INT would be something like this.</div> <div> </div></div> <div> FAST_NEAREST_MAINLOOP_INT(</div> <div> scale_func_name, scanline_func, </div> <div> src_type_t, mask_type_t, dst_type_t, repeat_mode, flag)</div> <div> </div> <div>If scanline_func supports repeat_mode, then we give unhandled vx, max_vx to scanline_func(plan A). If not, we can take plan B path (fallback). REPEAT_REFLECT is not easy to implementable right now, because scanline_func does not support horizontal flipping now (negative unit_x).</div> <div> </div> <div>And one more important thing is that current standard(non-scaled) fast paths do not have common templates. They usually handle vertical and horizontal loop in one function. REPEAT_NORMAL is relatively easy to implemented by invoking that function with height = 1. But PAD repeat is somewhat difficult to implement because the function does not take unit_x. It assumes that unit_x is always pixman_fixed_1. As you know, scaling functions can implement PAD repeat easily by giving 0 unit_x to scanline function.</div> <div> </div> <div>I've already done some REPEAT_NORMAL work for ARM standard fast paths and it shows great performance gain. But non-scaled PAD repeat(not SAMPLES_COVER_CLIP) still spend huge amount of time when browsing some popular web pages.</div> <div> </div> <blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">+ /* num_pixels = max(n), where vx + (n - 1)*unit_x < width_fixed */ \<br>+ num_pixels = (new_src_width_fixed - vx + unit_x - pixman_fixed_e) /<br> unit_x; \<br>+ if (num_pixels > width_remain)<br> \<br>+ num_pixels = width_remain;<br> \<br><br>I'm also a bit worried about this part. One thing that makes me feel a<br>bit uneasy is the following expression: "(new_src_width_fixed - vx +<br> unit_x - pixman_fixed_e)". Can we be sure that it can never overflow<br>pixman_fixed_t type? If yes, then some descriptive comment about the<br>valid range of values and assumptions would be useful here. There are<br> some checks in pixman code which reject certain range of parameters<br>for the compositing operation and impose some constraints:<br> <a href="http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n556" target="_blank">http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n556</a><br> <a href="http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n572" target="_blank">http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n572</a><br> <a href="http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n641" target="_blank">http://cgit.freedesktop.org/pixman/tree/pixman/pixman.c?id=pixman-0.22.0#n641</a><br> We have 'scaling-crash-test.c' file in test directory which tries to<br>do some nasty scaling related things and trigger some problems:<br> <a href="http://cgit.freedesktop.org/pixman/tree/test/scaling-crash-test.c?id=pixman-0.22.0" target="_blank">http://cgit.freedesktop.org/pixman/tree/test/scaling-crash-test.c?id=pixman-0.22.0</a><br> But we might need to extend it with some more test cases specifically<br>targeting this your expression, or maybe even do some random fuzzing.<br><br></blockquote> <div> </div> <div> </div> <div> </div> <div>We have to find max(n), where vx + n*unit_x < max_vx.</div> <div>This is same problem finding max(n) where n*b < c.</div> <div>So n = (b - 1)/c (in integer arithmetic, where b and c is positive)</div> <div>This passed scaling-test but I figured out that </div> <div>num_pixels = ((max_vx - vx - pixman_fixed_e) / unit_x) + 1</div> <div>is correct expression. (new_src_width_fixed is replaced with max_vx)</div> <div>max_vx - vx - pixman_fixed_e does not cause any overflows, because vx is positive and less than max_vx. I've added some comment about this.</div> <div> </div> <div> </div> <div> </div> <blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">One more possible problem here is that we get a division operation in<br>the inner loop, which may be performed as frequently as once per each<br> PIXMAN_REPEAT_NORMAL_MIN_WIDTH pixels (8 in your patch) and impact<br>performance. Division is particularly bad on ARM processors which<br>don't have a special instruction for it, and it is also relatively<br>slow for x86. If this division could be avoided by using some<br> Bresenham's alike algorithm, it probably would be a good improvement.<br><br>+ if (src_image->bits.width < PIXMAN_REPEAT_NORMAL_MIN_WIDTH) \<br>+ { \<br> + for (i=0; i<PIXMAN_REPEAT_NORMAL_MIN_WIDTH; ) \<br>+ { \<br>+ for (j=0; j<src_image->bits.width; j++, i++ ) \<br> + { \<br>+ expended_top[i] = src_line_top[j]; \<br>+ expended_bottom[i] = src_line_bottom[j]; \<br> + } \<br>+ } \<br>+ new_src_width = i; \<br> + src_line_top = &expended_top[0]; \<br>+ src_line_bottom = &expended_bottom[0]; \<br>+ } \<br> <br>Even though it's likely not a bit issue here, but at least in the case<br>of bilinear scaling these prepared extended source scanlines can be<br>reused across different iterations and not regenerated each times. In<br> the case if we have scale factor close to 1x, each source scanline is<br>going to be accessed approximately twice, for 2x upscaling - accessed<br>4x times, etc.<br></blockquote> <div> </div> <div> </div> <div> </div> <div>I've modified to reuse previous source scanline if it does not need update. <br> </div></div> <div>-- <br>Best Regards, </div> <div>Taekyun Kim</div><br>