<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Thanks Petr for reviewing. On an Intel hardware, having one loop with storeu and a tail section to handle the remaining pixels doesn’t seem to show any performance
 benefits for these operators (OVER, ROVER, ADD and ROUT) but certainly adds a lot more code. The masked operations is very useful in that aspect, where I can avoid having a separate section for tail. I don’t have access to an AMD hardware, so I won’t be able
 to tell how this performs on an AMD. <o:p></o:p></span></p>
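<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">For context, here is roughly what the tail handling looks like with the masked operations: a simplified, untested sketch of the ADD path (the helper name below is just for illustration; the actual code is in the patch):<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">static void<br>
sketch_combine_add (uint32_t *pd, const uint32_t *ps, int w)<br>
{<br>
    /* Process 8 pixels per iteration; the final partial group reuses the<br>
       same vector path with a per-lane mask instead of a scalar tail loop. */<br>
    __m256i mask = _mm256_set1_epi32 (-1);<br>
<br>
    while (w > 0)<br>
    {<br>
        if (w < 8)<br>
            mask = get_partial_256_data_mask (w, 8);<br>
<br>
        __m256i src = _mm256_maskload_epi32 ((const int *) ps, mask);<br>
        __m256i dst = _mm256_maskload_epi32 ((const int *) pd, mask);<br>
<br>
        /* Saturating per-channel add, then store only the selected lanes. */<br>
        _mm256_maskstore_epi32 ((int *) pd, mask,<br>
                                _mm256_adds_epu8 (src, dst));<br>
        ps += 8;<br>
        pd += 8;<br>
        w  -= 8;<br>
    }<br>
}<o:p></o:p></span></p>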
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">And I agree with the suggestion of changing the array to a static const. I will incorporate that change.
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D">Raghuveer<o:p></o:p></span></p>
<p class="MsoNormal"><a name="_MailEndCompose"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D"><o:p> </o:p></span></a></p>
<p class="MsoNormal"><a name="_____replyseparator"></a><b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif">From:</span></b><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> Petr Kobalíček [mailto:kobalicek.petr@gmail.com]
<br>
<b>Sent:</b> Tuesday, March 19, 2019 3:14 AM<br>
<b>To:</b> Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com><br>
<b>Cc:</b> pixman@lists.freedesktop.org<br>
<b>Subject:</b> Re: [Pixman] [PATCH 2/2] AVX2 implementation of OVER, ROVER, ADD, ROUT operators.<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">static force_inline __m256i<br>
get_partial_256_data_mask (const int num_elem, const int total_elem)<br>
{<br>
    int maskint[16] = {-1,-1,-1,-1,-1,-1,-1,-1,1, 1, 1, 1, 1, 1, 1, 1};<br>
    int* addr = maskint + total_elem - num_elem;<br>
    return _mm256_loadu_si256 ((__m256i*) addr);<br>
}<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I would prefer `static const int` (if you don't want to construct such array every time) and changing `1` to `0` to make the intention of masking clear. Additionally, I think somebody should benchmark this on AMD hardware as it's not clear
 to me if `vpmaskmovd` would not make things worse especially in `_mm256_maskstore_epi32` case. I mean why not to have one loop and a tail section that would handle the remaining pixels? I would go this way in any case.<o:p></o:p></p>
</div>
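<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Roughly what I have in mind (an untested sketch, only to illustrate the `static const` table with zeros in the masked-out lanes):<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">static force_inline __m256i<br>
get_partial_256_data_mask (const int num_elem, const int total_elem)<br>
{<br>
    /* Sign bit set: lane is loaded/stored by vpmaskmovd; zero: lane is skipped. */<br>
    static const int maskint[16] = {-1,-1,-1,-1,-1,-1,-1,-1, 0, 0, 0, 0, 0, 0, 0, 0};<br>
    return _mm256_loadu_si256 ((const __m256i*) (maskint + total_elem - num_elem));<br>
}<o:p></o:p></p>
</div>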
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Intel documentation doesn't state the latency of `vpmaskmovd`:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">  <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_maskstore_epi32&expand=3525">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_maskstore_epi32&expand=3525</a><br>
<br>
And Agner Fog's instruction tables are also missing this instruction in some cases (AMD Ryzen, for example):<br>
  <a href="https://www.agner.org/optimize/instruction_tables.pdf">https://www.agner.org/optimize/instruction_tables.pdf</a><br>
<br>
And my own AsmGrid is missing that one too (although I will add it soon):<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">  <a href="https://asmjit.com/asmgrid/">https://asmjit.com/asmgrid/</a><br>
<br>
I know that mentioning AMD to somebody who works at Intel probably doesn't make much sense, but since this change would affect the performance of AMD users too, I think it would be fair to check that it doesn't regress there before merging.<br>
<br>
Thank you,<br>
- Petr<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Sun, Mar 17, 2019 at 5:20 PM Raghuveer Devulapalli <<a href="mailto:raghuveer.devulapalli@intel.com">raghuveer.devulapalli@intel.com</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-right:0in">
<p class="MsoNormal">From: raghuveer devulapalli <<a href="mailto:raghuveer.devulapalli@intel.com" target="_blank">raghuveer.devulapalli@intel.com</a>><br>
<br>
Performance benefits were benchmarked on an Intel(R) Core(TM) i9-7900 CPU @<br>
3.30GHz. These patches improve the performance of both high-level and low-level<br>
benchmarks.<br>
<br>
|-----------------------+----------+------------+----------+------------+----------|<br>
| cairo-perf-trace      | AVX2 Avg | AVX2 stdev | SSE2 Avg | SSE2 stdev | % change |<br>
|-----------------------+----------+------------+----------+------------+----------|<br>
| poppler               | 1.125s   | 0.30%      | 1.284s   | 0.19%      | +14.13%  |<br>
| firefox-canvas-scroll | 2.503s   | 0.21%      | 2.853s   | 0.22%      | +13.98%  |<br>
|-----------------------+----------+------------+----------+------------+----------|<br>
<br>
|--------------------+---------+---------+----------|<br>
| lowlevel-blt-bench | AVX2    | SSE2    | % change |<br>
|--------------------+---------+---------+----------|<br>
| OVER_8888_8888 L1  | 2118.06 | 1250.50 | +69.37%  |<br>
| OVER_8888_8888 L2  | 1967.59 | 1245.87 | +57.90%  |<br>
| OVER_8888_8888 M   | 1694.10 | 1183.65 | +43.12%  |<br>
| OVER_8888_8888 HT  | 562.82  | 556.45  | +01.11%  |<br>
| OVER_8888_8888 VT  | 411.19  | 349.78  | +17.56%  |<br>
| OVER_8888_8888 R   | 369.46  | 332.09  | +11.25%  |<br>
| OVER_8888_8888 RT  | 172.36  | 151.37  | +13.86%  |<br>
|--------------------+---------+---------+----------|<br>
---<br>
 pixman/pixman-avx2.c | 635 ++++++++++++++++++++++++++++++++++++++++++-<br>
 1 file changed, 634 insertions(+), 1 deletion(-)<br>
<br>
diff --git a/pixman/pixman-avx2.c b/pixman/pixman-avx2.c<br>
index d860d67..bd1817e 100644<br>
--- a/pixman/pixman-avx2.c<br>
+++ b/pixman/pixman-avx2.c<br>
@@ -7,13 +7,641 @@<br>
 #include "pixman-combine32.h"<br>
 #include "pixman-inlines.h"<br>
<br>
+#define MASK_0080_AVX2 _mm256_set1_epi16 (0x0080)<br>
+#define MASK_00FF_AVX2 _mm256_set1_epi16 (0x00ff)<br>
+#define MASK_0101_AVX2 _mm256_set1_epi16 (0x0101)<br>
+<br>
+static force_inline __m256i<br>
+load_256_unaligned_masked (int* src, __m256i mask)<br>
+{<br>
+    return _mm256_maskload_epi32 (src, mask);<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+get_partial_256_data_mask (const int num_elem, const int total_elem)<br>
+{<br>
+    int maskint[16] = {-1,-1,-1,-1,-1,-1,-1,-1,1, 1, 1, 1, 1, 1, 1, 1};<br>
+    int* addr = maskint + total_elem - num_elem;<br>
+    return _mm256_loadu_si256 ((__m256i*) addr);<br>
+}<br>
+<br>
+static force_inline void<br>
+negate_2x256 (__m256i  data_lo,<br>
+             __m256i  data_hi,<br>
+             __m256i* neg_lo,<br>
+             __m256i* neg_hi)<br>
+{<br>
+    *neg_lo = _mm256_xor_si256 (data_lo, MASK_00FF_AVX2);<br>
+    *neg_hi = _mm256_xor_si256 (data_hi, MASK_00FF_AVX2);<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+pack_2x256_256 (__m256i lo, __m256i hi)<br>
+{<br>
+    return _mm256_packus_epi16 (lo, hi);<br>
+}<br>
+<br>
+static force_inline void<br>
+pix_multiply_2x256 (__m256i* data_lo,<br>
+                   __m256i* data_hi,<br>
+                   __m256i* alpha_lo,<br>
+                   __m256i* alpha_hi,<br>
+                   __m256i* ret_lo,<br>
+                   __m256i* ret_hi)<br>
+{<br>
+    __m256i lo, hi;<br>
+<br>
+    lo = _mm256_mullo_epi16 (*data_lo, *alpha_lo);<br>
+    hi = _mm256_mullo_epi16 (*data_hi, *alpha_hi);<br>
+    lo = _mm256_adds_epu16 (lo, MASK_0080_AVX2);<br>
+    hi = _mm256_adds_epu16 (hi, MASK_0080_AVX2);<br>
+    *ret_lo = _mm256_mulhi_epu16 (lo, MASK_0101_AVX2);<br>
+    *ret_hi = _mm256_mulhi_epu16 (hi, MASK_0101_AVX2);<br>
+}<br>
+<br>
+static force_inline void<br>
+over_2x256 (__m256i* src_lo,<br>
+           __m256i* src_hi,<br>
+           __m256i* alpha_lo,<br>
+           __m256i* alpha_hi,<br>
+           __m256i* dst_lo,<br>
+           __m256i* dst_hi)<br>
+{<br>
+    __m256i t1, t2;<br>
+<br>
+    negate_2x256 (*alpha_lo, *alpha_hi, &t1, &t2);<br>
+    pix_multiply_2x256 (dst_lo, dst_hi, &t1, &t2, dst_lo, dst_hi);<br>
+<br>
+    *dst_lo = _mm256_adds_epu8 (*src_lo, *dst_lo);<br>
+    *dst_hi = _mm256_adds_epu8 (*src_hi, *dst_hi);<br>
+}<br>
+<br>
+static force_inline void<br>
+expand_alpha_2x256 (__m256i  data_lo,<br>
+                   __m256i  data_hi,<br>
+                   __m256i* alpha_lo,<br>
+                   __m256i* alpha_hi)<br>
+{<br>
+    __m256i lo, hi;<br>
+<br>
+    lo = _mm256_shufflelo_epi16 (data_lo, _MM_SHUFFLE (3, 3, 3, 3));<br>
+    hi = _mm256_shufflelo_epi16 (data_hi, _MM_SHUFFLE (3, 3, 3, 3));<br>
+<br>
+    *alpha_lo = _mm256_shufflehi_epi16 (lo, _MM_SHUFFLE (3, 3, 3, 3));<br>
+    *alpha_hi = _mm256_shufflehi_epi16 (hi, _MM_SHUFFLE (3, 3, 3, 3));<br>
+}<br>
+<br>
+static force_inline  void<br>
+unpack_256_2x256 (__m256i data, __m256i* data_lo, __m256i* data_hi)<br>
+{<br>
+    *data_lo = _mm256_unpacklo_epi8 (data, _mm256_setzero_si256 ());<br>
+    *data_hi = _mm256_unpackhi_epi8 (data, _mm256_setzero_si256 ());<br>
+}<br>
+<br>
+static force_inline void<br>
+save_256_unaligned (__m256i* dst, __m256i data)<br>
+{<br>
+    _mm256_storeu_si256 (dst, data);<br>
+}<br>
+<br>
+static force_inline void<br>
+save_256_unaligned_masked (int* dst, __m256i mask, __m256i data)<br>
+{<br>
+    _mm256_maskstore_epi32 (dst, mask, data);<br>
+}<br>
+<br>
+static force_inline int<br>
+is_opaque_256 (__m256i x)<br>
+{<br>
+    __m256i ffs = _mm256_cmpeq_epi8 (x, x);<br>
+    return (_mm256_movemask_epi8<br>
+           (_mm256_cmpeq_epi8 (x, ffs)) & 0x88888888) == 0x88888888;<br>
+}<br>
+<br>
+static force_inline int<br>
+is_zero_256 (__m256i x)<br>
+{<br>
+    return _mm256_movemask_epi8 (<br>
+       _mm256_cmpeq_epi8 (x, _mm256_setzero_si256 ())) == 0xffffffff;<br>
+}<br>
+<br>
+static force_inline int<br>
+is_transparent_256 (__m256i x)<br>
+{<br>
+    return (_mm256_movemask_epi8 (<br>
+                _mm256_cmpeq_epi8 (x, _mm256_setzero_si256 ())) & 0x88888888)<br>
+                    == 0x88888888;<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+load_256_unaligned (const __m256i* src)<br>
+{<br>
+    return _mm256_loadu_si256 (src);<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+combine8 (const __m256i *ps, const __m256i *pm)<br>
+{<br>
+    __m256i ymm_src_lo, ymm_src_hi;<br>
+    __m256i ymm_msk_lo, ymm_msk_hi;<br>
+    __m256i s;<br>
+<br>
+    if (pm)<br>
+    {<br>
+        ymm_msk_lo = load_256_unaligned (pm);<br>
+        if (is_transparent_256 (ymm_msk_lo))<br>
+        {<br>
+            return _mm256_setzero_si256 ();<br>
+        }<br>
+    }<br>
+<br>
+    s = load_256_unaligned (ps);<br>
+<br>
+    if (pm)<br>
+    {<br>
+           unpack_256_2x256 (s, &ymm_src_lo, &ymm_src_hi);<br>
+           unpack_256_2x256 (ymm_msk_lo, &ymm_msk_lo, &ymm_msk_hi);<br>
+<br>
+           expand_alpha_2x256 (ymm_msk_lo, ymm_msk_hi,<br>
+                                &ymm_msk_lo, &ymm_msk_hi);<br>
+<br>
+           pix_multiply_2x256 (&ymm_src_lo, &ymm_src_hi,<br>
+                                &ymm_msk_lo, &ymm_msk_hi,<br>
+                                &ymm_src_lo, &ymm_src_hi);<br>
+           s = pack_2x256_256 (ymm_src_lo, ymm_src_hi);<br>
+    }<br>
+    return s;<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+expand_alpha_1x256 (__m256i data)<br>
+{<br>
+    return _mm256_shufflehi_epi16 (_mm256_shufflelo_epi16 (data,<br>
+                                        _MM_SHUFFLE (3, 3, 3, 3)),<br>
+                                            _MM_SHUFFLE (3, 3, 3, 3));<br>
+}<br>
+<br>
+static force_inline void<br>
+expand_alpha_rev_2x256 (__m256i  data_lo,<br>
+                        __m256i  data_hi,<br>
+                        __m256i* alpha_lo,<br>
+                        __m256i* alpha_hi)<br>
+{<br>
+    __m256i lo, hi;<br>
+<br>
+    lo = _mm256_shufflelo_epi16 (data_lo, _MM_SHUFFLE (0, 0, 0, 0));<br>
+    hi = _mm256_shufflelo_epi16 (data_hi, _MM_SHUFFLE (0, 0, 0, 0));<br>
+<br>
+    *alpha_lo = _mm256_shufflehi_epi16 (lo, _MM_SHUFFLE (0, 0, 0, 0));<br>
+    *alpha_hi = _mm256_shufflehi_epi16 (hi, _MM_SHUFFLE (0, 0, 0, 0));<br>
+}<br>
+<br>
+static force_inline void<br>
+in_over_2x256 (__m256i* src_lo,<br>
+               __m256i* src_hi,<br>
+               __m256i* alpha_lo,<br>
+               __m256i* alpha_hi,<br>
+               __m256i* mask_lo,<br>
+               __m256i* mask_hi,<br>
+               __m256i* dst_lo,<br>
+               __m256i* dst_hi)<br>
+{<br>
+    __m256i s_lo, s_hi;<br>
+    __m256i a_lo, a_hi;<br>
+<br>
+    pix_multiply_2x256 (src_lo,   src_hi, mask_lo, mask_hi, &s_lo, &s_hi);<br>
+    pix_multiply_2x256 (alpha_lo, alpha_hi, mask_lo, mask_hi, &a_lo, &a_hi);<br>
+<br>
+    over_2x256 (&s_lo, &s_hi, &a_lo, &a_hi, dst_lo, dst_hi);<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+create_mask_2x32_256 (uint32_t mask0,<br>
+                      uint32_t mask1)<br>
+{<br>
+    return _mm256_set_epi32 (mask0, mask1, mask0, mask1,<br>
+                             mask0, mask1, mask0, mask1);<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+unpack_32_1x256 (uint32_t data)<br>
+{<br>
+    return _mm256_unpacklo_epi8 (<br>
+                _mm256_broadcastd_epi32 (<br>
+                    _mm_cvtsi32_si128 (data)), _mm256_setzero_si256 ());<br>
+}<br>
+<br>
+static force_inline __m256i<br>
+expand_pixel_32_1x256 (uint32_t data)<br>
+{<br>
+    return _mm256_shuffle_epi32 (unpack_32_1x256 (data),<br>
+                                    _MM_SHUFFLE (1, 0, 1, 0));<br>
+}<br>
+<br>
+static force_inline void<br>
+core_combine_over_u_avx2_mask (uint32_t *         pd,<br>
+                               const uint32_t*   ps,<br>
+                               const uint32_t*   pm,<br>
+                               int               w)<br>
+{<br>
+    __m256i data_mask, mask;<br>
+    data_mask = _mm256_set1_epi32 (-1);<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8)<br>
+        {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+        mask = load_256_unaligned_masked ((int *)pm, data_mask);<br>
+<br>
+        if (!is_zero_256 (mask))<br>
+        {<br>
+            __m256i src, dst;<br>
+            __m256i src_hi, src_lo;<br>
+            __m256i dst_hi, dst_lo;<br>
+            __m256i mask_hi, mask_lo;<br>
+            __m256i alpha_hi, alpha_lo;<br>
+<br>
+            src = load_256_unaligned_masked ((int *)ps, data_mask);<br>
+<br>
+            if (is_opaque_256 (_mm256_and_si256 (src, mask)))<br>
+            {<br>
+                save_256_unaligned_masked ((int *)pd, data_mask, src);<br>
+            }<br>
+            else<br>
+            {<br>
+                dst = load_256_unaligned_masked ((int *)pd, data_mask);<br>
+<br>
+                unpack_256_2x256 (mask, &mask_lo, &mask_hi);<br>
+                unpack_256_2x256 (src, &src_lo, &src_hi);<br>
+<br>
+                expand_alpha_2x256 (mask_lo, mask_hi, &mask_lo, &mask_hi);<br>
+<br>
+                pix_multiply_2x256 (&src_lo, &src_hi,<br>
+                                    &mask_lo, &mask_hi,<br>
+                                    &src_lo, &src_hi);<br>
+<br>
+                unpack_256_2x256 (dst, &dst_lo, &dst_hi);<br>
+                expand_alpha_2x256 (src_lo, src_hi,<br>
+                                    &alpha_lo, &alpha_hi);<br>
+<br>
+                over_2x256 (&src_lo, &src_hi, &alpha_lo, &alpha_hi,<br>
+                            &dst_lo, &dst_hi);<br>
+<br>
+                save_256_unaligned_masked ((int *)pd, data_mask,<br>
+                                          pack_2x256_256 (dst_lo, dst_hi));<br>
+            }<br>
+        }<br>
+        pm += 8;<br>
+        ps += 8;<br>
+        pd += 8;<br>
+        w -= 8;<br>
+    }<br>
+}<br>
+<br>
+static force_inline void<br>
+core_combine_over_u_avx2_no_mask (uint32_t *           pd,<br>
+                                  const uint32_t*       ps,<br>
+                                  int                   w)<br>
+{<br>
+    __m256i src, dst;<br>
+    __m256i src_hi, src_lo, dst_hi, dst_lo;<br>
+    __m256i alpha_hi, alpha_lo;<br>
+    __m256i data_mask = _mm256_set1_epi32 (-1);<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8) {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+        src = load_256_unaligned_masked ((int*)ps, data_mask);<br>
+<br>
+        if (!is_zero_256 (src))<br>
+        {<br>
+            if (is_opaque_256 (src))<br>
+            {<br>
+                save_256_unaligned_masked ((int*)pd, data_mask, src);<br>
+            }<br>
+            else<br>
+            {<br>
+                dst = load_256_unaligned_masked ((int*)pd, data_mask);<br>
+<br>
+                unpack_256_2x256 (src, &src_lo, &src_hi);<br>
+                unpack_256_2x256 (dst, &dst_lo, &dst_hi);<br>
+<br>
+                expand_alpha_2x256 (src_lo, src_hi,<br>
+                                    &alpha_lo, &alpha_hi);<br>
+                over_2x256 (&src_lo, &src_hi, &alpha_lo, &alpha_hi,<br>
+                            &dst_lo, &dst_hi);<br>
+<br>
+                save_256_unaligned_masked ((int*)pd, data_mask,<br>
+                                          pack_2x256_256 (dst_lo, dst_hi));<br>
+            }<br>
+        }<br>
+<br>
+        ps += 8;<br>
+        pd += 8;<br>
+        w -= 8;<br>
+    }<br>
+}<br>
+<br>
+static force_inline void<br>
+avx2_combine_over_u (pixman_implementation_t *imp,<br>
+                    pixman_op_t              op,<br>
+                    uint32_t *               pd,<br>
+                    const uint32_t *         ps,<br>
+                    const uint32_t *         pm,<br>
+                    int                      w)<br>
+{<br>
+    if (pm)<br>
+        core_combine_over_u_avx2_mask (pd, ps, pm, w);<br>
+    else<br>
+        core_combine_over_u_avx2_no_mask (pd, ps, w);<br>
+}<br>
+<br>
+static force_inline void<br>
+avx2_combine_add_u (pixman_implementation_t *imp,<br>
+                    pixman_op_t              op,<br>
+                    uint32_t *               dst,<br>
+                    const uint32_t *         src,<br>
+                    const uint32_t *         mask,<br>
+                    int                      width)<br>
+{<br>
+    uint32_t* pd = dst;<br>
+    const uint32_t* ps = src;<br>
+    const uint32_t* pm = mask;<br>
+    int w = width;<br>
+    __m256i data_mask = _mm256_set1_epi32 (-1);<br>
+    __m256i s;<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8) {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+       s = combine8 ((__m256i*)ps, (__m256i*)pm);<br>
+<br>
+       save_256_unaligned_masked ((int*)pd, data_mask,<br>
+                        _mm256_adds_epu8 (s,<br>
+                            load_256_unaligned_masked ((int*)pd, data_mask)));<br>
+<br>
+       pd += 8;<br>
+       ps += 8;<br>
+       if (pm)<br>
+           pm += 8;<br>
+       w -= 8;<br>
+    }<br>
+}<br>
+<br>
+static void<br>
+avx2_composite_add_8888_8888 (pixman_implementation_t *imp,<br>
+                              pixman_composite_info_t *info)<br>
+{<br>
+    PIXMAN_COMPOSITE_ARGS (info);<br>
+    uint32_t    *dst_line, *dst;<br>
+    uint32_t    *src_line, *src;<br>
+    int dst_stride, src_stride;<br>
+<br>
+    PIXMAN_IMAGE_GET_LINE (<br>
+       src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);<br>
+    PIXMAN_IMAGE_GET_LINE (<br>
+       dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);<br>
+<br>
+    while (height--)<br>
+    {<br>
+       dst = dst_line;<br>
+       dst_line += dst_stride;<br>
+       src = src_line;<br>
+       src_line += src_stride;<br>
+<br>
+       avx2_combine_add_u (imp, op, dst, src, NULL, width);<br>
+    }<br>
+}<br>
+<br>
+static void<br>
+avx2_combine_over_reverse_u (pixman_implementation_t   *imp,<br>
+                             pixman_op_t               op,<br>
+                             uint32_t *                        pd,<br>
+                             const uint32_t *          ps,<br>
+                             const uint32_t *          pm,<br>
+                             int                       w)<br>
+{<br>
+    __m256i ymm_dst_lo, ymm_dst_hi;<br>
+    __m256i ymm_src_lo, ymm_src_hi;<br>
+    __m256i ymm_alpha_lo, ymm_alpha_hi;<br>
+    __m256i data_mask = _mm256_set1_epi32 (-1);<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8) {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+        ymm_src_hi = combine8 ((__m256i*)ps, (__m256i*)pm);<br>
+        ymm_dst_hi = load_256_unaligned_masked ((int *) pd, data_mask);<br>
+<br>
+        unpack_256_2x256 (ymm_src_hi, &ymm_src_lo, &ymm_src_hi);<br>
+        unpack_256_2x256 (ymm_dst_hi, &ymm_dst_lo, &ymm_dst_hi);<br>
+<br>
+        expand_alpha_2x256 (ymm_dst_lo, ymm_dst_hi,<br>
+                            &ymm_alpha_lo, &ymm_alpha_hi);<br>
+<br>
+        over_2x256 (&ymm_dst_lo, &ymm_dst_hi,<br>
+                    &ymm_alpha_lo, &ymm_alpha_hi,<br>
+                    &ymm_src_lo, &ymm_src_hi);<br>
+<br>
+        save_256_unaligned_masked ((int *)pd, data_mask,<br>
+                              pack_2x256_256 (ymm_src_lo, ymm_src_hi));<br>
+<br>
+        w -= 8;<br>
+        ps += 8;<br>
+        pd += 8;<br>
+        if (pm)<br>
+            pm += 8;<br>
+    }<br>
+}<br>
+<br>
+static void<br>
+avx2_composite_over_reverse_n_8888 (pixman_implementation_t *imp,<br>
+                                    pixman_composite_info_t *info)<br>
+{<br>
+    PIXMAN_COMPOSITE_ARGS (info);<br>
+    uint32_t src;<br>
+    uint32_t    *dst_line, *dst;<br>
+    __m256i xmm_src;<br>
+    __m256i xmm_dst, xmm_dst_lo, xmm_dst_hi;<br>
+    __m256i xmm_dsta_hi, xmm_dsta_lo;<br>
+    __m256i data_mask;<br>
+    __m256i tmp_lo, tmp_hi;<br>
+    int dst_stride;<br>
+    int32_t w;<br>
+<br>
+    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);<br>
+<br>
+    if (src == 0)<br>
+        return;<br>
+<br>
+    PIXMAN_IMAGE_GET_LINE (<br>
+            dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);<br>
+<br>
+    xmm_src = expand_pixel_32_1x256 (src);<br>
+<br>
+    while (height--)<br>
+    {<br>
+        dst = dst_line;<br>
+        dst_line += dst_stride;<br>
+        w = width;<br>
+        data_mask = _mm256_set1_epi32 (-1);<br>
+<br>
+        while (w > 0)<br>
+        {<br>
+            if (w < 8) {<br>
+                data_mask = get_partial_256_data_mask (w, 8);<br>
+            }<br>
+<br>
+            xmm_dst = load_256_unaligned_masked ((int*)dst, data_mask);<br>
+<br>
+            unpack_256_2x256 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);<br>
+            expand_alpha_2x256 (xmm_dst_lo, xmm_dst_hi,<br>
+                                &xmm_dsta_lo, &xmm_dsta_hi);<br>
+<br>
+            tmp_lo = xmm_src;<br>
+            tmp_hi = xmm_src;<br>
+<br>
+            over_2x256 (&xmm_dst_lo, &xmm_dst_hi,<br>
+                        &xmm_dsta_lo, &xmm_dsta_hi,<br>
+                        &tmp_lo, &tmp_hi);<br>
+<br>
+            save_256_unaligned_masked ((int*)dst, data_mask,<br>
+                                        pack_2x256_256 (tmp_lo, tmp_hi));<br>
+            w -= 8;<br>
+            dst += 8;<br>
+        }<br>
+    }<br>
+}<br>
+<br>
+static void<br>
+avx2_composite_over_8888_8888 (pixman_implementation_t *imp,<br>
+                               pixman_composite_info_t *info)<br>
+{<br>
+    PIXMAN_COMPOSITE_ARGS (info);<br>
+    int dst_stride, src_stride;<br>
+    uint32_t    *dst_line, *dst;<br>
+    uint32_t    *src_line, *src;<br>
+<br>
+    PIXMAN_IMAGE_GET_LINE (<br>
+            dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);<br>
+    PIXMAN_IMAGE_GET_LINE (<br>
+            src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);<br>
+<br>
+    dst = dst_line;<br>
+    src = src_line;<br>
+<br>
+    while (height--)<br>
+    {<br>
+        avx2_combine_over_u (imp, op, dst, src, NULL, width);<br>
+        dst += dst_stride;<br>
+        src += src_stride;<br>
+    }<br>
+}<br>
+<br>
+static uint32_t *<br>
+avx2_fetch_x8r8g8b8 (pixman_iter_t *iter, const uint32_t *mask)<br>
+{<br>
+    int w = iter->width;<br>
+    __m256i ff000000 = create_mask_2x32_256 (0xff000000, 0xff000000);<br>
+    __m256i data_mask = _mm256_set1_epi32 (-1);<br>
+    uint32_t *dst = iter->buffer;<br>
+    uint32_t *src = (uint32_t *)iter->bits;<br>
+<br>
+    iter->bits += iter->stride;<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8) {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+        save_256_unaligned_masked ((int *)dst, data_mask,<br>
+                                            _mm256_or_si256 (<br>
+                                                load_256_unaligned (<br>
+                                                    (__m256i *)src), ff000000));<br>
+        dst += 8;<br>
+        src += 8;<br>
+        w -= 8;<br>
+    }<br>
+    return iter->buffer;<br>
+}<br>
+<br>
+static void<br>
+avx2_combine_out_reverse_u (pixman_implementation_t *imp,<br>
+                            pixman_op_t              op,<br>
+                            uint32_t *               pd,<br>
+                            const uint32_t *         ps,<br>
+                            const uint32_t *         pm,<br>
+                            int                      w)<br>
+{<br>
+    __m256i xmm_src_lo, xmm_src_hi;<br>
+    __m256i xmm_dst_lo, xmm_dst_hi;<br>
+    __m256i data_mask = _mm256_set1_epi32 (-1);<br>
+<br>
+    while (w > 0)<br>
+    {<br>
+        if (w < 8) {<br>
+            data_mask = get_partial_256_data_mask (w, 8);<br>
+        }<br>
+<br>
+       xmm_src_hi = combine8 ((__m256i*)ps, (__m256i*)pm);<br>
+       xmm_dst_hi = load_256_unaligned_masked ((int*) pd, data_mask);<br>
+<br>
+       unpack_256_2x256 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);<br>
+       unpack_256_2x256 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);<br>
+<br>
+       expand_alpha_2x256 (xmm_src_lo, xmm_src_hi, &xmm_src_lo, &xmm_src_hi);<br>
+       negate_2x256       (xmm_src_lo, xmm_src_hi, &xmm_src_lo, &xmm_src_hi);<br>
+<br>
+       pix_multiply_2x256 (&xmm_dst_lo, &xmm_dst_hi,<br>
+                           &xmm_src_lo, &xmm_src_hi,<br>
+                           &xmm_dst_lo, &xmm_dst_hi);<br>
+<br>
+       save_256_unaligned_masked (<br>
+           (int*)pd, data_mask, pack_2x256_256 (xmm_dst_lo, xmm_dst_hi));<br>
+<br>
+       ps += 8;<br>
+       pd += 8;<br>
+       if (pm)<br>
+           pm += 8;<br>
+       w -= 8;<br>
+    }<br>
+}<br>
+<br>
 static const pixman_fast_path_t avx2_fast_paths[] =<br>
 {<br>
+    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, avx2_composite_over_8888_8888),<br>
+    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, avx2_composite_over_8888_8888),<br>
+    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, avx2_composite_over_8888_8888),<br>
+    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, avx2_composite_over_8888_8888),<br>
+    /* PIXMAN_OP_OVER_REVERSE */<br>
+    PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, avx2_composite_over_reverse_n_8888),<br>
+    PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, avx2_composite_over_reverse_n_8888),<br>
+    PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, avx2_composite_add_8888_8888),<br>
+    PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, avx2_composite_add_8888_8888),<br>
     { PIXMAN_OP_NONE },<br>
 };<br>
<br>
-static const pixman_iter_info_t avx2_iters[] = <br>
+#define IMAGE_FLAGS                                                    \<br>
+    (FAST_PATH_STANDARD_FLAGS | FAST_PATH_ID_TRANSFORM |               \<br>
+     FAST_PATH_BITS_IMAGE | FAST_PATH_SAMPLES_COVER_CLIP_NEAREST)<br>
+<br>
+static const pixman_iter_info_t avx2_iters[] =<br>
 {<br>
+    { PIXMAN_x8r8g8b8, IMAGE_FLAGS, ITER_NARROW,<br>
+      _pixman_iter_init_bits_stride, avx2_fetch_x8r8g8b8, NULL<br>
+    },<br>
     { PIXMAN_null },<br>
 };<br>
<br>
@@ -26,6 +654,11 @@ _pixman_implementation_create_avx2 (pixman_implementation_t *fallback)<br>
     pixman_implementation_t *imp = _pixman_implementation_create (fallback, avx2_fast_paths);<br>
<br>
     /* Set up function pointers */<br>
+    imp->combine_32[PIXMAN_OP_OVER] = avx2_combine_over_u;<br>
+    imp->combine_32[PIXMAN_OP_OVER_REVERSE] = avx2_combine_over_reverse_u;<br>
+    imp->combine_32[PIXMAN_OP_ADD] = avx2_combine_add_u;<br>
+    imp->combine_32[PIXMAN_OP_OUT_REVERSE] = avx2_combine_out_reverse_u;<br>
+<br>
     imp->iter_info = avx2_iters;<br>
<br>
     return imp;<br>
-- <br>
2.17.1<br>
<br>
_______________________________________________<br>
Pixman mailing list<br>
<a href="mailto:Pixman@lists.freedesktop.org" target="_blank">Pixman@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/pixman" target="_blank">https://lists.freedesktop.org/mailman/listinfo/pixman</a><o:p></o:p></p>
</blockquote>
</div>
</div>
</body>
</html>