<div><div><div>For a given pixel block, we need to perform the following steps</div><div>(for the over_n_8888_0565 case):</div><div><br></div>
<div>1. fetch dest</div><div>2. fetch mask</div><div>3. combine_mask_ca</div><div>4. convert dest to x888</div><div>5. combine_over_ca part A</div><div>6. combine_over_ca part B</div><div>7. convert result to 0565</div><div>8. store result</div><div>(put the cache preload somewhere in between)</div><div><br></div>
<div>Your version corresponds to</div><div>head = (3, 4, 5)</div><div>tail = (6, 7)</div><div>tail_head = (6, 7, 3, 4, 5, 8), with 1 and 2 placed in the middle of block 6</div><div><br></div>
<div>We can work out the input, output and temporary registers of each block,</div><div>so the dependency chain and the critical path can be identified.</div><div><br></div>
<div>Let's look at the core tail_head block:</div><div><br></div>
<div><div>.macro n_8888_0565_ca_tail_head</div>
<div> /* 6. combine_over_ca part B */</div>
<div> vrshr.u16 q10, q6, #8</div>
<div> vrshr.u16 q14, q7, #8</div>
<div> /* 1. fetch dest */</div>
<div> vrshr.u16 q15, q11, #8</div>
<div> vraddhn.u16 d16, q10, q6</div>
<div> vraddhn.u16 d17, q14, q7</div>
<div> vraddhn.u16 d18, q15, q11</div>
<div> /* 2. fetch mask */</div>
<div> /* bubble here if block 2 above is not present */</div>
<div> vqadd.u8 q8, q0, q8</div>
<div> /* bubble here if block 2 above is not present */</div>
<div> vqadd.u8 d18, d2, d18</div>
<div> /* bubble before block 7 below */</div>
<div><br></div>
<div> /* 7. convert result to 0565 */</div>
<div> vshll.u8 q14, d18, #8</div>
<div> vshll.u8 q10, d17, #8</div>
<div> vshll.u8 q15, d16, #8</div>
<div> vsri.u16 q14, q10, #5</div>
<div> /* bubble */</div>
<div> vsri.u16 q14, q15, #11</div>
<div><br></div>
<div> cache_preload 8, 8</div>
<div> /* 3. combine_mask_ca */</div>
<div> /* 4. convert dest to x888 */</div>
<div> /* 5. combine_over_ca part A */</div>
<div> /* 8. store destination */</div>
<div>.endm</div><div><br></div>
<div>I have marked the bubbles I could find.</div>
<div>We can make step 3 independent of (or at least less dependent on) steps 6 and 7 above by allocating registers appropriately.</div>
<div>Some instructions from step 3 can then be moved into the bubble positions marked above.</div>
<div>The output of step 1 (fetch dest) is read in step 4, and the output of step 2 (fetch mask) is read in step 3.</div>
<div>So I think you can fetch the mask first and then dest at the beginning of the tail_head block, and fill the remaining bubbles with instructions from step 3.</div>
<div><br></div>
<div>This may not work out, and there may be better ways to achieve optimal performance.</div>
<div><br></div>
<div>-- <br>Best Regards,<div>Taekyun Kim</div><br> </div></div></div></div>
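<div>P.S. As a rough illustration of the dependency analysis described above, here is a small Python sketch that flags instructions reading a register written only one or two instructions earlier. It is not part of pixman: the names d_regs, find_stalls and TAIL_HEAD are made up for this example, the destination and source registers are transcribed by hand from the macro above, and the fixed "window" is an assumed stand-in for real NEON pipeline latencies, which are not modelled.</div>

```python
def d_regs(reg):
    """Expand a NEON register name into the set of D-register indices it covers.

    qN aliases d(2N) and d(2N+1); dN covers only itself.
    """
    kind, index = reg[0], int(reg[1:])
    if kind == "q":
        return {2 * index, 2 * index + 1}
    if kind == "d":
        return {index}
    raise ValueError(f"unsupported register: {reg}")


def find_stalls(instructions, window=1):
    """Return indices of instructions that read a register written by one of
    the previous `window` instructions (candidate bubble positions)."""
    stalls = []
    for i, (_, _, sources) in enumerate(instructions):
        reads = set().union(*(d_regs(r) for r in sources))
        for j in range(max(0, i - window), i):
            if reads & d_regs(instructions[j][1]):
                stalls.append(i)
                break
    return stalls


# (mnemonic, destination, sources), hand-transcribed from the tail_head macro.
# vsri inserts into its destination, so the destination is also a source.
TAIL_HEAD = [
    ("vrshr.u16", "q10", ["q6"]),
    ("vrshr.u16", "q14", ["q7"]),
    ("vrshr.u16", "q15", ["q11"]),
    ("vraddhn.u16", "d16", ["q10", "q6"]),
    ("vraddhn.u16", "d17", ["q14", "q7"]),
    ("vraddhn.u16", "d18", ["q15", "q11"]),
    ("vqadd.u8", "q8", ["q0", "q8"]),
    ("vqadd.u8", "d18", ["d2", "d18"]),
    ("vshll.u8", "q14", ["d18"]),
    ("vshll.u8", "q10", ["d17"]),
    ("vshll.u8", "q15", ["d16"]),
    ("vsri.u16", "q14", ["q14", "q10"]),
    ("vsri.u16", "q14", ["q14", "q15"]),
]

if __name__ == "__main__":
    for i in find_stalls(TAIL_HEAD, window=1):
        mnemonic, dest, sources = TAIL_HEAD[i]
        print(i, mnemonic, dest, sources)
```

<div>With window=1 this flags the first vshll.u8 (it reads d18 right after vqadd.u8 writes it) and the second vsri.u16 (it reads q14 right after the first vsri.u16 writes it). A wider window flags more candidates, and these need not coincide with the bubbles marked by hand, since actual stalls depend on the instruction latencies of the target core.</div>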