An iwmmxt optimised shadowUpdateRotate16_270YX, with problems.

Fri Aug 26 13:46:55 PDT 2011

On Fri, Aug 26, 2011 at 9:54 PM, Marko Katić <dromede at gmail.com> wrote:
> I chose a 4x4 block size for my initial implementation. First I wanted to get
> it working. The block size could easily be changed to 8x4, 8x8 or maybe even
> 12x4. TLB misses could be completely avoided or at least cut to less than
> half. I'm also working on an improved pxafb driver that splits the
> framebuffer between internal SRAM (256K on a pxa270) and RAM. The RAM part
> is placed in a section mapping, greatly reducing the number of TLB entries.
> It may be possible to place the internal SRAM in a section mapping too; I
> have to try this. If it's possible, the fb proper and the shadowfb would
> only need 2 TLB entries.

OK, using section mapping is definitely a good idea to eliminate TLB
misses. Still, I expect that writing more than 4 pixels (8 bytes) per
scanline to the destination buffer would be useful. The "write
combining buffer" has that name for a reason: it helps to utilize
burst writes and make better use of memory bandwidth. But running
some benchmarks is always the best way to identify which memory access
pattern actually works best.
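
To make the "more pixels per destination scanline" idea concrete, here is
a minimal C sketch of a blocked 16bpp rotation where the length of each
contiguous destination write is a parameter. It is not the code from the
patch: the function names, the tightly packed buffers (no stride padding)
and the particular 270-degree convention (source pixel (x, y) lands on
destination row w-1-x, column y) are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive reference. src is w pixels wide and h high; dst is h wide and w high.
 * Assumed convention: src(x, y) -> dst row (w - 1 - x), column y. */
static void rotate270_ref(const uint16_t *src, uint16_t *dst,
                          size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            dst[(w - 1 - x) * h + y] = src[y * w + x];
}

/* Blocked variant (hypothetical helper). Each block covers `cols` source
 * columns and `run` source rows, so every destination scanline touched by
 * a block receives up to `run` contiguous pixels. A larger `run` means
 * longer sequential writes to the destination. */
static void rotate270_blocked(const uint16_t *src, uint16_t *dst,
                              size_t w, size_t h, size_t run, size_t cols)
{
    assert(run > 0 && cols > 0);
    for (size_t y0 = 0; y0 < h; y0 += run) {
        size_t yn = (h - y0 < run) ? h - y0 : run;      /* clip last block */
        for (size_t x0 = 0; x0 < w; x0 += cols) {
            size_t xn = (w - x0 < cols) ? w - x0 : cols;
            for (size_t x = x0; x < x0 + xn; x++) {
                /* Start of the contiguous run on this destination line. */
                uint16_t *d = dst + (w - 1 - x) * h + y0;
                for (size_t y = 0; y < yn; y++)
                    d[y] = src[(y0 + y) * w + x];
            }
        }
    }
}
```

Timing rotate270_blocked() with run set to 4, 8 and 16 on the target
device would be one way to check whether longer destination runs really
pay off there.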

> Could alignment be the problem with my patch? The shadowfb and fb proper
> lines are probably not 64 bit aligned before calling iwmmxt_rotate_copy...

Yes, very likely. Also, "mov pc, lr" is not Thumb interworking safe, and
your code clobbers callee-saved registers (r4-r11 must be preserved). See:
    http://infocenter.arm.com/help/topic/com.arm.doc.ihi0042d/IHI0042D_aapcs.pdf
This can be solved by using "push {r4, r6, r7, lr}" on entry and "pop
{r4, r6, r7, pc}" on exit.
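
The alignment suspicion can be checked from the C side before calling
the assembly routine. The sketch below assumes the copy loop uses 64-bit
loads and stores that want 8-byte-aligned addresses; both helper names
are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is aligned to `align` bytes. `align` must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Number of leading 16-bit pixels to copy with plain halfword accesses
 * before p reaches an 8-byte boundary. Assumes p is 2-byte aligned. */
static size_t pixels_to_align8(const uint16_t *p)
{
    size_t bytes = (size_t)((8 - ((uintptr_t)p & 7)) & 7);
    return bytes / sizeof(uint16_t);
}
```

A caller could assert is_aligned(line, 8) for every source and
destination line while debugging, or use pixels_to_align8() to peel off
the unaligned head of a line before entering the 64-bit loop.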

-- 
Best regards,
Siarhei Siamashka