No subject


Wed Nov 23 21:26:15 PST 2011


are looking for. I recognize that it disrupts your current views/plans on 
how this should be done, but I do want to work with you to find a suitable 
middle ground that covers most of the possibilities.

In case you are looking at my code to follow the above-described 
scenarios, please make sure you pull the latest stuff from my github 
repository. I have been pushing new stuff since my original announcement.


> I still foresee problems with tiling, we generally don't encourage
> accel code to live in the kernel, and you'll really want a
> tiled->untiled blit for this thing,

Accel code should not go into the kernel (I fully agree with that), and 
there is nothing here that would compel us to do so. Restricting my 
comments to Radeon GPUs (the only ones I know well enough), the shaders 
for blit copy already live in the kernel, irrespective of the VCRTCM work. 
I rely on them to move the frame buffer out of VRAM to the CTD device, but 
I don't add any additional features.

Now for de-tiling: I think it should be the responsibility of the 
receiving CTD device, not the GPU pushing the data (Alan mentioned that 
during the initial set of comments, and although I didn't reply at the 
time, that has been my view as well).

Even if you wanted to use the GPU for de-tiling (and I'll explain shortly 
why you should not), it would not require any new accel code in the 
kernel. It would merely require one bit flip in the setup of the blit copy 
that already lives in the kernel.

However, de-tiling in the GPU is a bad idea, for two reasons. I tried it 
as an experiment on Radeon GPUs and watched what happens on the bus with a 
PCI Express analyzer (yeah, I have some "heavy weapons" in my lab). 
Normally a tile is a contiguous array of memory locations in VRAM. If the 
blit-copy function is told to assume a tiled source and a linear 
destination (de-tiling), it will read a contiguous set of addresses in 
VRAM, but then scatter 8 rows of 8 pixels each onto a non-contiguous set 
of destination addresses. If the destination is the PCI Express bus, the 
result is 8 32-byte write transactions instead of 2 128-byte transactions 
per tile. That will choke the throughput of the bus right there.
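To make the arithmetic concrete, here is a back-of-the-envelope sketch. 
The parameters are my assumptions, not values taken from the driver: 8x8 
tiles of 32-bit pixels (256 bytes per tile) and a 128-byte PCIe maximum 
payload.

```python
TILE_DIM = 8          # pixels per tile edge (assumed)
BPP = 4               # bytes per 32-bit pixel (assumed)
MAX_PAYLOAD = 128     # typical PCIe max payload in bytes (assumed)

tile_bytes = TILE_DIM * TILE_DIM * BPP   # 256 bytes per tile

# Matching layouts on both ends: the whole tile is one contiguous run,
# so it packs into ceil(256 / 128) = 2 full-payload write transactions.
contiguous_writes = -(-tile_bytes // MAX_PAYLOAD)

# Tiled source -> linear destination (de-tiling): each of the 8 tile
# rows lands at a different destination address (one scanline apart),
# so every row becomes its own 8 * 4 = 32-byte write transaction.
scattered_writes = TILE_DIM              # 8 transactions per tile
scattered_size = TILE_DIM * BPP          # 32 bytes each

print(contiguous_writes, scattered_writes, scattered_size)
```

So the same 256 bytes move in 2 long bursts or 8 short ones, which is 
where the throughput collapse comes from.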

BTW, this is the crux of the blit-copy performance improvement that you 
got from me back in October. Since blit copy deals with copying a linear 
array, playing with the tiled/non-tiled bits only affects the order in 
which addresses are accessed, so the trick was to get rid of short PCIe 
transactions and also to shape the linear-to-rectangle mapping to make 
the address pattern more friendly to the host.
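The address-ordering effect can be sketched in a few lines of Python. This 
is a toy model with made-up parameters (a hypothetical 1024-pixel-wide 
surface), not the actual blit-copy code:

```python
TILE = 8              # 8x8-pixel tiles (assumed)
BPP = 4               # bytes per pixel (assumed)
PITCH = 1024 * BPP    # bytes per scanline of a hypothetical surface

def detiled_dest_addresses(tile_x, tile_y):
    """Destination byte addresses written when de-tiling one tile
    into a linear surface: 8 rows, each PITCH bytes apart."""
    base = tile_y * TILE * PITCH + tile_x * TILE * BPP
    return [base + row * PITCH + col * BPP
            for row in range(TILE) for col in range(TILE)]

def contiguous_runs(addrs, step):
    """Split an address sequence into maximal contiguous runs."""
    runs, run = [], 1
    for prev, cur in zip(addrs, addrs[1:]):
        if cur == prev + step:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    return runs

runs = contiguous_runs(detiled_dest_addresses(0, 0), BPP)
print(runs)   # 8 runs of 8 pixels: 8 short 32-byte bursts per tile
```

Reordering the accesses so that consecutive addresses stay consecutive on 
the bus is exactly what turns those short bursts into long ones.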


> also for Intel GPUs where you have
> UMA, would you read from the UMA.
>

Yes, the read would be from UMA. I have not yet looked at Intel GPUs in 
detail, so I don't have an answer for you on what problems would pop up 
and how to solve them, but I'll be glad to revisit the Intel discussion 
once I do some homework.

My initial thought is that Intel frame buffers are, at the end of the 
day, pages in system memory, so anyone/anything can get to them if they 
are mapped correctly.


> It also doesn't solve the optimus GPU problem in any useful fashion,
> since it can't deal with all the use cases, so we still have to write
> an alternate solution that can deal with them, so we just end up with
> two answers.
>

Can you elaborate on some specific use cases that concern you? I have had 
this case in mind, and I think I can make it work. First I would have to 
add CTD functionality to the Intel driver; that should be 
straightforward. Once I get there, I'll be ready to experiment, and we'll 
probably be in a better position to discuss the specifics then (i.e., 
when we have something working to compare with what you did in the PRIME 
experiment), but it would be good to know your specific concerns early.


thanks,

Ilija



More information about the dri-devel mailing list