[Mesa-dev] r600g old design -> new design

Tue Sep 28 08:40:24 PDT 2010

Hi,

The new design is now on par with the old design as far as piglit is concerned,
so I am switching the new design to be the default. If nothing comes up in the
coming days I will remove the old design and rename some files:
winsys r600_state2.c -> r600_hw_context.c
winsys evergreen_state.c -> evergreen_hw_context.c
pipe r600_state2.c -> r600_state.c

I will also move things around (mostly distributing parts of pipe r600_state2
into the various r600_blit, r600_texture, ... files).

I have tried to make sure that every fix that went into the old design is also
in the new one, but I might have missed a few things; sorry if I did.

The new design already seems faster than the previous one, which is a good sign
given that it has no optimizations yet. Here is a list of things to do to
improve performance:

- use a score for placing bos: bo placement will be recorded in the bo structure
and each time a state is bound the bo score will be updated (bos bound as
framebuffers will get their score for placement in VRAM increased, while bos
bound as small vertex buffers will end up in GTT; also, any time a bo is mapped
for a transfer for CPU read its score for GTT placement increases, so bos that
are often updated by the CPU will more likely be placed in GTT)

- optimize flushing inside the cs: avoiding excessive flushes can give a huge
perf increase according to some early benchmarks. The idea is to keep a list of
flushing state in the reloc struct of the cs and avoid flushing if the bo is
already flushed. To behave properly with the ddx we will always flush at the
end of the cs.

- remove the groups and use a hash table; this will simplify the code some more
and avoid mistakes like picking the wrong group for a register (it also reduces
the number of dwords per register)

- account for bo size in each cs, make sure there is enough memory for the
current cs, and flush if needed (this will be added in the winsys along with
the reloc information)

- properly disable query when clearing buffer

- avoid rebuilding the vertex/fragment shader at each draw command; this is a
very big performance boost. We need to record some of the information specific
to the shader and compare it with the current state to see if it needs
rebuilding. A further optimization would be to use a fetch shader.

- shader compiler! (though I am fairly convinced as of today that this is not
our biggest bottleneck)
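
To make the bo scoring item above a bit more concrete, here is a rough sketch
in C. All names and weights are made up for illustration (they are not the
actual r600g structures), and picking the domain by a plain comparison of the
two scores is an assumption, not a decided policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names and weights; not the actual r600g structures. */
enum bo_domain { BO_DOMAIN_GTT, BO_DOMAIN_VRAM };

enum bo_usage {
    BO_USE_FRAMEBUFFER,   /* bo bound as a framebuffer */
    BO_USE_SMALL_VERTEX,  /* bo bound as a small vertex buffer */
    BO_USE_CPU_READ       /* bo mapped for a transfer for CPU read */
};

/* Placement scores recorded in the bo structure. */
struct bo_score {
    unsigned vram;
    unsigned gtt;
};

/* Called each time a state using the bo is bound, or the bo is mapped. */
void bo_score_update(struct bo_score *s, enum bo_usage use)
{
    assert(s != NULL);
    switch (use) {
    case BO_USE_FRAMEBUFFER:  s->vram += 4; break;  /* arbitrary weight */
    case BO_USE_SMALL_VERTEX: s->gtt  += 1; break;  /* arbitrary weight */
    case BO_USE_CPU_READ:     s->gtt  += 2; break;  /* arbitrary weight */
    }
}

/* Pick the domain with the higher score; ties fall back to GTT. */
enum bo_domain bo_score_domain(const struct bo_score *s)
{
    assert(s != NULL);
    return s->vram > s->gtt ? BO_DOMAIN_VRAM : BO_DOMAIN_GTT;
}
```

A bo that is mostly bound as a framebuffer ends up with a VRAM score above its
GTT score, while a bo that is mostly mapped for CPU reads drifts towards GTT.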

Alongside this gallium work, I think more optimization of our TTM usage pattern
could give a huge performance boost. I will add some infrastructure to allow
gathering information about TTM activity on the bo side (number of megabytes
moved around, and other memory-related operations).

The separation of packet building and register values was intentional, to allow
testing a new way to submit commands to the GPU. The new r600g design is pretty
close to what I think would be an interesting scheme. Here is a brief
description of what the API could be.

The user will submit a list of 4-dword structures:
struct drm_radeon_reg {
    u32  offset;
    u32  value;
    u32  handle;
    u32  misc;
};

One reg is considered to be an initiator, i.e. it triggers rendering or GPU
activity. The kernel will use a structure similar to what the winsys uses
today. It will keep track of the GPU context. Performing checks becomes a lot
easier: it is simply a matter of looking up the register.
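
As a rough illustration of the kernel side, the sketch below keeps a shadow
copy of the register values and treats one register as the initiator. The
context size, the initiator offset and the function names are all assumptions
made up for this example; real checking would be per register and per chip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the struct above, using stdint types for a userspace sketch. */
struct drm_radeon_reg {
    uint32_t offset;
    uint32_t value;
    uint32_t handle;
    uint32_t misc;
};

#define CTX_NREGS          8      /* hypothetical size of the tracked context */
#define REG_DRAW_INITIATOR 0x1cu  /* made-up offset of the initiator register */

/* GPU context tracked by the kernel: one shadow value per known register. */
struct gpu_context {
    uint32_t shadow[CTX_NREGS];
    unsigned draws;               /* how many times the initiator was written */
};

/* Check and record a list of regs. Returns 0 on success, -1 on a bad offset. */
int ctx_submit(struct gpu_context *ctx,
               const struct drm_radeon_reg *regs, size_t n)
{
    assert(ctx != NULL && (regs != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        uint32_t idx = regs[i].offset / 4;

        /* Checking is a plain lookup: is this a dword-aligned known reg? */
        if (regs[i].offset % 4 != 0 || idx >= CTX_NREGS)
            return -1;
        ctx->shadow[idx] = regs[i].value;

        /* Writing the initiator is what triggers rendering. */
        if (regs[i].offset == REG_DRAW_INITIATOR)
            ctx->draws++;
    }
    return 0;
}
```

Here the lookup is just an array index; a hash table keyed on the offset would
work the same way for a sparse register space.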

Backward & forward compatibility can also be improved: the assumption is that
the kernel will drop & ignore registers it doesn't know about, and in the ioctl
return it might update some bits of misc to signal to userspace which regs were
dropped (thus no need for versioning; userspace can probe which regs it can
use).
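
The dropping of unknown registers could look roughly like the sketch below.
The bit used in misc, the table of known offsets and the function names are
hypothetical; nothing here is an existing ioctl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct drm_radeon_reg {
    uint32_t offset;
    uint32_t value;
    uint32_t handle;
    uint32_t misc;
};

/* Hypothetical bit in misc, set by the kernel on regs it dropped. */
#define REG_MISC_DROPPED (1u << 0)

/* Made-up table of offsets this "kernel version" knows about. */
static const uint32_t known_offsets[] = { 0x00u, 0x04u, 0x08u };

int reg_is_known(uint32_t offset)
{
    size_t n = sizeof(known_offsets) / sizeof(known_offsets[0]);
    for (size_t i = 0; i < n; i++)
        if (known_offsets[i] == offset)
            return 1;
    return 0;
}

/*
 * Mark every unknown reg as dropped instead of failing the whole submission.
 * Returns the number of dropped regs; userspace reads misc back to find them.
 */
size_t regs_drop_unknown(struct drm_radeon_reg *regs, size_t n)
{
    size_t dropped = 0;

    assert(regs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (reg_is_known(regs[i].offset)) {
            regs[i].misc &= ~REG_MISC_DROPPED;
        } else {
            regs[i].misc |= REG_MISC_DROPPED;
            dropped++;
        }
    }
    return dropped;
}
```

Userspace can then probe support by submitting a reg once and checking whether
the dropped bit comes back set.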

The kernel will now have a better understanding of what's going on and thus can
optimize away more of the flushes (which seem to be a huge hit on performance).
In case of dual API use (i.e. if the cs ioctl is also used) the kernel will
assume the GPU has lost its context and will reprogram and reflush everything.
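
A minimal sketch of that flush tracking, assuming the kernel keeps one
"already flushed" flag per bo in its context. The structure, the fixed bo
count and the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KCTX_MAX_BOS 4  /* hypothetical, fixed size to keep the sketch short */

/* Per-context flush state tracked by the kernel. */
struct kctx {
    bool flushed[KCTX_MAX_BOS];  /* true if the bo needs no new flush */
    unsigned flushes;            /* number of flushes actually emitted */
};

/* A bo is used by a submission: flush only if it is not already flushed. */
void kctx_use_bo(struct kctx *ctx, size_t bo)
{
    assert(ctx != NULL && bo < KCTX_MAX_BOS);
    if (!ctx->flushed[bo]) {
        ctx->flushes++;
        ctx->flushed[bo] = true;
    }
}

/* The old cs ioctl was used: assume the context is lost, reflush everything. */
void kctx_legacy_cs(struct kctx *ctx)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < KCTX_MAX_BOS; i++)
        ctx->flushed[i] = false;
}
```

Using the same bo twice in a row costs one flush instead of two, and any use of
the old cs path simply resets the tracking.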

A further optimization could be to keep a context per process and allow
userspace to only update its context, which could shrink the amount of data to
communicate between userspace & kernel.

Some features will go through a different path, like occlusion queries, maybe
some special query space.

For multi-GPU the idea is to have a context per GPU and allow userspace to
select which GPU it is updating (or maybe through some bits in misc, like a
bitmask, to know on which GPUs to mirror the state).

I intend to play with this kernel idea as time permits over the next few
months; I hope we can improve performance through such an API.

Cheers,
Jerome