radeonsi - crash with 7870 Tahiti - Bad Active CU detection?

Alexandre Biron bironalexandre at gmail.com
Thu Oct 8 18:59:41 PDT 2015


Hi!

I've never been able to run the open source drm driver on my 7870
Tahiti card. The console kms works but it crashes as soon as X is
started. There have been many mentions of it in bug reports, but none
of the attempts at fixes worked.
https://bugs.freedesktop.org/show_bug.cgi?id=71689 is currently open
https://bugs.freedesktop.org/show_bug.cgi?id=60879 was about a few
different issues, including this one, but none of the proposed fixes
worked on my 7870.

I decided to debug this on my own, and though I am a total noob at
driver development, I think I made some progress at understanding the
issue.

The 7870 Tahiti is a "harvested" chip, which means some CUs are
disabled, 25% of them in this case. The code handling this is in
si.c, in the function si_setup_spi(). The idea seems to be that a
bit mask telling which CUs are truly available must be set in the
SPI_STATIC_THREAD_MGMT_3 register. But the algorithm to build that
mask seems fuzzy to me. It walks the bits of active_cu until it finds
an active one, and stops there to build its mask.

            data = RREG32(SPI_STATIC_THREAD_MGMT_3);
            active_cu = si_get_cu_enabled(rdev, cu_per_sh);
            mask = 1;
            for (k = 0; k < 16; k++) {
                mask <<= k;
                if (active_cu & mask) {
                    data &= ~mask;
                    WREG32(SPI_STATIC_THREAD_MGMT_3, data);
                    break;
                }
            }

However, from the little I understand that doesn't cover all cases,
but only works if the disabled CUs are in the lower bits. For my card,
the active_cu results are:
Decimal - Binary
252 - 11111100
252 - 11111100
207 - 11001111
252 - 11111100
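
As a cross-check of the 25% figure, the disabled CUs in each group can be
read straight off the values above. This is a minimal user-space sketch,
not driver code: the helper names disabled_cu_mask() and count_bits() are
made up for illustration, and the width of 8 CUs per group is an
assumption taken from the 8-digit binary values listed.

```c
#include <stdint.h>

/* Hypothetical helper: bitmask of disabled CUs (1 = disabled) within the
 * low `width` bits of active_cu. `width` is assumed to be the number of
 * CUs in one group. */
static uint32_t disabled_cu_mask(uint32_t active_cu, unsigned width)
{
    uint32_t all = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ~active_cu & all;
}

/* Hypothetical helper: number of set bits in v. */
static unsigned count_bits(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

With a width of 8, the value 252 gives a disabled mask of 00000011 and 207
gives 00110000, so each of the four groups has 2 of 8 CUs disabled, which
is 8 of 32 overall and matches the 25%.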

As you can see the 3rd group has its disabled CUs straight in the
middle of it, but the algorithm probably thinks that they are all good
since the first bit is 1 and it stops right there. So I guess it tries
to use bad CUs at runtime and fails miserably.
I tried to change the way the register data is computed, but I just
can't figure out the logic of it (and I couldn't find many details in
AMD's doc). Assuming it generates the right thing for active_cu ==
11111100, I get data == 1111111111110111.
So I have 2 disabled units, but only a single 0 in the mask?
Pretending to have more disabled units, active_cu == 11110000, that
generates data == 1111111110111111. Again, single 0. Is that right?
What would be the bit pattern required for active_cu == 11001111 ?
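
The data values quoted above can be reproduced without touching the
hardware by replaying the loop in user space. The sketch below copies the
loop from the snippet as-is; replay_spi_loop() is a made-up name, and the
initial_data argument stands in for RREG32(SPI_STATIC_THREAD_MGMT_3),
whose real starting value is an assumption here (all ones in the low 16
bits is consistent with the results reported above).

```c
#include <stdint.h>

/* User-space replay of the quoted loop; no register access.
 * initial_data is an assumed stand-in for the value read from
 * SPI_STATIC_THREAD_MGMT_3. Returns the value the loop would write,
 * or initial_data unchanged if no tested bit is set. */
static uint32_t replay_spi_loop(uint32_t active_cu, uint32_t initial_data)
{
    uint32_t data = initial_data;
    uint32_t mask = 1;
    int k;

    for (k = 0; k < 16; k++) {
        /* Same statement as in the snippet. The shift accumulates, so
         * mask lands on bits 0, 1, 3, 6, 10, 15, ... rather than on
         * every bit in turn. */
        mask <<= k;
        if (active_cu & mask) {
            data &= ~mask;   /* clear the first tested bit that is active */
            break;
        }
    }
    return data;
}
```

Starting from 0xFFFF, replay_spi_loop(0xFC, 0xFFFF) returns 0xFFF7
(1111111111110111) and replay_spi_loop(0xF0, 0xFFFF) returns 0xFFBF
(1111111110111111), the same two values as above. For active_cu ==
11001111 the loop stops at bit 0 and returns 0xFFFE. This only shows what
the loop as written produces; whether that is the pattern the register
expects is exactly the open question.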

I might be way off track in my investigation, so please enlighten me!
Thanks for your help,
Alexandre

