radeonsi - crash with 7870 Tahiti - Bad Active CU detection?

Alex Deucher alexdeucher at gmail.com
Mon Oct 12 14:20:06 PDT 2015


On Thu, Oct 8, 2015 at 9:59 PM, Alexandre Biron
<bironalexandre at gmail.com> wrote:
> Hi!
>
> I've never been able to run the open source drm driver on my 7870
> Tahiti card. The console kms works but it crashes as soon as X is
> started. There have been many mentions of it in bug reports, but none
> of the attempts at fixes worked.
> https://bugs.freedesktop.org/show_bug.cgi?id=71689 is currently open
> https://bugs.freedesktop.org/show_bug.cgi?id=60879 was about a few
> different issues, including this one, but none of the proposed fixes
> worked on my 7870
>
> I decided to debug this on my own, and though I am a total noob at
> driver development, I think I made some progress at understanding the
> issue.
>
> The 7870 Tahiti is a "harvested" chip, which means some CUs are
> disabled. 25% of them in this case.  The code handling this is in
> si.c, in the function si_setup_spi(). The idea seems to be that a
> bit mask telling which CUs are truly available must be set in the
> SPI_STATIC_THREAD_MGMT_3 register. But the algorithm to build that
> mask seems fuzzy to me. It walks the bits of active_cu until it finds
> an active one, and stops there to build its mask.
>
>             data = RREG32(SPI_STATIC_THREAD_MGMT_3);
>             active_cu = si_get_cu_enabled(rdev, cu_per_sh);
>             mask = 1;
>             for (k = 0; k < 16; k++) {
>                 mask <<= k;
>                 if (active_cu & mask) {
>                     data &= ~mask;
>                     WREG32(SPI_STATIC_THREAD_MGMT_3, data);
>                     break;
>                 }
>             }
>
> However, from the little I understand that doesn't cover all cases,
> but only works if the disabled CUs are in the lower bits. For my card,
> the active_cu results are:
> Decimal - Binary
> 252 - 11111100
> 252 - 11111100
> 207 - 11001111
> 252 - 11111100
>
> As you can see the 3rd group has its disabled CUs straight in the
> middle of it, but the algorithm probably thinks that they are all good
> since the first bit is 1 and it stops right there. So I guess it tries
> to use bad CUs at runtime and fails miserably.
> I tried to change the way the register data is computed, but I just
> can't figure out the logic of it (and I couldn't find many details in
> AMD's doc). Assuming it generates the right thing for active_cu ==
> 11111100, I get data == 1111111111110111.
> So I have 2 disabled units, but only a single 0 in the mask?
> Pretending to have more disabled units, active_cu == 11110000, that
> generates data == 1111111110111111. Again, single 0. Is that right?
> What would be the bit pattern required for active_cu == 11001111 ?
>
> I might be way off track in my investigation, so please enlighten me!
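
For reference, the quoted loop can be modelled in user space to see which
bits it actually tests. The following is only a sketch: the function names
spi_data_after_loop() and count_active_cus() are made up for illustration,
the register read is replaced by a plain argument, and the starting value
0xffff is an assumption taken from the numbers in the message. It models
the quoted snippet only, not what the hardware does with the result.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model of the loop quoted from si_setup_spi().
 * "data" stands in for the value read from SPI_STATIC_THREAD_MGMT_3 and
 * the return value stands in for what would be written back.
 */
static uint32_t spi_data_after_loop(uint32_t data, uint32_t active_cu)
{
    uint32_t mask = 1;
    int k;

    for (k = 0; k < 16; k++) {
        /*
         * The shift accumulates: mask is not reset each iteration, so
         * the bits tested are 0, 1, 3, 6, 10, 15, ... and not 0, 1, 2, ...
         */
        mask <<= k;
        if (active_cu & mask) {
            data &= ~mask;  /* clear exactly one bit, then stop */
            break;
        }
    }
    return data;
}

/* Count the set bits in an active_cu value (illustrative helper). */
static unsigned int count_active_cus(uint32_t active_cu)
{
    unsigned int n = 0;

    while (active_cu) {
        n += active_cu & 1u;
        active_cu >>= 1;
    }
    return n;
}
```

Starting from 0xffff, this model gives 0xfff7 for active_cu == 11111100
and 0xffbf for active_cu == 11110000, which matches the two values
reported above, and 0xfffe for active_cu == 11001111. Each of the four
reported active_cu values has 6 bits set, so assuming 8 CUs per group
that is 24 of 32 active, consistent with 25% being disabled.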

I don't think this register even needs to be programmed.  Can you try
skipping the call to si_setup_spi()?  The hw default values 0xffff
should be fine.  These registers are not for harvesting, but rather
for limiting the number of CUs used by specific shader stages so even
if you program them wrong, the hw will do the right thing internally.

Alex

