radeonsi - crash with 7870 Tahiti - Bad Active CU detection?

Alexandre Biron bironalexandre at gmail.com
Mon Oct 12 20:43:26 PDT 2015


Hi!

Skipping si_setup_spi doesn't help at all. Where is harvesting
handled or set up, then? The only other place that seems related to me
is si_setup_rb, but from what I can see by logging, none of my backends
seem to be disabled.

Anything else I can check?
Thanks!
Alexandre

On Mon, Oct 12, 2015 at 5:20 PM, Alex Deucher <alexdeucher at gmail.com> wrote:
> On Thu, Oct 8, 2015 at 9:59 PM, Alexandre Biron
> <bironalexandre at gmail.com> wrote:
>> Hi!
>>
>> I've never been able to run the open source drm driver on my 7870
>> Tahiti card. The console kms works but it crashes as soon as X is
>> started. There have been many mentions of it in bug reports, but none
>> of the attempts at fixes worked.
>> https://bugs.freedesktop.org/show_bug.cgi?id=71689 is currently open
>> https://bugs.freedesktop.org/show_bug.cgi?id=60879 was about a few
>> different issues, including this one, but none of the proposed fixes
>> worked on my 7870.
>>
>> I decided to debug this on my own, and though I am a total noob at
>> driver development, I think I made some progress at understanding the
>> issue.
>>
>> The 7870 Tahiti is a "harvested" chip, which means some CUs are
>> disabled, 25% of them in this case. The code handling this is in
>> si.c, in the function si_setup_spi(). The idea seems to be that a
>> bit mask telling which CUs are truly available must be set in the
>> SPI_STATIC_THREAD_MGMT_3 register. But the algorithm that builds that
>> mask seems fuzzy to me. It walks the bits of active_cu until it finds
>> an active one, and stops there to build its mask.
>>
>>             data = RREG32(SPI_STATIC_THREAD_MGMT_3);
>>             active_cu = si_get_cu_enabled(rdev, cu_per_sh);
>>             mask = 1;
>>             for (k = 0; k < 16; k++) {
>>                 mask <<= k;
>>                 if (active_cu & mask) {
>>                     data &= ~mask;
>>                     WREG32(SPI_STATIC_THREAD_MGMT_3, data);
>>                     break;
>>                 }
>>             }
>>
>> However, from the little I understand that doesn't cover all cases,
>> but only works if the disabled CUs are in the lower bits. For my card,
>> the active_cu results are:
>> Decimal - Binary
>> 252 - 11111100
>> 252 - 11111100
>> 207 - 11001111
>> 252 - 11111100
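
As a quick consistency check on the 25% figure, the four masks listed above can be counted in user space. This is a minimal sketch, not driver code: CU_PER_SH is assumed to be 8 (inferred from the 8-bit patterns above), and count_active_cus() and disabled_percent() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8 CUs per shader array, inferred from the 8-bit masks above. */
#define CU_PER_SH 8u

/* Count the set bits (active CUs) in one active_cu mask. */
static unsigned count_active_cus(uint32_t active_cu)
{
    unsigned n = 0;

    for (unsigned bit = 0; bit < CU_PER_SH; bit++)
        n += (active_cu >> bit) & 1u;
    return n;
}

/* Percentage of disabled CUs across `count` masks (integer division). */
static unsigned disabled_percent(const uint32_t *masks, unsigned count)
{
    unsigned total = count * CU_PER_SH;
    unsigned active = 0;

    assert(count > 0);  /* avoid dividing by zero */
    for (unsigned i = 0; i < count; i++)
        active += count_active_cus(masks[i]);
    return (total - active) * 100u / total;
}
```

With the four values reported above (252, 252, 207, 252), each mask has six bits set, so 24 of 32 CUs are active and disabled_percent() returns 25, which agrees with the 25% figure.
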
>>
>> As you can see the 3rd group has its disabled CUs straight in the
>> middle of it, but the algorithm probably thinks that they are all good
>> since the first bit is 1 and it stops right there. So I guess it tries
>> to use bad CUs at runtime and fails miserably.
>> I tried to change the way the register data is computed, but I just
>> can't figure out the logic of it (and I couldn't find many details in
>> AMD's docs). Assuming it generates the right thing for active_cu ==
>> 11111100, I get data == 1111111111110111.
>> So I have 2 disabled units, but only a single 0 in the mask?
>> Pretending to have more disabled units, active_cu == 11110000, that
>> generates data == 1111111110111111. Again, single 0. Is that right?
>> What would be the bit pattern required for active_cu == 11001111 ?
>>
>> I might be way off track in my investigation, so please enlighten me!
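
The two data values above can be reproduced by replaying the quoted loop in user space. This is only a sketch of the logic as quoted, not the kernel code itself: replay_spi_loop() is a hypothetical name, the register read is replaced by a caller-supplied starting value (0xffff is assumed here), the register write is replaced by a return value, and the 32-bit types are an assumption. One thing the replay makes visible is that `mask <<= k` shifts cumulatively, so within 16 bits the loop as quoted tests bits 0, 1, 3, 6, 10 and 15 rather than every bit.

```c
#include <stdint.h>

/*
 * User-space replay of the loop quoted above.
 * `data` stands in for the value read from SPI_STATIC_THREAD_MGMT_3;
 * the return value stands in for what the loop would write back.
 * If no tested bit is active, `data` is returned unchanged.
 */
static uint32_t replay_spi_loop(uint32_t data, uint32_t active_cu)
{
    uint32_t mask = 1;

    for (unsigned k = 0; k < 16; k++) {
        mask <<= k;             /* cumulative shift, exactly as quoted */
        if (active_cu & mask) {
            data &= ~mask;      /* clear the first tested bit that is active */
            break;
        }
    }
    return data;
}
```

Starting from 0xffff, an active_cu of 11111100 gives 0xfff7 (1111111111110111) and 11110000 gives 0xffbf (1111111110111111), matching the values above. For 11001111 the result is 0xfffe, because bit 0 is already set and the loop stops on its first iteration.
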
>
> I don't think this register even needs to be programmed.  Can you try
> skipping the call to si_setup_spi()?  The hw default values 0xffff
> should be fine.  These registers are not for harvesting, but rather
> for limiting the number of CUs used by specific shader stages so even
> if you program them wrong, the hw will do the right thing internally.
>
> Alex


More information about the dri-devel mailing list