[Mesa-dev] [PATCH 1/3] cso: don't release sampler states that are bound

Nicolai Hähnle nhaehnle at gmail.com
Wed Dec 7 17:03:17 UTC 2016


On 07.12.2016 17:26, Marek Olšák wrote:
> Optimizing the CSO cache isn't exactly on the top of my list, so I
> can't really do that right now.
>
> I think that varying the LOD bias is starting to be common. It's used
> for smooth LOD transitions when loading textures during rendering.
> Games with lots of content typically do that. This particular Batman
> game uses UE3.
>
> I used to play Borderlands 2 (UE3), it had smooth LOD transitions too,
> but it never hung.
>
> Does this patch have your Rb?

Yes, go ahead.

Nicolai

>
> Thanks,
> Marek
>
> On Wed, Dec 7, 2016 at 11:16 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> On 07.12.2016 08:50, Michel Dänzer wrote:
>>>
>>> On 06/12/16 10:24 PM, Marek Olšák wrote:
>>>>
>>>> On Mon, Dec 5, 2016 at 10:05 AM, Michel Dänzer <michel at daenzer.net>
>>>> wrote:
>>>>>
>>>>> On 03/12/16 05:38 AM, Marek Olšák wrote:
>>>>>>
>>>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>>>
>>>>>> This fixes random radeonsi GPU hangs in Batman Arkham: Origins (Wine)
>>>>>> and
>>>>>> probably many other games too.
>>>>>>
>>>>>> cso_cache deletes sampler states when the cache size is too big and
>>>>>> doesn't
>>>>>> check which sampler states are bound, causing use-after-free in
>>>>>> drivers.
>>>>>> Because of that, radeonsi uploaded garbage sampler states and the
>>>>>> hardware
>>>>>> went bananas. Other drivers may have experienced similar issues.
>>>>>>
>>>>>> Cc: 13.0 12.0 <mesa-stable at lists.freedesktop.org>
>>>>>> ---
>>>>>>  src/gallium/auxiliary/cso_cache/cso_cache.c | 4 +++-
>>>>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/src/gallium/auxiliary/cso_cache/cso_cache.c
>>>>>> b/src/gallium/auxiliary/cso_cache/cso_cache.c
>>>>>> index b240c93..1f3be4b 100644
>>>>>> --- a/src/gallium/auxiliary/cso_cache/cso_cache.c
>>>>>> +++ b/src/gallium/auxiliary/cso_cache/cso_cache.c
>>>>>> @@ -181,21 +181,23 @@ static inline void sanitize_cb(struct cso_hash
>>>>>> *hash, enum cso_cache_type type,
>>>>>>        --to_remove;
>>>>>>     }
>>>>>>  }
>>>>>>
>>>>>>  struct cso_hash_iter
>>>>>>  cso_insert_state(struct cso_cache *sc,
>>>>>>                   unsigned hash_key, enum cso_cache_type type,
>>>>>>                   void *state)
>>>>>>  {
>>>>>>     struct cso_hash *hash = _cso_hash_for_type(sc, type);
>>>>>> -   sanitize_hash(sc, hash, type, sc->max_size);
>>>>>> +
>>>>>> +   if (type != CSO_SAMPLER)
>>>>>> +      sanitize_hash(sc, hash, type, sc->max_size);
>>>>>>
>>>>>>     return cso_hash_insert(hash, hash_key, state);
>>>>>>  }
>>>>>>
>>>>>>  struct cso_hash_iter
>>>>>>  cso_find_state(struct cso_cache *sc,
>>>>>>                 unsigned hash_key, enum cso_cache_type type)
>>>>>>  {
>>>>>>     struct cso_hash *hash = _cso_hash_for_type(sc, type);
>>>>>>
>>>>>>
>>>>>
>>>>> If I understand correctly, this means that the maximum cache size
>>>>> effectively isn't enforced for sampler states? Wouldn't it be better
>>>>> instead to add code which prevents currently bound states from being
>>>>> deleted from here?
>>>>
>>>>
>>>> Not in this patch. Maybe later and if CPU profiling results show that
>>>> it matters.
>>>>
>>>> Currently, if the cache size is exceeded, the GPU usually hangs. If
>>>> pruning the cache was important, most apps would hang.
>>>>
>>>> Thus, I'd like to push this as-is.
>>>
>>>
>>> Fair enough, but this will require some follow-up work:
>>>
>>> Why are sampler states singled out? The same problem can at least
>>> theoretically happen with other CSOs as well, right?
>>
>>
>> The other CSOs only have single bind points. So with other CSOs, what
>> happens is that:
>>
>> 1. State tracker wants to bind some state, creates the CSO.
>> 2. cso_cache gets cleaned out; this might delete the currently bound CSO.
>> 3. Currently bound CSO gets immediately overwritten with the new state
>> anyway.
>>
>> Samplers are different because there are many bind points within a single
>> context.
>>
>> You're right though that the sounder approach would be to check all CSOs
>> against current bind points during cache cleanup, just in case a driver does
>> something with the old state on unbind.
>>
>>
>>> The maximum cache size needs to be addressed, either something like
>>> above or by just removing the maximum cache size.
>>>
>>>
>>> BTW, if varying the LOD bias is common, maybe the LOD bias value should
>>> be tracked separately from the remaining sampler state?
>>
>>
>> It's not common at all, actually. Some apps are just weird.
>>
>>
>>> P.S. The bug has been there forever, so I don't see the justification
>>> for bypassing the normal stable branch process.
>>
>>
>> And we've had mysterious hang bug reports forever as well. I wouldn't be
>> surprised if a large number or even all of them could be traced back to this
>> issue.
>>
>> Nicolai


More information about the mesa-dev mailing list