multi-threading task under SolarMutex -> deadlock

Armin Le Grand armin_le_grand at me.com
Wed May 18 08:43:30 UTC 2016


Hi Norbert,

thanks for also keeping an eye on this - I am currently watching the 
failure reports on ci.libreoffice.org, too.
The latest is from http://ci.libreoffice.org/job/lo_tb_master_linux_dbg/7195/, 
so from Friday the 13th (uhhh...)

Have you seen such or similar stacks anywhere else? In the meantime I 
ran ChartTest many times locally on Linux and Win, but could never 
reproduce it.

The SolarMutex thing is surely not good, but I would guess it is only a 
symptom showing up. There are tests and code in SC, e.g., that also use 
massive parallelism, not limited to an upper core count. The basic 
problem is that the MainThread always holds the SolarMutex, also while 
calling waitUntilEmpty(). The consequence is that no WorkerThread is 
allowed to get the SolarMutex, so multithreaded actions are limited to 
work that does not need it.
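
To illustrate the pattern, here is a minimal sketch in plain C++. It is 
not LibreOffice code: g_solarMutexStandIn, workerNeedsMutex() and 
mainWaitsWhileHoldingMutex() are hypothetical names, and the 50 ms 
timeout is only there so that the sketch terminates where a real worker 
would block forever.

```cpp
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical stand-in for the SolarMutex; not the real LibreOffice class.
static std::recursive_timed_mutex g_solarMutexStandIn;

// Runs on a worker thread and reports whether it could take the lock.
// A real worker would wait forever; the timeout lets the sketch finish.
bool workerNeedsMutex()
{
    std::unique_lock<std::recursive_timed_mutex> guard(
        g_solarMutexStandIn, std::chrono::milliseconds(50));
    return guard.owns_lock();
}

// The main thread keeps the lock while it waits for the worker,
// comparable to holding the SolarMutex during waitUntilEmpty().
bool mainWaitsWhileHoldingMutex()
{
    std::lock_guard<std::recursive_timed_mutex> guard(g_solarMutexStandIn);
    std::future<bool> worker =
        std::async(std::launch::async, workerNeedsMutex);
    // Waits for the worker while still holding the lock.
    return worker.get();
}
```

As long as the waiting thread holds the lock, the worker cannot acquire 
it, so mainWaitsWhileHoldingMutex() returns false. Without the timeout 
both threads would wait on each other.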

I knew that and made sure that the multithreaded 3DRenderer 
WorkerThreads do not need the SolarMutex for their work. I did not know 
yet that the memory failure handler tries to get the SolarMutex, too, 
but that is logical when it wants to bring up a dialog in some form.
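
For comparison, a handler that stays away from locks could look roughly 
like the sketch below. This is an assumption for illustration only, not 
how the LibreOffice handler is written: g_signalSeen, recordSignalOnly() 
and raiseAndCheck() are hypothetical names, and SIGTERM is used merely 
because it is harmless to raise in a demo.

```cpp
#include <csignal>

// Written by the handler, read later by normal code (hypothetical name).
static volatile std::sig_atomic_t g_signalSeen = 0;

// Only records the signal number: no locks, no allocation, no dialogs.
extern "C" void recordSignalOnly(int sig)
{
    g_signalSeen = sig;
}

// Installs the handler, raises SIGTERM, then restores the old handler.
bool raiseAndCheck()
{
    g_signalSeen = 0;
    auto previous = std::signal(SIGTERM, recordSignalOnly);
    if (previous == SIG_ERR)
        return false;
    std::raise(SIGTERM); // the handler runs before raise() returns
    std::signal(SIGTERM, previous);
    return g_signalSeen == SIGTERM;
}
```

Any work that needs a lock would then happen later in normal thread 
context. For a crash signal that cannot be deferred this is not 
sufficient on its own, but it shows the kind of restriction meant.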

But the deeper problem is that allocation - here extending a vector of 
pointers to a helper class from 1 to 2 entries - fails. Sometimes. And 
that only on many cores on that machine (up to now).

I checked all involved classes, their refcounting, and that the used 
o3tl::cow_wrapper uses the ThreadSafeRefCountingPolicy; this looks good 
so far. It is also not the case that the WorkerThreads need massive 
amounts of their own memory, so I doubt that limiting to e.g. 8 threads 
would change this, except maybe making it less likely to happen. I 
looked at o3tl::cow_wrapper itself, and at the basic B2D/B3DPrimitive 
implementations, which internally use a comphelper::OBaseMutex e.g. for 
creating buffered decompositions.

I have found no concrete reason so far; any tips/help much appreciated.

I keep watching this - at least it did not happen in any of the builds 
since the 13th and on no other machine, so the task now is to somehow 
nail it down and make it reproducible. If someone has other traces, 
please send them! I would hate to take this back, esp. because we will 
need multithreading more and more since Moore's law is tilting.

Sincerely,

Armin


On 17.05.2016 at 14:35, Norbert Thiebaud wrote:
> On Tue, May 17, 2016 at 6:44 AM, Thorsten Behrens <thb at libreoffice.org> wrote:
>> Norbert Thiebaud wrote:
>>> The threaded work then raise() due to some memory problem and out
>>> signal handler try to acquire the solar mutex ->deadlock
>>>
>> Eek, that's ugly. Then again, at the core is the OOM condition, which
>> needs solving independently. Per chance, is that happening on a box
>> with massive amounts of CPU threads?
> it is on the ci builder, so yeah 32 thread or so.
>
> but I disagree that it is _at the core_
>
> at the core this exhibit 2 things:
> 1/ we do a lot of thing that is verboten in a signal handler.
> 2/ taking a lock that rely on other thread to move forward while
> holding the solarmutex is begging for deadlock.
>
> Norbert

-- 
--
ALG (PGP Key: EE1C 4B3F E751 D8BC C485 DEC1 3C59 F953 D81C F4A2)
