[Libreoffice] Unexpected failures (eg. segfaults) using PyUNO and LibreOffice/OpenOffice

Fri Oct 21 02:42:11 PDT 2011

On 19/10/11 10:17, Dag Wieers wrote:
> Hi,
>
> During the course of the LibreOffice conference in Paris, we (the
> unoconv and cloudooo projects) found that some of the issues our users
> were having while doing document conversions using PyUNO and OpenOffice
> and LibreOffice were not related to our own project, but have a
> root-cause in either PyUNO or LibreOffice/OpenOffice.
>
> The results of these issues are various and individual:
>
> - segfaults
> - various error codes
> - PyUNO crashes
> - memory leaks
> - xslt problems
>
> And while some of them are reproducible (and consistent), others are
> not, which makes me believe they are related to internal state or timing
> issues of LibreOffice/OpenOffice or related to import/export filters.
>
> Since these issues are very common and can be triggered very quickly, we
> would like to have developers look at them to see what is the cause and
> how we can fix them.

it is well known that the threading implementation in the OOo 
applications is rather unreliable.

currently for thread safety the implementers of UNO APIs are required to 
explicitly use low-level synchronization primitives such as mutexes.

not doing it correctly (such as locking a mutex while it should not be 
locked, or forgetting to lock a mutex while it should be locked) leads to 
very subtle problems that do not show up during ordinary office use, and 
are extremely difficult to reproduce.
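
to illustrate the pattern, here is a minimal sketch in python (not actual UNO code; GuardedCounter and hammer are hypothetical names made up for this example).  every method that touches shared state has to remember to take the mutex itself:

```python
import threading


class GuardedCounter:
    """hypothetical stand-in for an API object that holds shared state."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._value = 0

    def increment(self):
        # the implementer must take the mutex explicitly; dropping this
        # "with" turns the += into an unguarded read-modify-write
        with self._mutex:
            self._value += 1

    def value(self):
        with self._mutex:
            return self._value


def hammer(counter, n_threads=4, n_calls=10000):
    """call increment() from several threads at once, return the final value."""

    def work():
        for _ in range(n_calls):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value()
```

with the lock in place hammer() returns n_threads * n_calls.  if a single method forgets the lock, updates may be lost only occasionally, depending on timing, which is why this class of bug rarely shows up in ordinary use.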

basically the only way for developers to find these issues is via the 
subsequenttests, which currently are mostly implemented in Java and 
connect to the OOo instance via a UNO remote bridge.

and the only issues that are half-way easy to debug are deadlocks; in 
case of missing locks you may get a memory corruption _somewhere_ which 
causes some later test to crash, but it is very difficult to track down 
the root cause.

also, most of the developers who work on the applications are not 
experts in multi-threading issues (those who are tend to work on the 
lower-level layers like the URE).  for example i discovered once that in 
Writer almost all destructors of UNO objects do not lock a mutex but 
then call into the Writer core (have partially fixed this for OOo 3.3).

so as a result of all of this driving OOo/LO via remote bridges is 
rather unreliable.

some have suggested the best way out of this is to find a way so that 
implementers of UNO APIs do not have to care about thread safety 
themselves, but instead there should be a framework that does it 
automatically.  such a framework has actually existed for many years now (Kay 
Ramme's "UNO threading framework"), but most of OOo/LO does not make use 
of it (iirc it is used for only some database drivers).
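
the general idea can be sketched in python as below.  this is only an illustration of "the framework takes the lock, not the implementer"; it is not how the UNO threading framework works internally, and synchronized, _locked and TextDocument are hypothetical names:

```python
import functools
import threading


def _locked(method):
    """wrap one method so that it runs while holding the instance lock."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._sync_lock:
            return method(self, *args, **kwargs)

    return wrapper


def synchronized(cls):
    """class decorator: guard every public method with one per-instance lock."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def init(self, *args, **kwargs):
        # reentrant, so guarded methods may call each other
        self._sync_lock = threading.RLock()
        original_init(self, *args, **kwargs)

    cls.__init__ = init
    for name, attr in list(vars(cls).items()):
        if callable(attr) and not name.startswith("_"):
            setattr(cls, name, _locked(attr))
    return cls


@synchronized
class TextDocument:
    """the implementer writes plain code with no locking at all."""

    def __init__(self):
        self.paragraphs = []

    def append(self, text):
        self.paragraphs.append(text)

    def count(self):
        return len(self.paragraphs)
```

the point of the design is that nobody writing TextDocument can forget a lock, because they never write one.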

of course there may also be problems in PyUNO on top of that; back at 
Sun we had nothing that depended on PyUNO so i guess nobody spent much 
time debugging it...

> The cloudooo project has tested about 100.000 conversions and
> implemented some techniques to overcome the issues by monitoring the
> libreoffice process for memory leaks and 'endless loops', and retrying
> on failure. In the end this brought the failure rate down from about 10%
> to 1.1%.
> (http://git.erp5.org/gitweb/cloudooo.git)

yes, there are various ways to minimize the risk of failure, no doubt 
you are already doing most of these:
- monitor the OOo instance and restart it
- only connect to an OOo instance from a single thread (should result in 
fewer problems, but e.g. with a JVM you still effectively get multiple 
connections, don't know about PyUNO)
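
both points can be combined in a small wrapper, sketched here in python under some assumptions: convert and restart are hypothetical callables supplied by the caller (one performs a single conversion against the office instance, the other kills and relaunches that instance), and SerializedOffice is a made-up name.  all requests go through one worker thread, and a failed conversion triggers a restart and a retry:

```python
import queue
import threading


class SerializedOffice:
    """funnel all conversions through a single worker thread, with retries."""

    def __init__(self, convert, restart, max_attempts=3):
        self._convert = convert  # hypothetical: performs one conversion
        self._restart = restart  # hypothetical: relaunches the office process
        self._max_attempts = max_attempts
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        # only this thread ever talks to the office instance
        while True:
            job = self._jobs.get()
            if job is None:
                break
            doc, result = job
            result.put(self._attempt(doc))

    def _attempt(self, doc):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                return ("ok", self._convert(doc))
            except Exception as exc:
                # assume the instance is in a bad state; start a fresh one
                last_error = exc
                self._restart()
        return ("failed", last_error)

    def submit(self, doc):
        """may be called from any thread; blocks until the job is finished."""
        result = queue.Queue(maxsize=1)
        self._jobs.put((doc, result))
        return result.get()

    def close(self):
        self._jobs.put(None)
        self._worker.join()
```

this does not detect hangs or memory growth; for that a separate watchdog on the office process would still be needed.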

> Both the cloudooo and unoconv presentations will become available and
> contain some information on both projects and the PyUNO/LO unreliabilities.

> Below is some
> example failure output from a single run, LibreOffice does seem a bit
> more stable than OpenOffice though.

there are a lot of XSLT errors; LO (at least in 3.4) ships a different 
XSLT implementation, perhaps that has helped...

regards,
  michael