help needed: hsqldb shutdown race condition, how to fix without deadlock

Michael Stahl mstahl at redhat.com
Wed Jun 3 14:08:36 PDT 2015


On 03.06.2015 17:31, Lionel Elie Mamane wrote:
> On Wed, Jun 03, 2015 at 03:01:39PM +0200, Michael Stahl wrote:
>> On 03.06.2015 14:47, Lionel Elie Mamane wrote:
> 
>>> Since the problem is essentially that the two threads take the same mutexes
>>> in different order, here is a dirty hack that forces the hsqldb thread
>>> to take the "affine bridge" mutex before taking the "HSQL driver"
>>> mutex.
> 
>> i'm not entirely sure this is sensible...
> 
> Thinking the issue from yet another angle, if we go back to the
> situation at the start of this thread (hsqldb is *not* affine). I
> don't understand why the
> connectivity::hsqldb::ODriverDelegator::disposing thread holds the
> affine lock, and goes through the affine bridge:
> 
> #6  0x00002b4ff40b8342 in connectivity::hsqldb::ODriverDelegator::disposing (this=0x1ca1cf0, Source=...)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/connectivity/source/drivers/hsqldb/HDriver.cxx:566
> (...)
> #13 0x00002b4ff45f5f4b in AffineBridge::outerDispatch (this=this at entry=0x1ca77b0, loop=loop at entry=1)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/cppu/source/AffineBridge/AffineBridge.cxx:201
> #14 0x00002b4ff45f6186 in AffineBridge::v_callInto_v (this=0x1ca77b0, pCallee=0x2b4fd23f9dda <s_pull(va_list*)>,
>     pParam=0x2b4fecb5d3c0)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/cppu/source/AffineBridge/AffineBridge.cxx:270
> (...)
> #26 0x00002b4ff47fdccc in Proxy::dispatch (this=this at entry=0x1cb3ac0,
>     pReturnTypeRef=pReturnTypeRef at entry=0x8ce880, pParams=pParams at entry=0x20789e0, nParams=nParams at entry=0,
>     pMemberType=pMemberType at entry=0x236b4c0, pReturn=pReturn at entry=0x2b4fecb5dd90, pArgs=0x2b4fecb5d9a0,
>     ppException=0x2b4fecb5da30)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/cppu/source/helper/purpenv/helper_purpenv_Proxy.cxx:445
> #27 0x00002b4ff47fcc7d in s_Proxy_dispatch (pUnoI=0x1cb3ac0, pMemberType=0x236b4c0, pReturn=0x2b4fecb5dd90,
>     pArgs=0x2b4fecb5d9a0, ppException=0x2b4fecb5da30)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/cppu/source/helper/purpenv/helper_purpenv_Proxy.cxx:170
> (...)
> #32 0x00002b4fe831ac12 in privateSnippetExecutor ()
>    from /home/master/src/libreoffice/workdirs/libreoffice-5-1/instdir/program/libgcc3_uno.so
> #33 0x00002b4fecd6e866 in dbaccess::OConnection::disposing (this=0x2058670)
>     at /home/master/src/libreoffice/workdirs/libreoffice-5-1/dbaccess/source/core/dataaccess/connection.cxx:492
> 
> 
> Why does the call from dbaccess::OConnection::disposing into
> connectivity::hsqldb::ODriverDelegator::disposing go through the
> affine bridge?

because OConnection is outside the affine environment, so it has to go
through the bridge to call a component inside the affine environment.

however the ODriverDelegator isn't inside the affine environment, what
happens is that while AffineBridge::v_callInto_v() waits for the call on
the "inner" thread to return, it assigns the current (calling) thread as
the "outer" thread and in AffineBridge::outerDispatch() it waits for
outgoing calls from the "inner" thread.

(this is something i mis-remembered: i wrote that outgoing calls happen
on a thread different from the calling one but evidently that is not
always the case; a new "outer" thread may be launched but only if no
"outer" thread exists.)

the "inner" thread executes only calls against the affine component; if
the affine component does a UNO call to another component, the "inner"
thread stores some closure in the AffineBridge instance and signals the
"outer" thread, which will then do the call and once it's done signal
back to the "inner" thread.

because the thread locked the AffineBridge mutex, which is a recursive
mutex, it can call into the affine component again, so you can have
re-entrant calls in the affine component, but no other thread can call
into it.

>> a component is (in general) responsible for its own thread safety, and
>> should not make assumptions about how other components are implemented,
>> in particular wrt. locking.
> 
> The hsqldb module is in great part a wrapper around the jdbc module,
> so I'm not too worried about hsqldb "knowing too much" about the jdbc
> module.
> 
> Alternatively, is there any way to force the hsqldb module to run in
> the same thread as the jdbc module?

that depends on how many AffineBridges there are...

>> this implies that if a component takes a lock, it must release the
>> lock before calling a method that could end up calling into a
>> different component, to avoid deadlock.
> 
> This sounds very restrictive. Essentially, you are saying that a
> method from class A in component CA cannot protect itself from being
> called "reentrantively" (or another method being called on the same
> object) if it has to call a method of an object B from component CB
> that is a member of class A:

yes, it is quite non-intuitive and likely violated in thousands of
places currently, even if we ignore the SolarMutex, where fixing this is
essentially impossible.

> class A
> {
>     private:
> 
>     mutex m_aMutex;
>     B m_aB0;
>     B m_aB1;
> 
>     public:
> 
>     void foo()
>     {
>        MutexGuard(m_aMutex);
>        m_aB1 = m_aB0;
>        m_aB0.non_const_method();
>     }
> 
>     void bar()
>     {
>        MutexGuard(m_aMutex);
>        m_aB0 = m_aB1;
>        m_aB1.other_non_const_method();
>     }
> }
> 
> How do you suggest this is made thread-safe without holding m_aMutex?
> foo() and bar() need to be mutually exclusive, else the results can
> differ. Even assuming B::operator=, B::other_non_const_method and
> B::non_const_method are thread-safe, one could have interleaved
> execution of foo() and bar():

usually what you need to do is copy member variables to local variables
on the stack

  void foo()
  {
    B tmp;
    {
       MutexGuard g;
       m_aB0 = m_aB1;
       tmp = m_aB0;
    }
    tmp.non_const_method();
  }

of course if the assignment and call must be atomic you're in trouble :-/

if you have parameters that are components you may need to do some silly
things too, e.g. look at commit 7db6b2f7968d8063b12312737136f09cb7549805
which fixed a reported deadlock.

>>  there may be multiple re-entrant calls into the apartment, where
>> the affine component calls out into a different component (which is
>> bridged to a thread that is different from both the original calling
>> thread and the affine thread), which calls back agian into the
>> affine component, so a non-recursive mutex could deadlock but a
>> recursive mutex can not.
> 
> I don't think that is possible, since the affine bridge will allow
> only one call (which is the issue at hand here). As long as that call
> is not finished, the affine bridge will block:
> 
> void AffineBridge::v_callInto_v(uno_EnvCallee * pCallee, va_list * pParam)
> {
>     osl::MutexGuard guard(m_outerMutex); // only one thread at a time can call into
>     (...)
> }

see discussion of "outer-thread" / "inner-thread" ping-pong above.

> I understand there is only one AffineBridge?

not sure about that... it could be per-component-instance or per-factory
(i,e, per-library usually) or per-environment-type?

hmm... how about we solve the JDBC performance problem by, the first
time a Java component is created, the component-factory stuff posts a
message to the main thread via PostUserEvent with a handler that does a
"dummy" JNI call into some Java function that calls back via JNI into
another function that calls Application::Yield again; now JNI calls will
never *enter* the JVM on the main thread because it's always *in* the
JVM already - just needs cleaning up on quit, what could possibly go
wrong...




More information about the LibreOffice mailing list