[PATCH libreoffice-4-0] more regex fixes, fdo#60259 related

Eike Rathke (via Code Review) gerrit at gerrit.libreoffice.org
Tue Mar 12 03:53:04 PDT 2013


Hi,

I have submitted a patch for review:

    https://gerrit.libreoffice.org/2679

To pull it, you can do:

    git pull ssh://gerrit.libreoffice.org:29418/core refs/changes/79/2679/1

more regex fixes, fdo#60259 related

fdo#60259 prevent crash when searching backward for $ anchor regex

Old code wasn't prepared that searching for $ may actually return a
result set pointing behind the search string which it does with the ICU
regex engine.

(cherry picked from commit c00601dab0f5533b152cd63cec0a89bfec1ba95f)

Conflicts:
	i18npool/source/search/textsearch.cxx

regex: handle zero-length matches, fdo#60259 related

Also in backward search ignore all zero-length matches except the text
end single $ anchor search. The anchor search is a valid match, treat it
as such in Writer.

This still doesn't solve the backward $ backward search, the convoluted
Writer code in that place apparently never worked, someone more familiar
with those internals should straighten out the mess.

(cherry picked from commit 3bc5cb3c485d67f1ce0541d349d11637f52ebda5)

regex: don't loop 10000 identical matches to find a single $ anchor

(cherry picked from commit ccc349d3abb70ef38cd2b7706da51b060a385908)

make forward replacement of $ work again, fdo#60259 related

broken with 3bc5cb3c485d67f1ce0541d349d11637f52ebda5

(cherry picked from commit d8dcfa0e5dbecf77c4d6a8d49caf61b339cf9b43)

Change-Id: I6b5eb28d0a54ceecb6873a3c05f18f70312ab1a2
---
M i18npool/source/search/textsearch.cxx
M sw/source/core/crsr/findtxt.cxx
2 files changed, 79 insertions(+), 33 deletions(-)



diff --git a/i18npool/source/search/textsearch.cxx b/i18npool/source/search/textsearch.cxx
index dceb4d7..997b01d 100644
--- a/i18npool/source/search/textsearch.cxx
+++ b/i18npool/source/search/textsearch.cxx
@@ -124,14 +124,12 @@
     sSrchStr = aSrchPara.searchString;
 
     // use transliteration here
-    if ( xTranslit.is() &&
-     aSrchPara.transliterateFlags & SIMPLE_TRANS_MASK )
+    if ( xTranslit.is() && aSrchPara.transliterateFlags & SIMPLE_TRANS_MASK )
         sSrchStr = xTranslit->transliterateString2String(
-                aSrchPara.searchString, 0, aSrchPara.searchString.getLength());
+            aSrchPara.searchString, 0, aSrchPara.searchString.getLength());
 
-    if ( xTranslit2.is() &&
-     aSrchPara.transliterateFlags & COMPLEX_TRANS_MASK )
-    sSrchStr2 = xTranslit2->transliterateString2String(
+    if ( xTranslit2.is() && aSrchPara.transliterateFlags & COMPLEX_TRANS_MASK )
+        sSrchStr2 = xTranslit2->transliterateString2String(
             aSrchPara.searchString, 0, aSrchPara.searchString.getLength());
 
     // When start or end of search string is a complex script type, we need to
@@ -204,22 +202,32 @@
             newStartPos = FindPosInSeq_Impl( offset, startPos );
 
         if( endPos < searchStr.getLength() )
-        newEndPos = FindPosInSeq_Impl( offset, endPos );
+            newEndPos = FindPosInSeq_Impl( offset, endPos );
         else
             newEndPos = in_str.getLength();
 
         sres = (this->*fnForward)( in_str, newStartPos, newEndPos );
 
-        for ( int k = 0; k < sres.startOffset.getLength(); k++ )
+        // Map offsets back to untransliterated string.
+        const sal_Int32 nOffsets = offset.getLength();
+        if (nOffsets)
         {
-            if (sres.startOffset[k])
-          sres.startOffset[k] = offset[sres.startOffset[k]];
-            // JP 20.6.2001: end is ever exclusive and then don't return
-            //               the position of the next character - return the
-            //               next position behind the last found character!
-            //               "a b c" find "b" must return 2,3 and not 2,4!!!
-            if (sres.endOffset[k])
-          sres.endOffset[k] = offset[sres.endOffset[k]-1] + 1;
+            // For regex nGroups is the number of groups+1 with group 0 being
+            // the entire match.
+            const sal_Int32 nGroups = sres.startOffset.getLength();
+            for ( sal_Int32 k = 0; k < nGroups; k++ )
+            {
+                const sal_Int32 nStart = sres.startOffset[k];
+                if (nStart > 0)
+                    sres.startOffset[k] = (nStart < nOffsets ? offset[nStart] : (offset[nOffsets - 1] + 1));
+                // JP 20.6.2001: end is ever exclusive and then don't return
+                //               the position of the next character - return the
+                //               next position behind the last found character!
+                //               "a b c" find "b" must return 2,3 and not 2,4!!!
+                const sal_Int32 nStop = sres.endOffset[k];
+                if (nStop > 0)
+                    sres.endOffset[k] = offset[(nStop <= nOffsets ? nStop : nOffsets) - 1] + 1;
+            }
         }
     }
     else
@@ -291,24 +299,34 @@
         // JP 20.6.2001: also the start and end positions must be corrected!
         if( startPos < searchStr.getLength() )
             newStartPos = FindPosInSeq_Impl( offset, startPos );
-    else
-        newStartPos = in_str.getLength();
+        else
+            newStartPos = in_str.getLength();
 
         if( endPos )
-        newEndPos = FindPosInSeq_Impl( offset, endPos );
+            newEndPos = FindPosInSeq_Impl( offset, endPos );
 
         sres = (this->*fnBackward)( in_str, newStartPos, newEndPos );
 
-        for ( int k = 0; k < sres.startOffset.getLength(); k++ )
+        // Map offsets back to untransliterated string.
+        const sal_Int32 nOffsets = offset.getLength();
+        if (nOffsets)
         {
-            if (sres.startOffset[k])
-          sres.startOffset[k] = offset[sres.startOffset[k] - 1] + 1;
-            // JP 20.6.2001: end is ever exclusive and then don't return
-            //               the position of the next character - return the
-            //               next position behind the last found character!
-            //               "a b c" find "b" must return 2,3 and not 2,4!!!
-            if (sres.endOffset[k])
-          sres.endOffset[k] = offset[sres.endOffset[k]];
+            // For regex nGroups is the number of groups+1 with group 0 being
+            // the entire match.
+            const sal_Int32 nGroups = sres.startOffset.getLength();
+            for ( sal_Int32 k = 0; k < nGroups; k++ )
+            {
+                const sal_Int32 nStart = sres.startOffset[k];
+                if (nStart > 0)
+                    sres.startOffset[k] = offset[(nStart <= nOffsets ? nStart : nOffsets) - 1] + 1;
+                // JP 20.6.2001: end is ever exclusive and then don't return
+                //               the position of the next character - return the
+                //               next position behind the last found character!
+                //               "a b c" find "b" must return 2,3 and not 2,4!!!
+                const sal_Int32 nStop = sres.endOffset[k];
+                if (nStop > 0)
+                    sres.endOffset[k] = (nStop < nOffsets ? offset[nStop] : (offset[nOffsets - 1] + 1));
+            }
         }
     }
     else
@@ -734,6 +752,11 @@
         int nEndOfs = pRegexMatcher->end( nIcuErr);
         if( nStartOfs < nEndOfs)
             break;
+        // If the zero-length match is behind the string, do not match it again
+        // and again until startPos reaches there. A match behind the string is
+        // a "$" anchor.
+        if (nStartOfs == endPos)
+            break;
         // try at next position if there was a zero-length match
         if( ++startPos >= endPos)
             return aRet;
@@ -779,17 +802,35 @@
     // find the last match
     int nLastPos = 0;
     int nFoundEnd = 0;
+    int nGoodPos = 0, nGoodEnd = 0;
+    bool bFirst = true;
     do {
         nLastPos = pRegexMatcher->start( nIcuErr);
         nFoundEnd = pRegexMatcher->end( nIcuErr);
+        if (nLastPos < nFoundEnd)
+        {
+            // remember last non-zero-length match
+            nGoodPos = nLastPos;
+            nGoodEnd = nFoundEnd;
+        }
         if( nFoundEnd >= startPos)
             break;
+        bFirst = false;
         if( nFoundEnd == nLastPos)
             ++nFoundEnd;
     } while( pRegexMatcher->find( nFoundEnd, nIcuErr));
 
+    // Ignore all zero-length matches except "$" anchor on first match.
+    if (nGoodPos == nGoodEnd)
+    {
+        if (bFirst && nLastPos == startPos)
+            nGoodPos = nLastPos;
+        else
+            return aRet;
+    }
+
     // find last match again to get its details
-    pRegexMatcher->find( nLastPos, nIcuErr);
+    pRegexMatcher->find( nGoodPos, nIcuErr);
 
     // fill in the details of the last match
     const int nGroupCount = pRegexMatcher->groupCount();
diff --git a/sw/source/core/crsr/findtxt.cxx b/sw/source/core/crsr/findtxt.cxx
index 101b002..427c4fa 100644
--- a/sw/source/core/crsr/findtxt.cxx
+++ b/sw/source/core/crsr/findtxt.cxx
@@ -446,8 +446,9 @@
     }
 
     xub_StrLen nStringEnd = nEnd;
-    while ( (bSrchForward && nStart < nStringEnd) ||
-            (! bSrchForward && nStart > nStringEnd) )
+    bool bZeroMatch = false;    // zero-length match, i.e. only $ anchor as regex
+    while ( ((bSrchForward && nStart < nStringEnd) ||
+            (! bSrchForward && nStart > nStringEnd)) && !bZeroMatch )
     {
         // SearchAlgorithms_APPROXIMATE works on a per word base so we have to
         // provide the text searcher with the correct locale, because it uses
@@ -475,7 +476,8 @@
         }
 
         if( nSearchScript == nCurrScript &&
-            (rSTxt.*fnMove->fnSearch)( sCleanStr, &nStart, &nEnd, 0 ))
+                (rSTxt.*fnMove->fnSearch)( sCleanStr, &nStart, &nEnd, 0 ) &&
+                !(bZeroMatch = (nStart == nEnd)))
         {
             // set section correctly
             *GetPoint() = *pPam->GetPoint();
@@ -518,11 +520,14 @@
 
     if ( bFound )
         return true;
-    else if( ( bChkEmptyPara && !nStart && !nTxtLen ) || bChkParaEnd )
+    else if( ( bChkEmptyPara && !nStart && !nTxtLen ) || bChkParaEnd)
     {
         *GetPoint() = *pPam->GetPoint();
         GetPoint()->nContent = bChkParaEnd ? nTxtLen : 0;
         SetMark();
+        /* FIXME: this condition does not work for !bSrchForward backward
+         * search, it probably never did. (pSttNd != &rNdIdx.GetNode())
+         * is never true in this case. */
         if( (bSrchForward || pSttNd != &rNdIdx.GetNode()) &&
             Move( fnMoveForward, fnGoCntnt ) &&
             (!bSrchForward || pSttNd != &GetPoint()->nNode.GetNode()) &&

-- 
To view, visit https://gerrit.libreoffice.org/2679
To unsubscribe, visit https://gerrit.libreoffice.org/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6b5eb28d0a54ceecb6873a3c05f18f70312ab1a2
Gerrit-PatchSet: 1
Gerrit-Project: core
Gerrit-Branch: libreoffice-4-0
Gerrit-Owner: Eike Rathke <erack at redhat.com>



More information about the LibreOffice mailing list