[Bug 165931] Regular expressions must be able to match non-break line endings
bugzilla-daemon at bugs.documentfoundation.org
Fri Mar 28 16:37:22 UTC 2025
https://bugs.documentfoundation.org/show_bug.cgi?id=165931
László Németh <nemeth at numbertext.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|UNCONFIRMED |RESOLVED
Resolution|--- |WORKSFORME
--- Comment #5 from László Németh <nemeth at numbertext.org> ---
The regex library searches a plain text conversion of the document, in which
each text line contains the full text of one paragraph (i.e. one paragraph per
line). Regex search always needs this plain text conversion (back and forth),
and plain text has only a single \n to mark a line end. (Likewise, in plain
text editors you cannot search for a paragraph end without adding some extra
syntax or heuristic; similarly, Writer's plain text import uses a heuristic
that recognizes shorter lines as paragraph boundaries.)
Fortunately, there are possible solutions or workarounds: 1) Easy command line,
2) Macro + Find & Replace, 3) Macro only (a first step towards add-on
development)
== 1) Easy command line ==
1) Export your document to PDF.
2) Grep the plain text content of the PDF, showing the matching lines, on the
Linux/macOS/Cygwin command line:
$ less document.pdf | grep '” *$'
Note: When I did some research for hyphenation development
(https://numbertext.org/typography/automatikus_magyar_elv%C3%A1laszt%C3%A1s_a_LibreOffice-ban.pdf),
I used this approach: I generated hundreds of documents with pyUNO, and the
basic Linux tool "less" immediately converted the PDFs to plain text documents
with the required line breaks.
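The same filter can also be sketched in Python, which may be handy when the documents are already generated with pyUNO. This is only a minimal sketch, assuming the PDF has already been converted to plain text by some external tool; the function name `lines_ending_with_quote` and the sample text are hypothetical.

```python
import re

# Same pattern as in the grep call above: a closing quotation mark,
# optionally followed by spaces, at the end of a line.
LINE_END_QUOTE = re.compile(r"” *$")

def lines_ending_with_quote(plain_text):
    """Return the lines of plain_text that end with a closing quotation mark."""
    return [line for line in plain_text.splitlines()
            if LINE_END_QUOTE.search(line)]

# Hypothetical sample standing in for the text extracted from the PDF.
sample = "He said “yes” \nand then left.\n“No,” she replied, “never”\n"

for line in lines_ending_with_quote(sample):
    print(line)
```

Only the first and the third sample line are printed, because the second one does not end with the quotation mark.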
== 2) Macro + Find & Replace ==
1. Mark line ends with neutral Unicode characters using UNO, e.g. with
zero-width joiner (the best choice depends on your text).
2. Apply Find & Replace with regex pattern matching, e.g. "\w+\W?\u200d", to
select the last word of each line (with an optional punctuation mark) using
Find All.
3. Format the selected words, e.g. underline them (other formatting, e.g.
applying bold, would change the following line ends, so sometimes it's better
to use a macro-only solution).
4. Remove the neutral Unicode characters using Find & Replace.
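To see what the pattern in step 2 selects, here is a small illustration in Python. It is a sketch under assumptions: the marked string is a hypothetical paragraph, and Python's `re` module is not the regex engine used by Find & Replace, so details such as the exact meaning of `\w` may differ.

```python
import re

ZWJ = "\u200d"  # zero-width joiner used as the line-end mark

# Hypothetical paragraph text: one paragraph per text line, with a ZWJ
# inserted at the position where each layout line ends.
marked = "The quick brown fox" + ZWJ + " jumps over the lazy dog." + ZWJ

# Pattern from step 2: a word, an optional non-word character, then the mark.
last_words = re.findall(r"\w+\W?\u200d", marked)
print([word.rstrip(ZWJ) for word in last_words])

# Last step: removing the marks restores the original text.
cleaned = marked.replace(ZWJ, "")
print(cleaned)
```

Each match includes the trailing mark, which is why the last step of the list above is needed after formatting.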
For example, the Basic code for inserting ZWJ (U+200d):
'''''''''''''
Sub RunArg(command, args)
    ' dispatch a UNO command with arguments to the current frame
    dim document as object
    dim dispatcher as object
    document = ThisComponent.CurrentController.Frame
    dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
    dispatcher.executeDispatch(document, command, "", 0, args)
End Sub

Sub Run(command)
    ' dispatch a UNO command without arguments
    RunArg(command, Array())
End Sub

Sub HardBreak()
    dim args1(1) as new com.sun.star.beans.PropertyValue
    cursor = ThisComponent.CurrentController.getViewCursor()
    Run(".uno:Escape")
    Run(".uno:GoToEndOfDoc")
    Do
        ' insert ZWJ (zero-width joiner, U+200D) character at the end of the line
        Run(".uno:GoToEndOfLine")
        args1(0).Name = "Text"
        ' ZWJ given by its code point, because it is invisible in a string literal
        args1(0).Value = Chr(&H200D)
        RunArg(".uno:InsertText", args1)
        ' go to the previous line
        Run(".uno:GoLeft")
        Run(".uno:GoToStartOfLine")
        origStart = cursor.Start
        Run(".uno:GoUp")
    ' loop until the cursor position doesn't change any more
    Loop Until cursor.Text.compareRegionStarts(origStart, cursor.Start) = 0
End Sub
''''''''''''''''''''
Note: it seems that ZWJ can modify hyphenation (maybe a bug); see the attached
screenshot.
== 3) Macro-only ==
When the regex replacement changes the line breaking (and so the line ends),
it's better to use a macro-only solution, e.g. extending the previous macro to
do everything automatically. For example, selecting the document line by line
using UNO dispatcher calls:
Run(".uno:GoToEndOfLine")
Run(".uno:StartOfLineSel")
and calling Find & Replace with Search In Selection:
Sub SearchInSelection(regex)
dim args1(22) as new com.sun.star.beans.PropertyValue
args1(0).Name = "SearchItem.StyleFamily"
args1(0).Value = 2
args1(1).Name = "SearchItem.CellType"
args1(1).Value = 0
args1(2).Name = "SearchItem.RowDirection"
args1(2).Value = true
args1(3).Name = "SearchItem.AllTables"
args1(3).Value = false
args1(4).Name = "SearchItem.SearchFiltered"
args1(4).Value = false
args1(5).Name = "SearchItem.Backward"
args1(5).Value = false
args1(6).Name = "SearchItem.Pattern"
args1(6).Value = false
args1(7).Name = "SearchItem.Content"
args1(7).Value = false
args1(8).Name = "SearchItem.AsianOptions"
args1(8).Value = false
args1(9).Name = "SearchItem.AlgorithmType"
args1(9).Value = 1
args1(10).Name = "SearchItem.SearchFlags"
args1(10).Value = 71680 ' code for search in selection
args1(11).Name = "SearchItem.SearchString"
args1(11).Value = regex
args1(12).Name = "SearchItem.ReplaceString"
args1(12).Value = ""
args1(13).Name = "SearchItem.Locale"
args1(13).Value = 255
args1(14).Name = "SearchItem.ChangedChars"
args1(14).Value = 2
args1(15).Name = "SearchItem.DeletedChars"
args1(15).Value = 2
args1(16).Name = "SearchItem.InsertedChars"
args1(16).Value = 2
args1(17).Name = "SearchItem.TransliterateFlags"
args1(17).Value = 1073743104
args1(18).Name = "SearchItem.Command"
args1(18).Value = 1
args1(19).Name = "SearchItem.SearchFormatted"
args1(19).Value = false
args1(20).Name = "SearchItem.AlgorithmType2"
args1(20).Value = 2
args1(21).Name = "Quiet"
args1(21).Value = true
args1(22).Name = "SynchronMode"
args1(22).Value = true
RunArg(".uno:ExecuteSearch", args1())
end sub
(Note the SynchronMode argument: it is there so that the text lines of the
document are updated and the next line is selected correctly.)
Note: adding the ZWJ or another mark to the line ends is still needed.
So this works for me (especially because regex is already a feature for
advanced users), but if you think otherwise, please file an enhancement request
or reopen this issue. Maybe it's worth adding a complete macro-only solution.