Can ODT internals be tidier?

Jono bitrat at fastmail.fm
Mon Sep 13 21:19:07 UTC 2021


I'm currently using LibreOffice for word processing, but I also use markdown, LaTeX and Scribus for related tasks.

As a former WordPerfect 5.1 hacker who continues to mourn the emergence of WYSIWYG, Windows and Microsoft Word, I found myself examining the internals of an .odt file.  

https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.pdf

I wondered whether it would be possible to reduce the blizzard of redundant tags and other noise in these files (see the extract below), something that most basic markup editors (e.g. for Markdown) accomplish.

In particular, could the document editor elide empty tags, such as <text:p text:style-name="P10"/>?  Also, would it be possible to normalise and merge contiguous text elements that have the same style?
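
To make the merging idea concrete, here is a minimal sketch in Python, assuming the standard ODF text namespace and an already-parsed content.xml tree.  The function name merge_adjacent_spans is my own invention, not anything LibreOffice provides.  It only joins neighbouring <text:span> elements that carry the same style name and have no text between them.  It deliberately leaves empty <text:p/> elements alone, since an empty paragraph shows up as a blank line in the document and removing it would change the layout.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"
STYLE_NAME = f"{{{TEXT_NS}}}style-name"


def merge_adjacent_spans(parent):
    """Merge neighbouring text:span children that share a style name.

    Hypothetical helper: spans are only merged when nothing (not even
    whitespace) sits between them, so the visible text is unchanged.
    """
    previous = None
    for child in list(parent):
        merge_adjacent_spans(child)  # tidy nested elements first
        mergeable = (
            previous is not None
            and previous.tag == SPAN
            and child.tag == SPAN
            and not previous.tail
            and previous.get(STYLE_NAME) == child.get(STYLE_NAME)
        )
        if not mergeable:
            previous = child
            continue
        # Append the second span's leading text to the end of the first.
        if len(previous):
            last = previous[-1]
            last.tail = (last.tail or "") + (child.text or "")
        else:
            previous.text = (previous.text or "") + (child.text or "")
        previous.extend(list(child))  # move any nested elements across
        previous.tail = child.tail    # keep the text that followed
        parent.remove(child)
```

Something like this would be run over the root element of content.xml before writing the file back into the .odt archive.  I have not tested it against a real document.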

I've considered writing an XSLT to accomplish some of this, and maybe other XML tools for navigating the .odt format.
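
As an alternative to XSLT, a rough Python sketch for navigating the format, assuming only that an .odt file is a zip archive with a content.xml member.  The function name summarise_paragraphs is hypothetical.  It counts how often each paragraph style name is used and how many paragraphs are empty, which would at least put a number on the noise.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH = f"{{{TEXT_NS}}}p"
STYLE_NAME = f"{{{TEXT_NS}}}style-name"


def summarise_paragraphs(odt):
    """Return (style usage counts, number of empty paragraphs).

    `odt` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(odt) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    usage = Counter()
    empty = 0
    for paragraph in root.iter(PARAGRAPH):
        usage[paragraph.get(STYLE_NAME)] += 1
        has_text = bool((paragraph.text or "").strip())
        # "Empty" here means no child elements and no text at all.
        if len(paragraph) == 0 and not has_text:
            empty += 1
    return usage, empty
```

Calling summarise_paragraphs("letter.odt") (a made-up file name) would return a Counter keyed by names such as "P10", plus the empty-paragraph count.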

I suspect stalling in the editor and incomprehensible formatting glitches (seemingly impervious to the "Clear Direct Formatting" command) are artifacts of the complexity of this bloated document model.

I admit I'm undisciplined in my use of styles, but I feel this should not be a barrier to using LibreOffice.  I imagine some of the redundancy is related to the presence of "Undo" stacks, etc., and there may already be ways to accomplish some of these goals (such as "Save As"?), but I'd appreciate any advice.

I'm also interested in methods to manage font definitions so that analogous fonts aren't included in documents by accident.  This could include some user intervention.  I suspect copy/paste with styling from other sources (e.g. browsers) is the source of many of these issues.
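
A first step might be simply listing what a document declares.  The sketch below assumes fonts are declared as <style:font-face> elements (with style:name and svg:font-family attributes) in content.xml and styles.xml, which is my reading of the schema.  The function name font_declarations is hypothetical.  It groups the declared names by font family, so a family that maps to several names is a candidate for the kind of duplication I mean.

```python
import zipfile
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
SVG_NS = "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"
FONT_FACE = f"{{{STYLE_NS}}}font-face"
FACE_NAME = f"{{{STYLE_NS}}}name"
FACE_FAMILY = f"{{{SVG_NS}}}font-family"


def font_declarations(odt):
    """Map each declared font family to the set of names declared for it.

    `odt` may be a file path or a binary file-like object.
    """
    families = {}
    with zipfile.ZipFile(odt) as archive:
        members = set(archive.namelist())
        for member in ("content.xml", "styles.xml"):
            if member not in members:
                continue
            root = ET.fromstring(archive.read(member))
            for face in root.iter(FONT_FACE):
                # Family values are sometimes wrapped in quotes; strip them.
                family = (face.get(FACE_FAMILY) or "").strip("'\"")
                families.setdefault(family, set()).add(face.get(FACE_NAME))
    return families
```

This only reports declarations.  Deciding which ones are safe to drop would still need the user intervention mentioned above.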

Finally, is there any documentation describing the indirect style scheme used in the content/style models, such as the 'P2' in <text:p text:style-name="P2"/>?
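
From what I can see in the file itself, names like 'P2' and 'T52' are defined as <style:style> elements under <office:automatic-styles> near the top of content.xml, each pointing at a parent style and carrying its own formatting properties.  A rough sketch that lists them, assuming that layout and the standard ODF namespaces (the function name automatic_styles is my own):

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def automatic_styles(content_xml):
    """Map each automatic style name to its family, parent and properties.

    `content_xml` is the text (or bytes) of content.xml.
    """
    root = ET.fromstring(content_xml)
    container = root.find(f"{{{OFFICE_NS}}}automatic-styles")
    styles = {}
    if container is None:
        return styles
    for style in container.findall(f"{{{STYLE_NS}}}style"):
        properties = {}
        # Child elements such as style:text-properties hold the formatting.
        for child in style:
            for key, value in child.attrib.items():
                properties[key.rsplit("}", 1)[-1]] = value  # drop namespace
        styles[style.get(f"{{{STYLE_NS}}}name")] = {
            "family": style.get(f"{{{STYLE_NS}}}family"),
            "parent": style.get(f"{{{STYLE_NS}}}parent-style-name"),
            "properties": properties,
        }
    return styles
```

Looking up automatic_styles(xml)["P2"] would then show which named style 'P2' derives from and what direct formatting it adds, though I'd still like to find proper documentation of the scheme.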

I guess what I'm after is a way to directly manage fonts and styles that defaults to an empty set and is then parsimonious in the creation and application of either.  It would be nice to have a way to manage styles using a configuration script, without manual interaction with the "Manage Styles" dialogue.

There may be pythonic solutions to these issues, and that's something I haven't explored.  I'm currently running LibreOffice as an AppImage on Ubuntu (I despise snap) and I'm not sure how to use the LibreOffice python interpreter in that setup.

Feedback on any of this would be appreciated!

Cheers,
Jono


Example fragment from content.xml:
====================================================
<text:p text:style-name="P119">
Gzxt Hbnse
<text:span text:style-name="T52">s</text:span>
</text:p>
<text:p text:style-name="P2"/>
<text:p text:style-name="P10">
_________________________________
<text:span text:style-name="T75">title</text:span>
</text:p>
<text:p text:style-name="P10"/>
<text:p text:style-name="P10"/>
<text:p text:style-name="P103"/>
<text:p text:style-name="P44">
Jklkh jghj pljkweing with vbv hbnses.
<text:s/>
The
<text:span text:style-name="T131">assa </text:span>
dfd jghj hjhj.
</text:p>
<text:p text:style-name="P9"/>
<text:p text:style-name="P101">
<text:soft-page-break/>
====================================================


