<html>
<head>
<base href="https://bugs.documentfoundation.org/">
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - Rewrite old Pocket Word (PWI) file format import filter."
href="https://bugs.documentfoundation.org/show_bug.cgi?id=77278#c13">Comment # 13</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - Rewrite old Pocket Word (PWI) file format import filter."
href="https://bugs.documentfoundation.org/show_bug.cgi?id=77278">bug 77278</a>
from <span class="vcard"><a class="email" href="mailto:alonso@loria.fr" title="osnola <alonso@loria.fr>"> <span class="fn">osnola</span></a>
</span></b>
<pre>The old filter is quite basic: it reads the font names, then uses some
heuristics to retrieve paragraphs of text, recovering:
- the main character properties, with the notable exception of
superscript and subscript…
- the properties of the following paragraph: first-line indent, left/right margins,
left/center/.. alignment, and a flag indicating whether it is a bulleted list.
I have rewritten a « more robust » version of this code in libwps, but clearly,
there are many things that are not recovered (as I cannot guess
what they mean).
If you want to try it, I have updated the libwps version compiled with
emscripten: <a href="http://libwps.sourceforge.net/convertWPS.html">http://libwps.sourceforge.net/convertWPS.html</a> .
To improve this filter, it would be useful to have some Pocket Word files (and
their PDF equivalents) that:
- [character properties] use superscripts and subscripts,
- [paragraph properties] have paragraphs with single/double line
spacing/..., with a certain spacing before/after the paragraph, lines with
fixed height, different types of lists,
- [general] contain header(s), footer(s), footnote(s), endnote(s), comment(s),
image(s), table(s)…
and simple documents with different page sizes, different margins, some
metadata...
- …</pre>
</div>
</p>
</body>
</html>