HTML import filter very (too) basic

Kohei Yoshida libreoffice at kohei.us
Sat Jan 12 22:54:01 UTC 2019


> On January 12, 2019 at 2:56 PM Jens Tröger <jens.troeger at light-speed.de> wrote:
> 
> 
> Thank you Kohei for the information!
> 
> Noel mentioned that much of Writer’s current implementation covers the conversion of HTML into Writer’s DOM. How complete is that step?

I can't really give you a definitive answer since I've never worked on Writer's HTML import code. That being said, I'm pretty confident that the original code was written almost entirely by the Star Division developers back in the StarOffice days (prior to the year 2000), and that it hasn't been improved much since then, aside from occasional corner-case fixes here and there.

> It sounds like an iterative approach to me: once the DOM tree has been replicated, the CSS needs to be cascaded accordingly—and that step seems to be rather incomplete?

I would imagine so, or the support for CSS has not been updated since the original code was written more than 18 years ago.

> Again, I haven’t had any time to noodle through the Writer code so I am just trying to put together a picture based on this discussion…

I hope my input helps a bit on that front.

If you need some code pointers, look for the class named SwHTMLParser, whose header is found in sw/source/filter/html/swhtml.hxx. If you look through its data members, you'll notice m_pCSS1Parser, which is of type SwCSS1Parser. If you check its call sites, you'll get a pretty good idea of how Writer's HTML import code handles CSS input. Much of the relevant code is found in sw/source/filter/html, and looking through the files there, the one named htmlcss1.cxx looks pretty suspicious to me. Maybe you can start sniffing from that file and find your way around...
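
For what it's worth, here is a rough sketch of one way to list those call sites without an IDE. It is only an illustration: the helper name find_identifier is made up for this example, and it assumes you run it from the top of a LibreOffice core checkout so that the relative path sw/source/filter/html exists. It does a plain whole-word text search, so it will also match declarations and comments, not just real call sites.

```python
import re
from pathlib import Path


def find_identifier(root, identifier, suffixes=(".cxx", ".hxx")):
    """Yield (path, line_number, stripped_line) for whole-word matches.

    Hypothetical helper for illustration; this is a text search,
    not a real C++ parser.
    """
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    for path in sorted(Path(root).rglob("*")):
        # Only look at regular files with C++ source/header suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line.strip()


if __name__ == "__main__":
    # Assumed location relative to the checkout root; adjust as needed.
    root = Path("sw/source/filter/html")
    if root.is_dir():
        for path, number, line in find_identifier(root, "m_pCSS1Parser"):
            print(f"{path}:{number}: {line}")
    else:
        print(f"{root} not found; run this from the checkout root")
```

A plain git grep for the member name from the checkout root would do much the same job; the script is just easier to tweak if you want to filter or group the results.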

HTH,

Kohei

