[Poppler-bugs] [Bug 107235] New: Bug fixes, emit more font info in pdftohtml

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Jul 15 16:27:39 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=107235

            Bug ID: 107235
           Summary: Bug fixes, emit more font info in pdftohtml
           Product: poppler
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: normal
          Priority: medium
         Component: pdftohtml
          Assignee: poppler-bugs at lists.freedesktop.org
          Reporter: ulatekh at yahoo.com

Created attachment 140641
  --> https://bugs.freedesktop.org/attachment.cgi?id=140641&action=edit
Fix possible uninitialized variable & dangling reference in HtmlFont

I'm about to use pdftohtml to extract information from PDFs and organize the
results into a database, so I had a chance to dig through the code.

While reading the code, I noticed a possibly uninitialized variable and a
possible dangling reference in HtmlFont. The first patch fixes both.

I've had a long-standing problem with qpdfview (which uses poppler) sometimes
copying text out of PDFs incorrectly -- the text copies, but all of the spaces
are missing. After reproducing it with a PDF, I tracked the problem down to the
PDF using tabs where it probably should have used spaces. The second patch
fixes HtmlFont::HtmlFilter() to convert incoming tabs to spaces, instead of
removing the whitespace completely.
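
To illustrate the idea behind the second patch, here is a minimal standalone
sketch. It is not the code from the attachment and it does not use poppler's
types: the function name tabsToSpaces and the choice to collapse runs of
blanks into one space are assumptions made for this example only.

```cpp
#include <string>

// Hypothetical helper (not part of poppler): treat each tab as a space
// instead of dropping it, and collapse consecutive blanks into one space.
std::string tabsToSpaces(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '\t') {
            c = ' '; // keep the word boundary rather than removing it
        }
        if (c == ' ' && !out.empty() && out.back() == ' ') {
            continue; // skip repeated blanks
        }
        out.push_back(c);
    }
    return out;
}
```

With a filter like this, text such as "foo<TAB>bar" keeps a separator between
the two words, which is the behavior that was missing when copying from
qpdfview.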

The third patch merely emits more information in the <fontspec> elements when
pdftohtml is run with -xml. The PDFs I'm trying to analyze appear to use fonts
consistently enough that I can infer the meaning of a piece of text from its
font. I needed more information in the <fontspec> elements to do that, and
this patch provides it.
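
As a rough sketch of the consumer side, the following reads the attributes of
one <fontspec> line into a map, so that any attributes the patch adds are
picked up without code changes. The function name parseAttributes is
hypothetical, and the regex approach is a simplification that assumes
double-quoted attribute values on a single line; a real tool would more
likely use an XML parser.

```cpp
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: collect name="value" pairs from a single XML element.
// Assumes double-quoted values with no embedded quotes.
std::map<std::string, std::string> parseAttributes(const std::string &element)
{
    static const std::regex attr(R"re(([A-Za-z_][A-Za-z0-9_-]*)\s*=\s*"([^"]*)")re");
    std::map<std::string, std::string> result;
    for (std::sregex_iterator it(element.begin(), element.end(), attr), end;
         it != end; ++it) {
        result[(*it)[1].str()] = (*it)[2].str(); // attribute name -> value
    }
    return result;
}
```

For example, passing a line such as
<fontspec id="0" size="12" family="Times" color="#000000"/> (an illustrative
sample, not output from the patched build) yields a map whose "family" entry
is "Times".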

Please consider these for inclusion into the project.
