[Poppler-bugs] [Bug 12449] Feature request: pdftodocbook

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Mon Oct 30 01:06:39 UTC 2017


https://bugs.freedesktop.org/show_bug.cgi?id=12449

--- Comment #2 from kurt.pfeifle at gmail.com ---
Such a feature would only have a chance of working for "tagged" PDFs.

PDF started its life as a digital document format with (almost) only one
feature: to be a true visual replacement for printed paper, but on screen
(and to allow the on-screen page images to be converted reliably to paper
without misrendering).

For this task, PDF did not need to know the "meaning" of the strokes and pixels
on the screen. After all, deciphering these was meant to be the task of the
human looking at them. It did not need to know about the "semantics" of the
different parts, only about how these parts should render on screen or on
paper.

Later this simple conception of PDF was extended: the ambition was/is to
include an internal "markup" of various parts of the PDF visual content, and to
declare their respective MEANINGS as well: "this is a headline"; "this is a
subtitle"; "this is a textbox"; "this text string is the author's name". 

PDFs which are equipped with such internal markup are called "tagged" PDFs.

Very few real-world PDFs nowadays are "tagged". And if they are, the tagging
is very frequently incomplete and often plain wrong.

Converting a PDF document to DocBook is only feasible if the input is very
well tagged.
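
As a rough first check of whether a given file is tagged at all, one could
look at the "Tagged:" line that Poppler's pdfinfo prints. The sketch below
assumes pdfinfo is installed and on PATH; the function names parse_tagged and
is_tagged are made up for illustration. A "yes" only means the file declares
itself tagged; it says nothing about whether the tagging is complete or
correct.

```python
import subprocess


def parse_tagged(pdfinfo_output: str) -> bool:
    """Return True if a pdfinfo report contains a 'Tagged: yes' line."""
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Tagged":
            return value.strip().lower() == "yes"
    # No "Tagged:" line found: treat the file as untagged.
    return False


def is_tagged(path: str) -> bool:
    """Run pdfinfo on `path` (assumes pdfinfo is on PATH) and check the report."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_tagged(result.stdout)
```

Something like is_tagged("input.pdf") could then serve as a cheap gate before
any conversion attempt; judging the quality of the tags would still require
inspecting the structure tree itself.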
