[Poppler-bugs] [Bug 64815] [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present
bugzilla-daemon at freedesktop.org
bugzilla-daemon at freedesktop.org
Wed Oct 9 01:26:25 PDT 2013
https://bugs.freedesktop.org/show_bug.cgi?id=64815
--- Comment #50 from Carlos Garcia Campos <carlosgc at gnome.org> ---
Comment on attachment 86678
--> https://bugs.freedesktop.org/attachment.cgi?id=86678
[PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
Review of attachment 86678:
-----------------------------------------------------------------
Looks good in general, my main concern is that some code is duplicated with
TextOutputDev, so I prefer if we could somehoe add support for marked content
to TextOutputDev instead of inventing yet another output dev. All those
abbreviations makes the code a bit more difficult to read for me.
::: poppler/MCOutputDev.cc
@@ +34,5 @@
> +};
> +
> +
> +MCOutputDev::MCOutputDev(int mcid):
> + p(new Priv(mcid))
Why do we need a heap allocated private struct?
@@ +66,5 @@
> +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> +{
> + int id = -1;
> + if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> + p->capturing = true;
capturing sounds a bit confusing to me, I would use something like
inMarkedContent
@@ +149,5 @@
> + if (p->lastFlags != flags) {
> + if (p->capturing)
> + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> + p->lastFlags = flags;
> + }
It seems you are not adding any color "op". Either remove the color type or add
it when it changes in ::drawChar (or beginWord)
@@ +150,5 @@
> + if (p->capturing)
> + p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> + p->lastFlags = flags;
> + }
> +}
I wonder if we can integrate all this with TextOutputDev somehow. There's some
code duplication and this is about extracting text after all. Maybe we could
add a MarkedContentText object, similar to TextPage or ActualText objects, and
use it from the TextOutputDev. We could add a new constructor that only creates
a MarkedContentText and receives the marked content id, and then you can get
the text and attributes.
::: poppler/MCOutputDev.h
@@ +17,5 @@
> +class GfxState;
> +class GooString;
> +class Dict;
> +
> +struct MCOp {
What's op? operator?
@@ +33,5 @@
> + return ((Guint) (r * 255) & 0xFF) << 16
> + | ((Guint) (g * 255) & 0xFF) << 8
> + | ((Guint) (b * 255) & 0xFF);
> + }
> + };
Why not using GfxColor? or even GfxRGB if only RGB is supported?
@@ +67,5 @@
> + }
> + MCOp(): type(FontName), value(NULL) {}
> + MCOp(Unicode u): type(Unichar), unichar(u) {}
> + MCOp(const char *s): type(FontName), value(strdup(s)) {}
> + MCOp(Type t, Guint f = 0): type(t), flags(f) {}
Why do you need to pass also the Type for the flags constructor?
::: poppler/StructElement.h
@@ +233,5 @@
> + //
> + // The text will be appended to the passed GooString. If NULL is passed,
> + // a new string is returned, and the ownership passed to the caller.
> + //
> + GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;
appendText maybe?
--
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/poppler-bugs/attachments/20131009/3e4a6753/attachment.html>
More information about the Poppler-bugs
mailing list