[Poppler-bugs] [Bug 64815] [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Fri Nov 8 10:15:40 PST 2013


https://bugs.freedesktop.org/show_bug.cgi?id=64815

--- Comment #55 from Adrian Perez de Castro <aperez at igalia.com> ---
(In reply to comment #50)
> Comment on attachment 86678 [details] [review]
> [PATCH v8 05/15] Tagged-PDF: Text content extraction from structure elements
> 
> Review of attachment 86678 [details] [review]:
> -----------------------------------------------------------------
> 
> Looks good in general, my main concern is that some code is duplicated with
> TextOutputDev, so I prefer if we could somehoe add support for marked
> content to TextOutputDev instead of inventing yet another output dev. All
> those abbreviations makes the code a bit more difficult to read for me.

New version of the patch ditches the “MC” abbreviation and it appears
expanded to “MarkedContent”.

> ::: poppler/MCOutputDev.cc
> @@ +34,5 @@
> > +};
> > +
> > +
> > +MCOutputDev::MCOutputDev(int mcid):
> > +  p(new Priv(mcid))
> 
> Why do we need a heap allocated private struct?

We don't, I first wrote this before being aware that the Poppler low-level
API is not considered public (as in “public stable API”).

> @@ +66,5 @@
> > +void MCOutputDev::beginMarkedContent(char *name, Dict *properties)
> > +{
> > +  int id = -1;
> > +  if (properties && properties->lookupInt("MCID", NULL, &id) && id == p->mcid)
> > +    p->capturing = true;
> 
> capturing sounds a bit confusing to me, I would use something like
> inMarkedContent

Changed.

> @@ +149,5 @@
> > +  if (p->lastFlags != flags) {
> > +    if (p->capturing)
> > +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > +    p->lastFlags = flags;
> > +  }
> 
> It seems you are not adding any color "op". Either remove the color type or
> add it when it changes in ::drawChar (or beginWord)

Ooops. New version of the patch adds the missing code.

> @@ +150,5 @@
> > +    if (p->capturing)
> > +      p->mcOps.push_back(MCOp(MCOp::Flags, flags));
> > +    p->lastFlags = flags;
> > +  }
> > +}
> 
> I wonder if we can integrate all this with TextOutputDev somehow. There's
> some code duplication and this is about extracting text after all. Maybe we
> could add a MarkedContentText object, similar to TextPage or ActualText
> objects, and use it from the TextOutputDev. We could add a new constructor
> that only creates a MarkedContentText and receives the marked content id,
> and then you can get the text and attributes.
> 
> ::: poppler/MCOutputDev.h
> @@ +17,5 @@
> > +class GfxState;
> > +class GooString;
> > +class Dict;
> > +
> > +struct MCOp {
> 
> What's op? operator?

Operation.

> @@ +33,5 @@
> > +      return ((Guint) (r * 255) & 0xFF) << 16
> > +           | ((Guint) (g * 255) & 0xFF) << 8
> > +           | ((Guint) (b * 255) & 0xFF);
> > +    }
> > +  };
> 
> Why not using GfxColor? or even GfxRGB if only RGB is supported?

GfxRGB seems a resonable compromise. Next version will use it.

> @@ +67,5 @@
> > +  }
> > +  MCOp(): type(FontName), value(NULL) {}
> > +  MCOp(Unicode u): type(Unichar), unichar(u) {}
> > +  MCOp(const char *s): type(FontName), value(strdup(s)) {}
> > +  MCOp(Type t, Guint f = 0): type(t), flags(f) {}
> 
> Why do you need to pass also the Type for the flags constructor?

Type “Unicode” is also a “Guint”, so there would be two overloaded
functions with one argument of the same type. The compiler cannot
resolve the ambiguity and generates an error. Adding the extra
“Type” argument is a workaround for that.

> ::: poppler/StructElement.h
> @@ +233,5 @@
> > +  //
> > +  // The text will be appended to the passed GooString. If NULL is passed,
> > +  // a new string is returned, and the ownership passed to the caller.
> > +  //
> > +  GooString *getText(GooString *string = NULL, GBool recursive = gTrue) const;
> 
> appendText maybe?

After thinking twice about this, it looks like it is wrong to have a
single function that can work in two ways. The reason for those two
behaviours is that client code will usually call “getText()”, without
passing arguments, so internally the function will create a GooString
and then iterate over the children recursively, passing that one string
to avoid allocating one GooString per child element.

IMHO the best is to split this in two functions:

  public:
    GooString* getText(GBool recursive = gTrue);
  private:
    void appendSubTreeText(GooString *string);

The new “appendSubTreeText()” function would be a private helper
function used when “getText()” is called with “recusive = gTrue”.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/poppler-bugs/attachments/20131108/4ecfbf13/attachment.html>


More information about the Poppler-bugs mailing list