[Poppler-bugs] [Bug 64815] [TAGGEDPDF] Parse the Tagged-PDF document structure tree when present

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Thu Sep 26 11:15:00 PDT 2013


https://bugs.freedesktop.org/show_bug.cgi?id=64815

--- Comment #39 from Adrian Perez de Castro <aperez at igalia.com> ---
(In reply to comment #37)
> Comment on attachment 82288 [details] [review]
> [PATCH v6 03/10] Tagged-PDF: Implement parsing of StructElem objects
> 
> Review of attachment 82288 [details] [review]:
> -----------------------------------------------------------------
> 
> I still don't see the point of having two structs heap allocated and all the
> API methods with 'if isFoo()' to use one or the other, but I'm not opposed
> either if you guys think it's better.
> 
> ::: poppler/StructElement.cc
> @@ +83,5 @@
> > +  { StructElement::Formula,    "Formula",    elementTypeUndefined },
> > +  { StructElement::Form,       "Form",       elementTypeUndefined },
> > +  { StructElement::TOC,        "TOC",        elementTypeUndefined },
> > +  { StructElement::TOCI,       "TOCI",       elementTypeUndefined },
> > +};
> 
> All this is part of the standard types defined for tagged pdf, right? If we
> move this to a follow up patch and can merge this with the previous one,
> containing only logical structure implementation.

Righy. I will split this to a separate patch.

> @@ +126,5 @@
> > +//------------------------------------------------------------------------
> > +// StructElement
> > +//------------------------------------------------------------------------
> > +
> > +const Ref StructElement::InvalidRef = { -1, -1 };
> 
> This looks unused here

Removed.

> @@ +176,5 @@
> > +  c(new ContentData(mcid))
> > +{
> > +  assert(treeRoot);
> > +  assert(parent);
> > +  assert(c->mcid != InvalidMCID);
> 
> Instead of creating the ContentData struct with an invalid mcid, and
> asserting here, I think we should ensure that StructElement is never created
> with an invalid mcid. I don't understand in which situation mcid can be
> InvalidMCID or why -1 can't be a valid MCID, the spec says it's an integer
> that uniquely identifies the marked content sequence.

You are right, instantiation inside StructElement::parseChild() checks that
the MCID is an integer.


> @@ +188,5 @@
> > +{
> > +  assert(treeRoot);
> > +  assert(parent);
> > +  assert(c->ref.num >= 0);
> > +  assert(c->ref.gen >= 0);
> 
> Same here, this should never be called with an invalid ref and I think it's
> already ensured by the callers.

Yes, StructElement::parseChild() also checks that this will be a Ref.

Personally, I think it does not hurt to have them even when with the
current implementation it is guaranteed that the passed values will
be correct. For example, if the API is extended at some point to also
allow creation of structure trees (i.e: not just reading them) those
constructors could/would be used by client code, and it would be good
to have the asserts to ensure that the API is used correctly… Anyway,
I am digressing; I will remove the extra asserts for now  O:-)


> @@ +226,5 @@
> > +  if (parent)
> > +    return parent->getPageRef();
> > +
> > +  static const Ref invalidRef = { -1, -1 };
> > +  return invalidRef;
> 
> I would make this boolean returning the Ref as an out parameter instead of
> using this invalidRef

Seems better, yes. I am changing this to:

  GBool StructElement::getPageRef(Ref& ref);

with the suggested semantics.

> @@ +299,5 @@
> > +  obj.free();
> > +
> > +  // Object ID (optional), to be looked at the IDTree in the tree root.
> > +  if (element->lookup("ID", &obj)->isString()) {
> > +    s->id = new GooString(obj.getString());
> 
> It's probably simpler to do obj.getString()->copy(), but it's exactly the
> same.

Changed.

> @@ +317,5 @@
> > +  obj.free();
> > +
> > +  // Element title (optional).
> > +  if (element->lookup("T", &obj)->isString()) {
> > +    s->title = new GooString(obj.getString());
> 
> Ditto.

Changed as well.

> @@ +324,5 @@
> > +
> > +  // Language (optional).
> > +  if (element->lookup("Lang", &obj)->isString()) {
> > +    s->language = obj.getString();
> > +    obj.initNull(); // The StructElement takes ownership of the GooString
> 
> I still don't understand why you are doing this here and not for the title
> for example. And I still think the proper way of doing this is adding
> Object::takeString() or something like that. Another way of avoid
> duplicating the string is to save the Object instead of getting the String
> contained.

I have added Object::takeString(), and also modified the rest of the
places where strings are used to use Object::takeString() if possible.

> @@ +378,5 @@
> > +      mcidObj.free();
> > +      return NULL;
> > +    }
> > +
> > +    child = new StructElement(mcidObj.getInt(), treeRoot, this);
> 
> mcidObj.free() after this.

Good catch, thanks. I have added it.

> @@ +383,5 @@
> > +
> > +    if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef()) {
> > +      child->pageRef = pageRefObj;
> > +    } else {
> > +      pageRefObj.free();
> 
> I don't think you need the else, you can call .free unconditionally:
> 
> if (childObj->dictLookupNF("Pg", &pageRefObj)->isRef())
>   child->pageRef = pageRefObj;
> pageRefObj.free();

Note that “pageRefObj” is in Object, not a Ref. Note that “child->pageRef”
is a pointer to the Object, so if it is deinitialized by calling .free()
on it, it may be left in an undetermined state. The idea is to only call
.free() if the object is *not* a Ref. If it is a Ref, “child->pageRef” is
freed in the destructor for StructElement.

> @@ +403,5 @@
> > +      error(errSyntaxError, -1, "Obj object is wrong type ({0:s})", refObj.getTypeName());
> > +    }
> > +    refObj.free();
> > +  } else if (childObj->isDict()) {
> > +    assert(ref->isRef());
> 
> Why is this required to be a valid ref when child object is a dict?

Because the reference number is used to populate the “seen” set (it is
a “std::set<int>”) which tracks the seen nodes to avoid infinite loops
in malformed PDFs. I have changed this so instead of asserting, an error
is produced instead. According to the spec a properly formed /StructTree
must contain references for the /StrucElem items (not dictionaries
directly embedded in the tree), so causing an error should be reasonable.

> ::: poppler/StructElement.h
> @@ +63,3 @@
> >  
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> 
> I don't think we should consider the cases of invalid objects, isOk should
> return FALSE in all those cases

Okay.

> @@ +63,5 @@
> >  
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> > +
> > +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }
> 
> This is not actually public API, so I think it's simpler to assume this is
> only going to be called when type == MCID, so the caller should ensure that
> by checking the type first.

Yep.

> @@ +64,5 @@
> > +  inline GBool isContent() const { return (type == MCID && c->mcid != InvalidMCID) || isObjectRef(); }
> > +  inline GBool isObjectRef() const { return (type == OBJR && c->ref.num != -1 && c->ref.gen != -1); }
> > +
> > +  int getMCID() const { return type == MCID ? c->mcid : InvalidMCID; }
> > +  Ref getObjectRef() const { return type == OBJR ? c->ref : InvalidRef; }
> 
> Ditto.

Yepyep.

> ::: poppler/StructTreeRoot.cc
> @@ +115,4 @@
> >      for (int i = 0; i < kids.arrayGetLength(); i++) {
> >        Object obj, ref;
> >        kids.arrayGetNF(i, &ref);
> > +      assert(ref.isRef());
> 
> Why should this be a valid ref? Shouldn't we ignore invalid entries and
> continue parsing instead of crashing?

Same reason as earlier. In the case of the structure tree it is okay to
skip adding the integer to the “seen” set, as the loop would be caught
later on, so I have removed this assertion.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/poppler-bugs/attachments/20130926/6dcd104f/attachment-0001.html>


More information about the Poppler-bugs mailing list