[poppler] extend poppler::text_box to store some font infos

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Mon Mar 19 16:30:38 UTC 2018


Hi,

Recently I heard that some people want to retrieve the list
of words from a PDF, as the cpp frontend's poppler::page::text_list()
does, but with font information (e.g. the family name of
the font).

Considering that office documents and academic articles
often use different fonts for the section titles and
the main text, it would be reasonable for people to
expect something like "I want to retrieve the text boxes,
but only the text boxes set in Helvetica-Bold".
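
For illustration only, here is a minimal sketch of what the
Helvetica-Bold request above might look like from the user's side.
The font_text_box type and the filter_by_font() helper are
hypothetical names made up for this mail; they are not part of the
poppler cpp frontend, and an actual API might expose the font
information differently.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a text box that also carries font information.
// This is NOT poppler::text_box; the field names are assumptions.
struct font_text_box {
    std::string text;       // the word contained in the box
    std::string font_name;  // e.g. "Helvetica-Bold"
};

// Return only the boxes whose font name matches font_name exactly,
// preserving their original order.
std::vector<font_text_box> filter_by_font(const std::vector<font_text_box>& boxes,
                                          const std::string& font_name)
{
    std::vector<font_text_box> result;
    for (const auto& box : boxes) {
        if (box.font_name == font_name) {
            result.push_back(box);
        }
    }
    return result;
}
```

Given a list of such boxes for a page, calling
filter_by_font(boxes, "Helvetica-Bold") would keep only the words
of the section titles, if the titles are the only text set in that font.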

What is the right way to do this? During the development
of poppler::page::text_list(), I once tried to implement it:
https://github.com/mpsuzuki/poppler/commit/8ce2556a62a90c034d7cea8b1dfd26715d03a8f0
(note: this patch was written before the use of unique_ptr
stabilized; more fixes are expected in the future)

However, I feel the patch is slightly too big. It changes
not only the cpp frontend code, but also poppler/FontInfo.{cc,h}
and poppler/TextOutputDev.{cc,h}. I want to ask a few
questions...

Q-1) Does a request for text_box with font info fit within poppler's
scope? Or is there a better library to ask for such a feature?

Q-2) If this request fits within poppler's scope, is enhancing
the cpp frontend's poppler::page::text_list() the way to
go? Or would a separate API for this purpose be better?

Q-3) My current patch modifies FontInfo and TextOutputDev
in libpoppler itself. Is such a modification acceptable?

I would appreciate it if the maintainers could give some comments.

Regards,
mpsuzuki


