[poppler] How to read textbox positions?

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu Feb 8 03:17:33 UTC 2018


Dear Albert,

> Can you please try adding some documentation in the cpp file like the other 
> functions have?

Oh, sorry for overlooking that task. Here it is...

Regards,
mpsuzuki

Albert Astals Cid wrote:
> El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va 
> escriure:
>> Dear Albert,
>>
>>> As far as i can see we don't use exceptions anywhere in the poppler code,
>>> so i would not introduce them "just for this".
>> Oh, I'm sorry for my overlooking that. Here is updated patch which checks
>> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index
>> is passed. Also I changed the type of the index from int to size_t.
> 
> Can you please try adding some documentation in the cpp file like the other 
> functions have?
> 
> Cheers,
>   Albert
> 
>> Regards,
>> mpsuzuki
>>
>> Albert Astals Cid wrote:
>>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va
>>>
>>> escriure:
>>>> Dear Albert,
>>>>
>>>> Oops, I apologize that I overlooked your response to
>>>> improve the patch.
>>>>
>>>>> You have one text_box where you have 17 char bboxes but the
>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash
>>>>> because you get an out of bounds.
>>>> Indeed, it might be common scenarios for the
>>>> programmers assuming the input as ASCII data but
>>>> something different is given.
>>>>
>>>>> So it would be nice to make
>>>>> rectf     char_bbox(int i) const;
>>>>> not crash on an invalid i, i guess return a 0x0 rect
>>>>> and then probably add some documentation about it?
>>>> Checking the features of C++ std::vector, I think
>>>> we have 2 options:
>>>>
>>>> a) use (as current patch) m_data->char_bbox[i] and
>>>> sanitize the i-index before using it (as you proposed),
>>>> and the bad values cause 0x0 rectf.
>>>>
>>>> b) use m_data->char_bbox.at(i) and let stdlib raise
>>>> std::out_of_range error.
>>>>
>>>> which is better for C++ programmers?
>>> As far as i can see we don't use exceptions anywhere in the poppler code,
>>> so i would not introduce them "just for this".
>>>
>>> Cheers,
>>>
>>>   Albert
>>>   
>>>>> FWIW this was a real problem in Okular and i fixed it with
>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.
>>>>> cp
>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567
>>>>>
>>>>> Am i making any sense here?
>>>> Thank you very much for the important suggestion and
>>>> sorry for delayed reponse!
>>>>
>>>> Regards,
>>>> mpsuzuki
>>>>
>>>> Albert Astals Cid wrote:
>>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va
>>> escriure:
>>>>>> Hello mpsuzuki,
>>>>>>
>>>>>> I do not think you need the two indirections: Since you want to keep
>>>>>> that m_data indirection for ABI compatibility, just return
>>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its
>>>>>> data with the additional getter methods. (You just need to make sure to
>>>>>> default the move constructor and assignment operator since no copying
>>>>>> implies no moving by default.)
>>>>>>
>>>>>> Attached is an update patch along those lines. (I also took the liberty
>>>>>> to replace some bare deletes by use of unique_ptr in the implementation
>>>>>> of page::text_list as well.)
>>>>> One thing that i wanted to improve in the Qt5 frontend but never go to,
>>>>> maybe we can start here is the fact that the number of boxes is "a bit
>>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189
>>>>>
>>>>> You have one text_box where you have 17 char bboxes but the
>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash
>>>>> because you get an out of bounds.
>>>>>
>>>>> So it would be nice to make
>>>>> rectf     char_bbox(int i) const;
>>>>> not crash on an invalid i, i guess return a 0x0 rect
>>>>> and then probably add some documentation about it?
>>>>>
>>>>> FWIW this was a real problem in Okular and i fixed it with
>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.
>>>>> cp
>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567
>>>>>
>>>>> Am i making any sense here?
>>>>>
>>>>> Cheers,
>>>>>
>>>>>   Albert
>>>>>> Best regards, Adam.
>>>>>>
>>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya:
>>>>>>> Dear Adam,
>>>>>>>
>>>>>>> Thank you for the technical information on the difference of
>>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace
>>>>>>> the vector of the pointers to that of the unique_ptrs, and
>>>>>>> remove the next_xxx methods.
>>>>>>>
>>>>>>> I attached the revised patch (gzipped) to the master, and
>>>>>>> here I quote the revised part from my last revision, for
>>>>>>> convenience to take a look on it.
>>>>>>>
>>>>>>> Regards,
>>>>>>> mpsuzuki
>>>>>>>
>>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h
>>>>>>> index 726068a..2cedd1e 100644
>>>>>>> --- a/cpp/poppler-private.h
>>>>>>> +++ b/cpp/poppler-private.h
>>>>>>> @@ -72,12 +72,11 @@ class text_box_data
>>>>>>>
>>>>>>>  {
>>>>>>>  
>>>>>>>  public:
>>>>>>>      text_box_data()
>>>>>>>
>>>>>>> -    : next_text_box(0), has_space_after(false)
>>>>>>> +    : has_space_after(false)
>>>>>>>
>>>>>>>      {
>>>>>>>      }
>>>>>>>      ustring text;
>>>>>>>      rectf bbox;
>>>>>>>
>>>>>>> -    text_box *next_text_box;
>>>>>>>
>>>>>>>      std::vector<rectf> char_bboxes;
>>>>>>>      bool has_space_after;
>>>>>>>  
>>>>>>>  };
>>>>>>>
>>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h
>>>>>>> index 18af213..3013997 100644
>>>>>>> --- a/cpp/poppler-page.h
>>>>>>> +++ b/cpp/poppler-page.h
>>>>>>> @@ -22,6 +22,8 @@
>>>>>>>
>>>>>>>  #include "poppler-global.h"
>>>>>>>  #include "poppler-rectangle.h"
>>>>>>>
>>>>>>> +#include <memory>
>>>>>>> +
>>>>>>>
>>>>>>>  namespace poppler
>>>>>>>  {
>>>>>>>
>>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box {
>>>>>>>
>>>>>>>  	~text_box();
>>>>>>>  	ustring   text() const;
>>>>>>>  	rectf     bbox() const;
>>>>>>>
>>>>>>> -	text_box *next_text_box() const;
>>>>>>> -	text_box *next_word() { return this->next_text_box(); };
>>>>>>>
>>>>>>>  	rectf     char_bbox(int i) const;
>>>>>>>  	bool      has_space_after() const;
>>>>>>>  	
>>>>>>>      private:
>>>>>>> -	text_box_data* m_data;
>>>>>>> +	std::unique_ptr<text_box_data> m_data;
>>>>>>>
>>>>>>>  };
>>>>>>>  
>>>>>>>  class document;
>>>>>>>
>>>>>>> @@ -79,7 +79,7 @@ public:
>>>>>>>      ustring text(const rectf &rect = rectf()) const;
>>>>>>>      ustring text(const rectf &rect, text_layout_enum layout_mode)
>>>>>>>      const;
>>>>>>>
>>>>>>> -    std::vector<text_box*> text_list(rotation_enum rotation) const;
>>>>>>> +    std::vector<std::unique_ptr<text_box>> text_list(rotation_enum
>>>>>>> rotation) const;>
>>>>>>>
>>>>>>>  private:
>>>>>>>      page(document_private *doc, int index);
>>>>>>>
>>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp
>>>>>>> index 9461ab9..5da2e1d 100644
>>>>>>> --- a/cpp/poppler-page.cpp
>>>>>>> +++ b/cpp/poppler-page.cpp
>>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r,
>>>>>>> text_layout_enum
>>>>>>> layout_mode) const
>>>>>>>
>>>>>>>   */
>>>>>>>  
>>>>>>>  text_box::text_box(const ustring& text, const rectf &bbox)
>>>>>>>  {
>>>>>>>
>>>>>>> -    m_data = new text_box_data();
>>>>>>> +    m_data = std::unique_ptr<text_box_data>(new text_box_data());
>>>>>>>
>>>>>>>      m_data->text = text;
>>>>>>>      m_data->bbox = bbox;
>>>>>>>  
>>>>>>>  }
>>>>>>>  
>>>>>>>  text_box::~text_box()
>>>>>>>  {
>>>>>>>
>>>>>>> -    delete m_data;
>>>>>>>
>>>>>>>  }
>>>>>>>  
>>>>>>>  ustring text_box::text() const
>>>>>>>
>>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const
>>>>>>>
>>>>>>>      return m_data->bbox;
>>>>>>>  
>>>>>>>  }
>>>>>>>
>>>>>>> -text_box* text_box::next_text_box() const
>>>>>>> -{
>>>>>>> -    return m_data->next_text_box;
>>>>>>> -}
>>>>>>> -
>>>>>>>
>>>>>>>  rectf text_box::char_bbox(int i) const
>>>>>>>  {
>>>>>>>  
>>>>>>>      return m_data->char_bboxes[i];
>>>>>>>
>>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const
>>>>>>>
>>>>>>>      return m_data->has_space_after;
>>>>>>>  
>>>>>>>  }
>>>>>>>
>>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const
>>>>>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum
>>>>>>> rotate) const>
>>>>>>>
>>>>>>>  {
>>>>>>>  
>>>>>>>      TextOutputDev *output_dev;
>>>>>>>
>>>>>>> -    std::vector<text_box*>  output_list;
>>>>>>> +    std::vector<std::unique_ptr<text_box>>  output_list;
>>>>>>>
>>>>>>>      const int rotation_value = (int)rotate * 90;
>>>>>>>      
>>>>>>>      /* config values are same with Qt5 Page::TextList() */
>>>>>>>
>>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*>
>>>>>>> page::text_list(rotation_enum
>>>>>>> rotate) const
>>>>>>>
>>>>>>>  	double xMin, yMin, xMax, yMax;
>>>>>>>  	word->getBBox(&xMin, &yMin, &xMax, &yMax);
>>>>>>>
>>>>>>> -	text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin,
>>>>>>> yMax-yMin)); +	std::unique_ptr<text_box> tb =
>>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin,
>>>>>>> xMax-xMin,
>>>>>>> yMax-yMin)));
>>>>>>>
>>>>>>>  	tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue);
>>>>>>>  	
>>>>>>>  	tb->m_data->char_bboxes.reserve(word->getLength());
>>>>>>>
>>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*>
>>>>>>> page::text_list(rotation_enum
>>>>>>> rotate) const
>>>>>>>
>>>>>>>  	    tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-
> xMin,
>>>>>>>  	    yMax-yMin));
>>>>>>>  	
>>>>>>>  	}
>>>>>>>
>>>>>>> -	if (output_list.size() > 0)
>>>>>>> -	    output_list.back()->m_data->next_text_box = tb;
>>>>>>> -
>>>>>>> -	output_list.push_back(tb);
>>>>>>> +	output_list.push_back(std::move(tb));
>>>>>>>
>>>>>>>      }
>>>>>>>      delete word_list;
>>>>>>>      delete output_dev;
>>>>>>>
>>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp
>>>>>>> index 09b180d..d26a0ef 100644
>>>>>>> --- a/cpp/tests/poppler-dump.cpp
>>>>>>> +++ b/cpp/tests/poppler-dump.cpp
>>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p)
>>>>>>>
>>>>>>>          std::cout << std::endl;
>>>>>>>          return;
>>>>>>>      
>>>>>>>      }
>>>>>>>
>>>>>>> -    std::vector<poppler::text_box*> text_list =
>>>>>>> p->text_list(poppler::rotate_0); +
>>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list =
>>>>>>> p->text_list(poppler::rotate_0);
>>>>>>>
>>>>>>>      std::cout << "---" << std::endl;
>>>>>>>      for (size_t i = 0; i < text_list.size(); i ++) {
>>>>>>>
>>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p)
>>>>>>>
>>>>>>>          std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w="
>>>>>>>          <<
>>>>>>>
>>>>>>> bbox.width() << " h=" << bbox.height() << " )";
>>>>>>>
>>>>>>>          std::cout << std::endl;
>>>>>>>
>>>>>>> -        delete text_list[i];
>>>>>>>
>>>>>>>      }
>>>>>>>      std::cout << "---" << std::endl;
>>>>>>>  
>>>>>>>  }
>>>>>>>
>>>>>>> Adam Reichold wrote:
>>>>>>>> Hello mpsuzuki,
>>>>>>>>
>>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11
>>>>>>>> only
>>>>>>>> since a relatively short time, so I guess there just was no time to
>>>>>>>> use
>>>>>>>> it anywhere.
>>>>>>>>
>>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic
>>>>>>>> memory accesses on the reference counter which will slow down even
>>>>>>>> single-threaded programs which do not need it. Hence I would suggest
>>>>>>>> starting with std::unique_ptr to get a move-only type.
>>>>>>>>
>>>>>>>> Best regards, Adam.
>>>>>>>>
>>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya:
>>>>>>>>> Dear Albert, Adam,
>>>>>>>>>
>>>>>>>>> Thank you for the helpful review. I should have more
>>>>>>>>> training on C++... Please let me spend to improve the
>>>>>>>>> patch.
>>>>>>>>>
>>>>>>>>>> (Another easy option albeit with some
>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual
>>>>>>>>>> data
>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>>>>>> Yeah, from some posts in stackoverflow, I found about
>>>>>>>>> shared_ptr... But it seems that current poppler does
>>>>>>>>> not use shared_ptr at all. Is this a coding policy?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> mpsuzuki
>>>>>>>>>
>>>>>>>>> Adam Reichold wrote:
>>>>>>>>>> Hello
>>>>>>>>>>
>>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid:
>>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki
>>>>>>>>>>> toshiya
>>>>>>>>>>> va
>>>>>>>>>>>
>>>>>>>>>>> escriure:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've tried to implement the suggestion, I attached my current
>>>>>>>>>>>> patch.
>>>>>>>>>>>>
>>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and
>>>>>>>>>>>> renamed,
>>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing,
>>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote
>>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which
>>>>>>>>>>>> calls next_text_box() internally.
>>>>>>>>>>>>
>>>>>>>>>>>> Another point I want to discuss is the design of the list give by
>>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList()
>>>>>>>>>>>> returns
>>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns
>>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend.
>>>>>>>>>>>>
>>>>>>>>>>>> But, if we return the vector of pointers, the client should
>>>>>>>>>>>> destruct
>>>>>>>>>>>> the objects pointed by the vector, before destructing vector
>>>>>>>>>>>> itself.
>>>>>>>>>>>> Using a vector of text_box (not the pointer but the object
>>>>>>>>>>>> itself),
>>>>>>>>>>>> like std::vector<text_box>, could be better, because the
>>>>>>>>>>>> destructor
>>>>>>>>>>>> of the vector would internally call the destructor for text_box
>>>>>>>>>>>> object.
>>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have
>>>>>>>>>>>> such).
>>>>>>>>>>>> If I'm misunderstanding about C++, please correct.
>>>>>>>>>>> First you need to fix the text_box class, you either need to make
>>>>>>>>>>> it
>>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you
>>>>>>>>>>> need
>>>>>>>>>>> to make the copy and assignment operators not break your class
>>>>>>>>>>> since
>>>>>>>>>>> at the moment the default operators will just copy the pointer and
>>>>>>>>>>> everything will break because you will end up with a dangling
>>>>>>>>>>> pointer.
>>>>>>>>>>>
>>>>>>>>>>> If you go for the second option you'll want the move constructor
>>>>>>>>>>> and
>>>>>>>>>>> move
>>>>>>>>>>> operator too.
>>>>>>>>>> Should the first option also be viable if text_box becomes a
>>>>>>>>>> move-only
>>>>>>>>>> type, i.e. no copying but move construction and assignment? I think
>>>>>>>>>> std::vector can handle it. (Another easy option albeit with some
>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual
>>>>>>>>>> data
>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>>>>>>>
>>>>>>>>>> Best regards, Adam.
>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>>   Albert
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> mpsuzuki
>>>>>>>>>>>>
>>>>>>>>>>>> Albert Astals Cid wrote:
>>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen
>>>>>>>>>>>>> Ooms
>>>>>>>>>>>>> va
>>>>>>>>>>>>>
>>>>>>>>>>>>> escriure:
>>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf
>>>>>>>>>>>>>> document,
>>>>>>>>>>>>>> including the position of each text box? Currently we use
>>>>>>>>>>>>>> page->text()
>>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but I
>>>>>>>>>>>>>> need
>>>>>>>>>>>>>> more detailed information about each text box per page.
>>>>>>>>>>>>> You want to code the variant of qt5 frontend
>>>>>>>>>>>>> Poppler::Page::textList() for
>>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration
>>>>>>>>>>>>> (i.e.
>>>>>>>>>>>>> almost-copying) the code, do you have time for it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Albert
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> poppler mailing list
>>>>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> poppler mailing list
>>>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> poppler mailing list
>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> 
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
-------------- next part --------------
A non-text attachment was scrubbed...
Name: add-cpp-text-list_20180208-1215.diff.xz
Type: application/octet-stream
Size: 2632 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/poppler/attachments/20180208/fc1b0e43/attachment.obj>


More information about the poppler mailing list