[poppler] How to read textbox positions?

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Mon Feb 26 05:47:41 UTC 2018


Dear Albert,

Sorry for keeping you troubled about this.
Have I overlooked any homework assigned to me?
Please tell me what made this patch stopped...

Regards,
mpsuzuki

suzuki toshiya wrote:
> Dear Albert,
> 
>> Can you please try adding some documentation in the cpp file like the other 
>> functions have?
> 
> Oh, sorry for overlooking that task. Here it is...
> 
> Regards,
> mpsuzuki
> 
> Albert Astals Cid wrote:
>> El diumenge, 28 de gener de 2018, a les 13:24:45 CET, suzuki toshiya va 
>> escriure:
>>> Dear Albert,
>>>
>>>> As far as i can see we don't use exceptions anywhere in the poppler code,
>>>> so i would not introduce them "just for this".
>>> Oh, I'm sorry for my overlooking that. Here is updated patch which checks
>>> out-of-bound value to char_bbox() and return rect(0,0,0,0) if bad index
>>> is passed. Also I changed the type of the index from int to size_t.
>> Can you please try adding some documentation in the cpp file like the other 
>> functions have?
>>
>> Cheers,
>>   Albert
>>
>>> Regards,
>>> mpsuzuki
>>>
>>> Albert Astals Cid wrote:
>>>> El dissabte, 27 de gener de 2018, a les 17:17:37 CET, suzuki toshiya va
>>>>
>>>> escriure:
>>>>> Dear Albert,
>>>>>
>>>>> Oops, I apologize that I overlooked your response to
>>>>> improve the patch.
>>>>>
>>>>>> You have one text_box where you have 17 char bboxes but the
>>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash
>>>>>> because you get an out of bounds.
>>>>> Indeed, it might be common scenarios for the
>>>>> programmers assuming the input as ASCII data but
>>>>> something different is given.
>>>>>
>>>>>> So it would be nice to make
>>>>>> rectf     char_bbox(int i) const;
>>>>>> not crash on an invalid i, i guess return a 0x0 rect
>>>>>> and then probably add some documentation about it?
>>>>> Checking the features of C++ std::vector, I think
>>>>> we have 2 options:
>>>>>
>>>>> a) use (as current patch) m_data->char_bbox[i] and
>>>>> sanitize the i-index before using it (as you proposed),
>>>>> and the bad values cause 0x0 rectf.
>>>>>
>>>>> b) use m_data->char_bbox.at(i) and let stdlib raise
>>>>> std::out_of_range error.
>>>>>
>>>>> which is better for C++ programmers?
>>>> As far as i can see we don't use exceptions anywhere in the poppler code,
>>>> so i would not introduce them "just for this".
>>>>
>>>> Cheers,
>>>>
>>>>   Albert
>>>>   
>>>>>> FWIW this was a real problem in Okular and i fixed it with
>>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.
>>>>>> cp
>>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567
>>>>>>
>>>>>> Am i making any sense here?
>>>>> Thank you very much for the important suggestion and
>>>>> sorry for delayed reponse!
>>>>>
>>>>> Regards,
>>>>> mpsuzuki
>>>>>
>>>>> Albert Astals Cid wrote:
>>>>>> El dimecres, 3 de gener de 2018, a les 21:10:06 CET, Adam Reichold va
>>>> escriure:
>>>>>>> Hello mpsuzuki,
>>>>>>>
>>>>>>> I do not think you need the two indirections: Since you want to keep
>>>>>>> that m_data indirection for ABI compatibility, just return
>>>>>>> std::vector<text_box> as text_box basically "is" a unique_ptr to its
>>>>>>> data with the additional getter methods. (You just need to make sure to
>>>>>>> default the move constructor and assignment operator since no copying
>>>>>>> implies no moving by default.)
>>>>>>>
>>>>>>> Attached is an update patch along those lines. (I also took the liberty
>>>>>>> to replace some bare deletes by use of unique_ptr in the implementation
>>>>>>> of page::text_list as well.)
>>>>>> One thing that i wanted to improve in the Qt5 frontend but never go to,
>>>>>> maybe we can start here is the fact that the number of boxes is "a bit
>>>>>> random", for example take https://bugs.kde.org/attachment.cgi?id=66189
>>>>>>
>>>>>> You have one text_box where you have 17 char bboxes but the
>>>>>> ustr.to_utf8().size() is 27, so if you iterate it's easy to get a crash
>>>>>> because you get an out of bounds.
>>>>>>
>>>>>> So it would be nice to make
>>>>>> rectf     char_bbox(int i) const;
>>>>>> not crash on an invalid i, i guess return a 0x0 rect
>>>>>> and then probably add some documentation about it?
>>>>>>
>>>>>> FWIW this was a real problem in Okular and i fixed it with
>>>>>> https://cgit.kde.org/okular.git/commit/generators/poppler/generator_pdf.
>>>>>> cp
>>>>>> p?id=dcf0e78227959a52300d8f253c4b1058b3e81567
>>>>>>
>>>>>> Am i making any sense here?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>   Albert
>>>>>>> Best regards, Adam.
>>>>>>>
>>>>>>> Am 03.01.2018 um 07:38 schrieb suzuki toshiya:
>>>>>>>> Dear Adam,
>>>>>>>>
>>>>>>>> Thank you for the technical information on the difference of
>>>>>>>> std::shared_ptr and std::unique_ptr. Just I've tried to replace
>>>>>>>> the vector of the pointers to that of the unique_ptrs, and
>>>>>>>> remove the next_xxx methods.
>>>>>>>>
>>>>>>>> I attached the revised patch (gzipped) to the master, and
>>>>>>>> here I quote the revised part from my last revision, for
>>>>>>>> convenience to take a look on it.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> mpsuzuki
>>>>>>>>
>>>>>>>> diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h
>>>>>>>> index 726068a..2cedd1e 100644
>>>>>>>> --- a/cpp/poppler-private.h
>>>>>>>> +++ b/cpp/poppler-private.h
>>>>>>>> @@ -72,12 +72,11 @@ class text_box_data
>>>>>>>>
>>>>>>>>  {
>>>>>>>>  
>>>>>>>>  public:
>>>>>>>>      text_box_data()
>>>>>>>>
>>>>>>>> -    : next_text_box(0), has_space_after(false)
>>>>>>>> +    : has_space_after(false)
>>>>>>>>
>>>>>>>>      {
>>>>>>>>      }
>>>>>>>>      ustring text;
>>>>>>>>      rectf bbox;
>>>>>>>>
>>>>>>>> -    text_box *next_text_box;
>>>>>>>>
>>>>>>>>      std::vector<rectf> char_bboxes;
>>>>>>>>      bool has_space_after;
>>>>>>>>  
>>>>>>>>  };
>>>>>>>>
>>>>>>>> diff --git a/cpp/poppler-page.h b/cpp/poppler-page.h
>>>>>>>> index 18af213..3013997 100644
>>>>>>>> --- a/cpp/poppler-page.h
>>>>>>>> +++ b/cpp/poppler-page.h
>>>>>>>> @@ -22,6 +22,8 @@
>>>>>>>>
>>>>>>>>  #include "poppler-global.h"
>>>>>>>>  #include "poppler-rectangle.h"
>>>>>>>>
>>>>>>>> +#include <memory>
>>>>>>>> +
>>>>>>>>
>>>>>>>>  namespace poppler
>>>>>>>>  {
>>>>>>>>
>>>>>>>> @@ -33,12 +35,10 @@ class POPPLER_CPP_EXPORT text_box {
>>>>>>>>
>>>>>>>>  	~text_box();
>>>>>>>>  	ustring   text() const;
>>>>>>>>  	rectf     bbox() const;
>>>>>>>>
>>>>>>>> -	text_box *next_text_box() const;
>>>>>>>> -	text_box *next_word() { return this->next_text_box(); };
>>>>>>>>
>>>>>>>>  	rectf     char_bbox(int i) const;
>>>>>>>>  	bool      has_space_after() const;
>>>>>>>>  	
>>>>>>>>      private:
>>>>>>>> -	text_box_data* m_data;
>>>>>>>> +	std::unique_ptr<text_box_data> m_data;
>>>>>>>>
>>>>>>>>  };
>>>>>>>>  
>>>>>>>>  class document;
>>>>>>>>
>>>>>>>> @@ -79,7 +79,7 @@ public:
>>>>>>>>      ustring text(const rectf &rect = rectf()) const;
>>>>>>>>      ustring text(const rectf &rect, text_layout_enum layout_mode)
>>>>>>>>      const;
>>>>>>>>
>>>>>>>> -    std::vector<text_box*> text_list(rotation_enum rotation) const;
>>>>>>>> +    std::vector<std::unique_ptr<text_box>> text_list(rotation_enum
>>>>>>>> rotation) const;>
>>>>>>>>
>>>>>>>>  private:
>>>>>>>>      page(document_private *doc, int index);
>>>>>>>>
>>>>>>>> diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp
>>>>>>>> index 9461ab9..5da2e1d 100644
>>>>>>>> --- a/cpp/poppler-page.cpp
>>>>>>>> +++ b/cpp/poppler-page.cpp
>>>>>>>> @@ -291,14 +291,13 @@ ustring page::text(const rectf &r,
>>>>>>>> text_layout_enum
>>>>>>>> layout_mode) const
>>>>>>>>
>>>>>>>>   */
>>>>>>>>  
>>>>>>>>  text_box::text_box(const ustring& text, const rectf &bbox)
>>>>>>>>  {
>>>>>>>>
>>>>>>>> -    m_data = new text_box_data();
>>>>>>>> +    m_data = std::unique_ptr<text_box_data>(new text_box_data());
>>>>>>>>
>>>>>>>>      m_data->text = text;
>>>>>>>>      m_data->bbox = bbox;
>>>>>>>>  
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  text_box::~text_box()
>>>>>>>>  {
>>>>>>>>
>>>>>>>> -    delete m_data;
>>>>>>>>
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  ustring text_box::text() const
>>>>>>>>
>>>>>>>> @@ -311,11 +310,6 @@ rectf text_box::bbox() const
>>>>>>>>
>>>>>>>>      return m_data->bbox;
>>>>>>>>  
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -text_box* text_box::next_text_box() const
>>>>>>>> -{
>>>>>>>> -    return m_data->next_text_box;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>
>>>>>>>>  rectf text_box::char_bbox(int i) const
>>>>>>>>  {
>>>>>>>>  
>>>>>>>>      return m_data->char_bboxes[i];
>>>>>>>>
>>>>>>>> @@ -326,10 +320,10 @@ bool text_box::has_space_after() const
>>>>>>>>
>>>>>>>>      return m_data->has_space_after;
>>>>>>>>  
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -std::vector<text_box*> page::text_list(rotation_enum rotate) const
>>>>>>>> +std::vector<std::unique_ptr<text_box>> page::text_list(rotation_enum
>>>>>>>> rotate) const>
>>>>>>>>
>>>>>>>>  {
>>>>>>>>  
>>>>>>>>      TextOutputDev *output_dev;
>>>>>>>>
>>>>>>>> -    std::vector<text_box*>  output_list;
>>>>>>>> +    std::vector<std::unique_ptr<text_box>>  output_list;
>>>>>>>>
>>>>>>>>      const int rotation_value = (int)rotate * 90;
>>>>>>>>      
>>>>>>>>      /* config values are same with Qt5 Page::TextList() */
>>>>>>>>
>>>>>>>> @@ -366,7 +360,7 @@ std::vector<text_box*>
>>>>>>>> page::text_list(rotation_enum
>>>>>>>> rotate) const
>>>>>>>>
>>>>>>>>  	double xMin, yMin, xMax, yMax;
>>>>>>>>  	word->getBBox(&xMin, &yMin, &xMax, &yMax);
>>>>>>>>
>>>>>>>> -	text_box* tb = new text_box(ustr, rectf(xMin, yMin, xMax-xMin,
>>>>>>>> yMax-yMin)); +	std::unique_ptr<text_box> tb =
>>>>>>>> std::unique_ptr<text_box>(new text_box(ustr, rectf(xMin, yMin,
>>>>>>>> xMax-xMin,
>>>>>>>> yMax-yMin)));
>>>>>>>>
>>>>>>>>  	tb->m_data->has_space_after = (word->hasSpaceAfter() == gTrue);
>>>>>>>>  	
>>>>>>>>  	tb->m_data->char_bboxes.reserve(word->getLength());
>>>>>>>>
>>>>>>>> @@ -375,10 +369,7 @@ std::vector<text_box*>
>>>>>>>> page::text_list(rotation_enum
>>>>>>>> rotate) const
>>>>>>>>
>>>>>>>>  	    tb->m_data->char_bboxes.push_back(rectf(xMin, yMin, xMax-
>> xMin,
>>>>>>>>  	    yMax-yMin));
>>>>>>>>  	
>>>>>>>>  	}
>>>>>>>>
>>>>>>>> -	if (output_list.size() > 0)
>>>>>>>> -	    output_list.back()->m_data->next_text_box = tb;
>>>>>>>> -
>>>>>>>> -	output_list.push_back(tb);
>>>>>>>> +	output_list.push_back(std::move(tb));
>>>>>>>>
>>>>>>>>      }
>>>>>>>>      delete word_list;
>>>>>>>>      delete output_dev;
>>>>>>>>
>>>>>>>> diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp
>>>>>>>> index 09b180d..d26a0ef 100644
>>>>>>>> --- a/cpp/tests/poppler-dump.cpp
>>>>>>>> +++ b/cpp/tests/poppler-dump.cpp
>>>>>>>> @@ -333,7 +333,7 @@ static void print_page_text_list(poppler::page *p)
>>>>>>>>
>>>>>>>>          std::cout << std::endl;
>>>>>>>>          return;
>>>>>>>>      
>>>>>>>>      }
>>>>>>>>
>>>>>>>> -    std::vector<poppler::text_box*> text_list =
>>>>>>>> p->text_list(poppler::rotate_0); +
>>>>>>>> std::vector<std::unique_ptr<poppler::text_box>> text_list =
>>>>>>>> p->text_list(poppler::rotate_0);
>>>>>>>>
>>>>>>>>      std::cout << "---" << std::endl;
>>>>>>>>      for (size_t i = 0; i < text_list.size(); i ++) {
>>>>>>>>
>>>>>>>> @@ -343,7 +343,6 @@ static void print_page_text_list(poppler::page *p)
>>>>>>>>
>>>>>>>>          std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w="
>>>>>>>>          <<
>>>>>>>>
>>>>>>>> bbox.width() << " h=" << bbox.height() << " )";
>>>>>>>>
>>>>>>>>          std::cout << std::endl;
>>>>>>>>
>>>>>>>> -        delete text_list[i];
>>>>>>>>
>>>>>>>>      }
>>>>>>>>      std::cout << "---" << std::endl;
>>>>>>>>  
>>>>>>>>  }
>>>>>>>>
>>>>>>>> Adam Reichold wrote:
>>>>>>>>> Hello mpsuzuki,
>>>>>>>>>
>>>>>>>>> std::shared_ptr was introduced in C++11 and Poppler does use C++11
>>>>>>>>> only
>>>>>>>>> since a relatively short time, so I guess there just was no time to
>>>>>>>>> use
>>>>>>>>> it anywhere.
>>>>>>>>>
>>>>>>>>> In any case it is a pretty heavy-handed solution due to the atomic
>>>>>>>>> memory accesses on the reference counter which will slow down even
>>>>>>>>> single-threaded programs which do not need it. Hence I would suggest
>>>>>>>>> starting with std::unique_ptr to get a move-only type.
>>>>>>>>>
>>>>>>>>> Best regards, Adam.
>>>>>>>>>
>>>>>>>>> Am 02.01.2018 um 08:19 schrieb suzuki toshiya:
>>>>>>>>>> Dear Albert, Adam,
>>>>>>>>>>
>>>>>>>>>> Thank you for the helpful review. I should have more
>>>>>>>>>> training on C++... Please let me spend to improve the
>>>>>>>>>> patch.
>>>>>>>>>>
>>>>>>>>>>> (Another easy option albeit with some
>>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual
>>>>>>>>>>> data
>>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>>>>>>> Yeah, from some posts in stackoverflow, I found about
>>>>>>>>>> shared_ptr... But it seems that current poppler does
>>>>>>>>>> not use shared_ptr at all. Is this a coding policy?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> mpsuzuki
>>>>>>>>>>
>>>>>>>>>> Adam Reichold wrote:
>>>>>>>>>>> Hello
>>>>>>>>>>>
>>>>>>>>>>> Am 02.01.2018 um 01:06 schrieb Albert Astals Cid:
>>>>>>>>>>>> El diumenge, 31 de desembre de 2017, a les 0:45:16 CET, suzuki
>>>>>>>>>>>> toshiya
>>>>>>>>>>>> va
>>>>>>>>>>>>
>>>>>>>>>>>> escriure:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've tried to implement the suggestion, I attached my current
>>>>>>>>>>>>> patch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As suggested, the most part is just copied from Qt frontend and
>>>>>>>>>>>>> renamed,
>>>>>>>>>>>>> except of one point: TextBox.nextWord() looks slightly confusing,
>>>>>>>>>>>>> because the returned object is a pointer to TextBox. I wrote
>>>>>>>>>>>>> text_box.next_text_box() and a macro text_box.next_word() which
>>>>>>>>>>>>> calls next_text_box() internally.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another point I want to discuss is the design of the list give by
>>>>>>>>>>>>> poppler::page::text_list(). In Qt frontend, Page::textList()
>>>>>>>>>>>>> returns
>>>>>>>>>>>>> QList<TextBox*>. For similarity, current patch returns
>>>>>>>>>>>>> std::vector<text_box*> for similarity to Qt frontend.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But, if we return the vector of pointers, the client should
>>>>>>>>>>>>> destruct
>>>>>>>>>>>>> the objects pointed by the vector, before destructing vector
>>>>>>>>>>>>> itself.
>>>>>>>>>>>>> Using a vector of text_box (not the pointer but the object
>>>>>>>>>>>>> itself),
>>>>>>>>>>>>> like std::vector<text_box>, could be better, because the
>>>>>>>>>>>>> destructor
>>>>>>>>>>>>> of the vector would internally call the destructor for text_box
>>>>>>>>>>>>> object.
>>>>>>>>>>>>> (Qt has qDeleteAll(), but I think std::vector does not have
>>>>>>>>>>>>> such).
>>>>>>>>>>>>> If I'm misunderstanding about C++, please correct.
>>>>>>>>>>>> First you need to fix the text_box class, you either need to make
>>>>>>>>>>>> it
>>>>>>>>>>>> uncopiable (and thus making your suggestion not possible) or you
>>>>>>>>>>>> need
>>>>>>>>>>>> to make the copy and assignment operators not break your class
>>>>>>>>>>>> since
>>>>>>>>>>>> at the moment the default operators will just copy the pointer and
>>>>>>>>>>>> everything will break because you will end up with a dangling
>>>>>>>>>>>> pointer.
>>>>>>>>>>>>
>>>>>>>>>>>> If you go for the second option you'll want the move constructor
>>>>>>>>>>>> and
>>>>>>>>>>>> move
>>>>>>>>>>>> operator too.
>>>>>>>>>>> Should the first option also be viable if text_box becomes a
>>>>>>>>>>> move-only
>>>>>>>>>>> type, i.e. no copying but move construction and assignment? I think
>>>>>>>>>>> std::vector can handle it. (Another easy option albeit with some
>>>>>>>>>>> overhead to get automatic destruction would be to wrap the actual
>>>>>>>>>>> data
>>>>>>>>>>> inside a std::shared_ptr and hence use reference counting.)
>>>>>>>>>>>
>>>>>>>>>>> Best regards, Adam.
>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>>   Albert
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> mpsuzuki
>>>>>>>>>>>>>
>>>>>>>>>>>>> Albert Astals Cid wrote:
>>>>>>>>>>>>>> El dimecres, 27 de desembre de 2017, a les 12:26:25 CET, Jeroen
>>>>>>>>>>>>>> Ooms
>>>>>>>>>>>>>> va
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> escriure:
>>>>>>>>>>>>>>> Is there a method in poppler-cpp to extract text from a pdf
>>>>>>>>>>>>>>> document,
>>>>>>>>>>>>>>> including the position of each text box? Currently we use
>>>>>>>>>>>>>>> page->text()
>>>>>>>>>>>>>>> with page::physical_layout which gives all text per page, but I
>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>> more detailed information about each text box per page.
>>>>>>>>>>>>>> You want to code the variant of qt5 frontend
>>>>>>>>>>>>>> Poppler::Page::textList() for
>>>>>>>>>>>>>> cpp frontend, it shouldn't be that hard getting inspiration
>>>>>>>>>>>>>> (i.e.
>>>>>>>>>>>>>> almost-copying) the code, do you have time for it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   Albert
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> poppler mailing list
>>>>>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> poppler mailing list
>>>>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> poppler mailing list
>>>>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
>>
>>
>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler



More information about the poppler mailing list