[poppler] [RFC] Replace GooHash by std::unordered_map

Albert Astals Cid aacid at kde.org
Tue May 1 21:57:01 UTC 2018


El diumenge, 4 de març de 2018, a les 22:59:20 CEST, Albert Astals Cid va 
escriure:
> El dissabte, 3 de març de 2018, a les 20:51:55 CET, Adam Reichold va 
escriure:
> > Hello Albert,
> > 
> > please note that in the previous patch, I made a mistake with the move
> > constructor (the ref count should not be moved/swapped since it is tied
> > to the instance, not its value). Attached is a fixed patch.
> > 
> > I found the issue during further performance tests which however only
> > continued to strengthen to zero sum result. In any case, I extend the
> > Python-based perftest script with a C++-based driver to be able to
> > reliably measure single-page runtimes for this, which I will also attach
> > if anyone is interested in this.
> 
> Ok, so it seems there's a mild agreement (or at least no strong
> disagreement) that this may be the way forward. I will first wait to get
> poppler 0.63 out of the door (that is already late several months, between
> freedesktop shutting down ssh access, them forgetting to tell us it was
> open again, my disk dying and then me procastinating as usual) and once
> it's out I'll come back and reevaluate this.

I was trying to apply the patch for running a quick regtest but running git am 
over it gave me some warning/error I wasn't sure how to get out of correctly.

Could you try rebasing the patch on top of current master?

Cheers,
  Albert

> 
> Cheers,
>   Albert
> 
> > Best regards, Adam.
> > 
> > Am 03.03.2018 um 20:29 schrieb Albert Astals Cid:
> > > Ha! didn't realize you had already done all that, will read/answer the
> > > other email.
> > > 
> > > Cheers,
> > > 
> > >   Albert
> > > 
> > > El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid va
> > > 
> > > escriure:
> > >> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
> > >> 
> > >> escriure:
> > >>> Hello again,
> > >>> 
> > >>> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
> > >>>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
> > >>>> 
> > >>>> escriure:
> > >>>>> Hello again,
> > >>>>> 
> > >>>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
> > >>>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
> > >>>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam
> > >>>>>>> Reichold
> > >>>>>>> va
> > >>>>>>> 
> > >>>>>>> escriure:
> > >>>>>>>> Hello,
> > >>>>>>>> 
> > >>>>>>>> the attached patch replaced Poppler's internal hash table
> > >>>>>>>> implementation
> > >>>>>>>> GooHash by std::unordered_map found in the C++ standard library
> > >>>>>>>> since
> > >>>>>>>> C++11. This continues Poppler's slow drift towards standard
> > >>>>>>>> library
> > >>>>>>>> containers and removes one of the home-grown data structures with
> > >>>>>>>> the
> > >>>>>>>> main goals of reducing code size and improving the long term
> > >>>>>>>> maintainability of the code base.
> > >>>>>>> 
> > >>>>>>> Do you have any benchmarks on "rendering" speed and code size?
> > >>>>>> 
> > >>>>>> Sorry, not at hand. I will try to prepare them during the week.
> > >>>>> 
> > >>>>> I did run Splash rendering benchmarks of this branch against master
> > >>>>> with
> > >>>>> the result of rendering the circa 2400 documents of the TeXLive
> > >>>> 
> > >>>>> documentation present on my machine being:
> > >>>> I'm wondering if those 2400 documents are diverse enough, which they
> > >>>> may
> > >>>> not be given they are all coming from "the same place".
> > >>> 
> > >>> They seem pretty diverse w.r.t. content, some being text heavy and
> > >>> some
> > >>> graphics rich. But I guess they are definitely not diverse technically
> > >>> as all are prepared using TeX itself.
> > >>> 
> > >>> The main problem on my side is that I have failed to find my DVD copy
> > >>> of
> > >>> the Poppler regtest document collection until now. :-\ In any case, I
> > >>> am
> > >>> reasonably confident on the zero sum result since GooHash does not
> > >>> seem
> > >>> to live in any performance critical code path.
> > >>> 
> > >>>>> Cumulative run time:
> > >>>>>         Result: 90.95 min ∓ 1.1 %
> > >>>>>         Reference: 91.57 min ∓ 1.2 %
> > >>>>>         Deviation: -0.0 %
> > >>>>> 
> > >>>>> Cumulative memory usage:
> > >>>>>         Result: 37.2 MB ∓ 0.7 %
> > >>>>>         Reference: 37.0 MB ∓ 0.7 %
> > >>>>>         Deviation: +0.0 %
> > >>>>> 
> > >>>>> (Where result is this patch and the reference is master.) (The
> > >>>>> measurement was taken using the perftest script which I proposed
> > >>>>> here
> > >>>>> some time ago for which I'll attach the patch again for
> > >>>>> reproduceability.)
> > >>>> 
> > >>>> Did you really attach this before? i really don't remember it :D
> > >>> 
> > >>> This was a very long-winded thread ending on 2016-01-01 centered
> > >>> around
> > >>> the regtest framework. It went from custom driver using QTest's
> > >>> benchmark functionality to custom backend in the regtest framework to
> > >>> separate Python script. The script still has problems to reliably
> > >>> measure really short things like running "pdftotext" for which a
> > >>> custom
> > >>> C++ driver might be needed, but for long-running stuff like
> > >>> "pdftoppm",
> > >>> the results seem reasonable.
> > >>> 
> > >>>> Anyhow it seems the change is mostly a zero-sum runtime wise.
> > >>> 
> > >>> Indeed. (Although thinking really really long term, I think a Poppler
> > >>> that is using std::vector, std::string and friends everywhere, could
> > >>> loose a lot of distributed fat in the form of a lot of memory
> > >>> indirections and benefit from the highly optimized standard library.
> > >>> But
> > >>> it just is not a single patch thing to reap any of these benefits.)
> > >>> 
> > >>>> Which bring us to the code side. First i know it sucks but your
> > >>>> spacing
> > >>>> is
> > >>>> broken, please check the whole patch
> > >>>> 
> > >>>> -	nameToGID->removeInt(macGlyphNames[j]);
> > >>>> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
> > >>>> +          nameToGID[macGlyphNames[j]] = i;
> > >>>> 
> > >>>> i could fix it, but it's better (for me) if you do :D
> > >>> 
> > >>> Definitely for me to fix. The main problem is that the spacing in
> > >>> those
> > >>> files was already broken and I am unsure whether I should fix the
> > >>> surrounding spacing even if I am not touching it otherwise. Due to
> > >>> that,
> > >>> there sometimes is no correct way in the current patch. If you do not
> > >>> say otherwise, I will try to make at least the touched blocks of code
> > >>> consistent.
> > >> 
> > >> Are you sure the spacing is broken? I'd say it's just xpdf weird
> > >> spacing
> > >> rules of using 2 soft spaces and then hard tabs at 8.
> > >> 
> > >>>> Now onto the code,
> > >>>> 
> > >>>>   const auto emplaceRangeMap = [&](const char* encodingName, GBool
> > >>>>   unicodeOut,>
> > >>>> 
> > >>>> UnicodeMapRange* ranges, int len) {
> > >>>> 
> > >>>>     residentUnicodeMaps.emplace(
> > >>>>     
> > >>>>       std::piecewise_construct,
> > >>>>       std::forward_as_tuple(encodingName),
> > >>>>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
> > >>>>     
> > >>>>     );
> > >>>>   
> > >>>>   };
> > >>>> 
> > >>>> It's the first time in my life i see std::piecewise_construct and
> > >>>> std::forward_as_tuple, yes that probably makes me a bad C++
> > >>>> developer,
> > >>>> but
> > >>>> given there's lots of us around, it doesn't make me happy now i don't
> > >>>> understand what the code does.
> > >>> 
> > >>> The problem is that most internal Poppler types lack proper
> > >>> construction
> > >>> and assignment operators, hence I need to work harder to construct
> > >>> those
> > >>> UnicodeMap instances in-place inside the map (GooHash just stored a
> > >>> pointer to it, so there was no problem.)
> > >>> 
> > >>> The whole piecewise_construct and forward_as_tuple dance just ensures,
> > >>> that those parameters to emplace are properly grouped before being
> > >>> unpacked to become the parameters of std::string and UnicodeMap
> > >>> construction again. If UnicodeMap was movable, this would probably
> > >>> look
> > >>> like
> > >>> 
> > >>> residentUnicodeMaps.emplace(
> > >>> 
> > >>>   encodingName,
> > >>>   UnicodeMap{encodingName, unicodeOut, ranges, len}
> > >>> 
> > >>> );
> > >>> 
> > >>> If you like, I can try to make Unicode a move-only type and simplify
> > >>> the
> > >>> mentioned helper functions?
> > >> 
> > >> you can give it a try, not sure how easy that's going to be.
> > >> 
> > >>>> Then there's the benefit/risk ratio. The code using GooHash is mostly
> > >>>> rocksolid and we haven't probably touched any line in it for a long
> > >>>> time
> > >>>> and we have probably neither written new code that uses GooHash.
> > >>>> 
> > >>>> In that regard we risk breaking working code.
> > >>>> 
> > >>>> The benefit is not more speed nor less memory usage as your
> > >>>> measurements
> > >>>> show.
> > >>>> 
> > >>>> Mostly the benefit is "removing code from maintainership", which i
> > >>>> agree
> > >>>> is a good thing, but as mentioned before it's some code "we mostly
> > >>>> ignore", so it's not that much of a good thing.
> > >>> 
> > >>> I very much agree with the risk assessment.
> > >>> 
> > >>> But I also think the code will ossify (or maybe already is?) due to
> > >>> those custom data structures and the less than idiomatic C++ usage.
> > >>> Hence I think, Poppler would not just loose code, but the remaining
> > >>> code
> > >>> should become easier to maintain. (Of course, the piecewise_construct
> > >>> fiasco shows that this has intermediate costs. But again, I think this
> > >>> is just an incentive to provide types with the usual C++ semantics
> > >>> which
> > >>> should make all code using those types better.)
> > >> 
> > >> Yeah, i guess you're right.
> > >> 
> > >> Cheers,
> > >> 
> > >>   Albert
> > >>> 
> > >>> Best regards, Adam.
> > >>> 
> > >>>> So i'm hesitant of what to do. It really sounds like a nice thing to
> > >>>> not
> > >>>> have custom structures, but on the other hand it's not like they get
> > >>>> much
> > >>>> in the way of coding.
> > >>>> 
> > >>>> I'd really appreciate other people's comments on this.
> > >>>> 
> > >>>> Cheers,
> > >>>> 
> > >>>>   Albert
> > >>>>> 
> > >>>>> I'll also attach the detailed comparison, but the gist seems to be
> > >>>>> that
> > >>>>> if there are significant changes, the run time is reduced but the
> > >>>>> memory
> > >>>>> usage is increased in the majority of cases. But this does not seem
> > >>>>> to
> > >>>>> show up in the cumulative results.
> > >>>>> 
> > >>>>> Best regards, Adam.
> > >>>>> 
> > >>>>> P.S.: One could try to improve the memory usage by tuning the load
> > >>>>> factor or calling shrink_to_fit where appropriate. Would you like me
> > >>>>> to
> > >>>>> try to do this?
> > >>>>> 
> > >>>>> P.P.S.: One obvious area for improvement would be better
> > >>>>> interoperability between GooString and std::string, i.e. not
> > >>>>> converting
> > >>>>> them as C strings so that the length information does not need to be
> > >>>>> recomputed. I will try to prepare this as a separate patch on top of
> > >>>>> this one or should I include this here?
> > >>>>> 
> > >>>>> Best regards, Adam.
> > >>>>> 
> > >>>>>> Concerning code size, a release build of libpoppler.so goes from
> > >>>>>> 
> > >>>>>>    text    data     bss     dec     hex filename
> > >>>>>> 
> > >>>>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
> > >>>>>> 
> > >>>>>> for the current master to
> > >>>>>> 
> > >>>>>>    text    data     bss     dec     hex filename
> > >>>>>> 
> > >>>>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
> > >>>>>> 
> > >>>>>> with the patch applied, i.e. a 0.65% increase in binary size.
> > >>>>>> 
> > >>>>>> 
> > >>>>>> Please note that in my original message, I was referring only to
> > >>>>>> source
> > >>>>>> code size, i.e.
> > >>>>>> 
> > >>>>>> git diff --stat master...remove-goo-hash
> > >>>>>> 
> > >>>>>>  18 files changed, 168 insertions(+), 803 deletions(-)
> > >>>>>>  
> > >>>>>>> Cheers,
> > >>>>>>> 
> > >>>>>>>   Albert
> > >>>>>> 
> > >>>>>> Best regards, Adam.
> > >>>>>> 
> > >>>>>>>> Best regards, Adam.
> > >>>>>>> 
> > >>>>>>> _______________________________________________
> > >>>>>>> poppler mailing list
> > >>>>>>> poppler at lists.freedesktop.org
> > >>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> > >>>>>> 
> > >>>>>> _______________________________________________
> > >>>>>> poppler mailing list
> > >>>>>> poppler at lists.freedesktop.org
> > >>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> > >>>> 
> > >>>> _______________________________________________
> > >>>> poppler mailing list
> > >>>> poppler at lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> > >> 
> > >> _______________________________________________
> > >> poppler mailing list
> > >> poppler at lists.freedesktop.org
> > >> https://lists.freedesktop.org/mailman/listinfo/poppler
> > > 
> > > _______________________________________________
> > > poppler mailing list
> > > poppler at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler






More information about the poppler mailing list