[poppler] [RFC] Replace GooHash by std::unordered_map

Albert Astals Cid aacid at kde.org
Sun Mar 4 21:59:20 UTC 2018


El dissabte, 3 de març de 2018, a les 20:51:55 CET, Adam Reichold va escriure:
> Hello Albert,
> 
> please note that in the previous patch, I made a mistake with the move
> constructor (the ref count should not be moved/swapped since it is tied
> to the instance, not its value). Attached is a fixed patch.
> 
> I found the issue during further performance tests which however only
> continued to strengthen to zero sum result. In any case, I extend the
> Python-based perftest script with a C++-based driver to be able to
> reliably measure single-page runtimes for this, which I will also attach
> if anyone is interested in this.

Ok, so it seems there's a mild agreement (or at least no strong disagreement) 
that this may be the way forward. I will first wait to get poppler 0.63 out of 
the door (that is already late several months, between freedesktop shutting 
down ssh access, them forgetting to tell us it was open again, my disk dying 
and then me procastinating as usual) and once it's out I'll come back and 
reevaluate this.

Cheers,
  Albert

> 
> Best regards, Adam.
> 
> Am 03.03.2018 um 20:29 schrieb Albert Astals Cid:
> > Ha! didn't realize you had already done all that, will read/answer the
> > other email.
> > 
> > Cheers,
> > 
> >   Albert
> > 
> > El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid va
> > 
> > escriure:
> >> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
> >> 
> >> escriure:
> >>> Hello again,
> >>> 
> >>> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
> >>>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
> >>>> 
> >>>> escriure:
> >>>>> Hello again,
> >>>>> 
> >>>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
> >>>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
> >>>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam Reichold
> >>>>>>> va
> >>>>>>> 
> >>>>>>> escriure:
> >>>>>>>> Hello,
> >>>>>>>> 
> >>>>>>>> the attached patch replaced Poppler's internal hash table
> >>>>>>>> implementation
> >>>>>>>> GooHash by std::unordered_map found in the C++ standard library
> >>>>>>>> since
> >>>>>>>> C++11. This continues Poppler's slow drift towards standard library
> >>>>>>>> containers and removes one of the home-grown data structures with
> >>>>>>>> the
> >>>>>>>> main goals of reducing code size and improving the long term
> >>>>>>>> maintainability of the code base.
> >>>>>>> 
> >>>>>>> Do you have any benchmarks on "rendering" speed and code size?
> >>>>>> 
> >>>>>> Sorry, not at hand. I will try to prepare them during the week.
> >>>>> 
> >>>>> I did run Splash rendering benchmarks of this branch against master
> >>>>> with
> >>>>> the result of rendering the circa 2400 documents of the TeXLive
> >>>> 
> >>>>> documentation present on my machine being:
> >>>> I'm wondering if those 2400 documents are diverse enough, which they
> >>>> may
> >>>> not be given they are all coming from "the same place".
> >>> 
> >>> They seem pretty diverse w.r.t. content, some being text heavy and some
> >>> graphics rich. But I guess they are definitely not diverse technically
> >>> as all are prepared using TeX itself.
> >>> 
> >>> The main problem on my side is that I have failed to find my DVD copy of
> >>> the Poppler regtest document collection until now. :-\ In any case, I am
> >>> reasonably confident on the zero sum result since GooHash does not seem
> >>> to live in any performance critical code path.
> >>> 
> >>>>> Cumulative run time:
> >>>>>         Result: 90.95 min ∓ 1.1 %
> >>>>>         Reference: 91.57 min ∓ 1.2 %
> >>>>>         Deviation: -0.0 %
> >>>>> 
> >>>>> Cumulative memory usage:
> >>>>>         Result: 37.2 MB ∓ 0.7 %
> >>>>>         Reference: 37.0 MB ∓ 0.7 %
> >>>>>         Deviation: +0.0 %
> >>>>> 
> >>>>> (Where result is this patch and the reference is master.) (The
> >>>>> measurement was taken using the perftest script which I proposed here
> >>>>> some time ago for which I'll attach the patch again for
> >>>>> reproduceability.)
> >>>> 
> >>>> Did you really attach this before? i really don't remember it :D
> >>> 
> >>> This was a very long-winded thread ending on 2016-01-01 centered around
> >>> the regtest framework. It went from custom driver using QTest's
> >>> benchmark functionality to custom backend in the regtest framework to
> >>> separate Python script. The script still has problems to reliably
> >>> measure really short things like running "pdftotext" for which a custom
> >>> C++ driver might be needed, but for long-running stuff like "pdftoppm",
> >>> the results seem reasonable.
> >>> 
> >>>> Anyhow it seems the change is mostly a zero-sum runtime wise.
> >>> 
> >>> Indeed. (Although thinking really really long term, I think a Poppler
> >>> that is using std::vector, std::string and friends everywhere, could
> >>> loose a lot of distributed fat in the form of a lot of memory
> >>> indirections and benefit from the highly optimized standard library. But
> >>> it just is not a single patch thing to reap any of these benefits.)
> >>> 
> >>>> Which bring us to the code side. First i know it sucks but your spacing
> >>>> is
> >>>> broken, please check the whole patch
> >>>> 
> >>>> -	nameToGID->removeInt(macGlyphNames[j]);
> >>>> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
> >>>> +          nameToGID[macGlyphNames[j]] = i;
> >>>> 
> >>>> i could fix it, but it's better (for me) if you do :D
> >>> 
> >>> Definitely for me to fix. The main problem is that the spacing in those
> >>> files was already broken and I am unsure whether I should fix the
> >>> surrounding spacing even if I am not touching it otherwise. Due to that,
> >>> there sometimes is no correct way in the current patch. If you do not
> >>> say otherwise, I will try to make at least the touched blocks of code
> >>> consistent.
> >> 
> >> Are you sure the spacing is broken? I'd say it's just xpdf weird spacing
> >> rules of using 2 soft spaces and then hard tabs at 8.
> >> 
> >>>> Now onto the code,
> >>>> 
> >>>>   const auto emplaceRangeMap = [&](const char* encodingName, GBool
> >>>>   unicodeOut,>
> >>>> 
> >>>> UnicodeMapRange* ranges, int len) {
> >>>> 
> >>>>     residentUnicodeMaps.emplace(
> >>>>     
> >>>>       std::piecewise_construct,
> >>>>       std::forward_as_tuple(encodingName),
> >>>>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
> >>>>     
> >>>>     );
> >>>>   
> >>>>   };
> >>>> 
> >>>> It's the first time in my life i see std::piecewise_construct and
> >>>> std::forward_as_tuple, yes that probably makes me a bad C++ developer,
> >>>> but
> >>>> given there's lots of us around, it doesn't make me happy now i don't
> >>>> understand what the code does.
> >>> 
> >>> The problem is that most internal Poppler types lack proper construction
> >>> and assignment operators, hence I need to work harder to construct those
> >>> UnicodeMap instances in-place inside the map (GooHash just stored a
> >>> pointer to it, so there was no problem.)
> >>> 
> >>> The whole piecewise_construct and forward_as_tuple dance just ensures,
> >>> that those parameters to emplace are properly grouped before being
> >>> unpacked to become the parameters of std::string and UnicodeMap
> >>> construction again. If UnicodeMap was movable, this would probably look
> >>> like
> >>> 
> >>> residentUnicodeMaps.emplace(
> >>> 
> >>>   encodingName,
> >>>   UnicodeMap{encodingName, unicodeOut, ranges, len}
> >>> 
> >>> );
> >>> 
> >>> If you like, I can try to make Unicode a move-only type and simplify the
> >>> mentioned helper functions?
> >> 
> >> you can give it a try, not sure how easy that's going to be.
> >> 
> >>>> Then there's the benefit/risk ratio. The code using GooHash is mostly
> >>>> rocksolid and we haven't probably touched any line in it for a long
> >>>> time
> >>>> and we have probably neither written new code that uses GooHash.
> >>>> 
> >>>> In that regard we risk breaking working code.
> >>>> 
> >>>> The benefit is not more speed nor less memory usage as your
> >>>> measurements
> >>>> show.
> >>>> 
> >>>> Mostly the benefit is "removing code from maintainership", which i
> >>>> agree
> >>>> is a good thing, but as mentioned before it's some code "we mostly
> >>>> ignore", so it's not that much of a good thing.
> >>> 
> >>> I very much agree with the risk assessment.
> >>> 
> >>> But I also think the code will ossify (or maybe already is?) due to
> >>> those custom data structures and the less than idiomatic C++ usage.
> >>> Hence I think, Poppler would not just loose code, but the remaining code
> >>> should become easier to maintain. (Of course, the piecewise_construct
> >>> fiasco shows that this has intermediate costs. But again, I think this
> >>> is just an incentive to provide types with the usual C++ semantics which
> >>> should make all code using those types better.)
> >> 
> >> Yeah, i guess you're right.
> >> 
> >> Cheers,
> >> 
> >>   Albert
> >>> 
> >>> Best regards, Adam.
> >>> 
> >>>> So i'm hesitant of what to do. It really sounds like a nice thing to
> >>>> not
> >>>> have custom structures, but on the other hand it's not like they get
> >>>> much
> >>>> in the way of coding.
> >>>> 
> >>>> I'd really appreciate other people's comments on this.
> >>>> 
> >>>> Cheers,
> >>>> 
> >>>>   Albert
> >>>>> 
> >>>>> I'll also attach the detailed comparison, but the gist seems to be
> >>>>> that
> >>>>> if there are significant changes, the run time is reduced but the
> >>>>> memory
> >>>>> usage is increased in the majority of cases. But this does not seem to
> >>>>> show up in the cumulative results.
> >>>>> 
> >>>>> Best regards, Adam.
> >>>>> 
> >>>>> P.S.: One could try to improve the memory usage by tuning the load
> >>>>> factor or calling shrink_to_fit where appropriate. Would you like me
> >>>>> to
> >>>>> try to do this?
> >>>>> 
> >>>>> P.P.S.: One obvious area for improvement would be better
> >>>>> interoperability between GooString and std::string, i.e. not
> >>>>> converting
> >>>>> them as C strings so that the length information does not need to be
> >>>>> recomputed. I will try to prepare this as a separate patch on top of
> >>>>> this one or should I include this here?
> >>>>> 
> >>>>> Best regards, Adam.
> >>>>> 
> >>>>>> Concerning code size, a release build of libpoppler.so goes from
> >>>>>> 
> >>>>>>    text    data     bss     dec     hex filename
> >>>>>> 
> >>>>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
> >>>>>> 
> >>>>>> for the current master to
> >>>>>> 
> >>>>>>    text    data     bss     dec     hex filename
> >>>>>> 
> >>>>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
> >>>>>> 
> >>>>>> with the patch applied, i.e. a 0.65% increase in binary size.
> >>>>>> 
> >>>>>> 
> >>>>>> Please note that in my original message, I was referring only to
> >>>>>> source
> >>>>>> code size, i.e.
> >>>>>> 
> >>>>>> git diff --stat master...remove-goo-hash
> >>>>>> 
> >>>>>>  18 files changed, 168 insertions(+), 803 deletions(-)
> >>>>>>  
> >>>>>>> Cheers,
> >>>>>>> 
> >>>>>>>   Albert
> >>>>>> 
> >>>>>> Best regards, Adam.
> >>>>>> 
> >>>>>>>> Best regards, Adam.
> >>>>>>> 
> >>>>>>> _______________________________________________
> >>>>>>> poppler mailing list
> >>>>>>> poppler at lists.freedesktop.org
> >>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> poppler mailing list
> >>>>>> poppler at lists.freedesktop.org
> >>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>> 
> >>>> _______________________________________________
> >>>> poppler mailing list
> >>>> poppler at lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >> 
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/poppler
> > 
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/poppler






More information about the poppler mailing list