[poppler] [RFC] Replace GooHash by std::unordered_map

Albert Astals Cid aacid at kde.org
Wed May 2 22:57:21 UTC 2018


El dimecres, 2 de maig de 2018, a les 19:11:19 CEST, Adam Reichold va 
escriure:
> Hello,
> 
> Am 01.05.2018 um 23:57 schrieb Albert Astals Cid:
> > El diumenge, 4 de març de 2018, a les 22:59:20 CEST, Albert Astals Cid va
> > 
> > escriure:
> >> El dissabte, 3 de març de 2018, a les 20:51:55 CET, Adam Reichold va
> > 
> > escriure:
> >>> Hello Albert,
> >>> 
> >>> please note that in the previous patch, I made a mistake with the move
> >>> constructor (the ref count should not be moved/swapped since it is tied
> >>> to the instance, not its value). Attached is a fixed patch.
> >>> 
> >>> I found the issue during further performance tests which however only
> >>> continued to strengthen to zero sum result. In any case, I extend the
> >>> Python-based perftest script with a C++-based driver to be able to
> >>> reliably measure single-page runtimes for this, which I will also attach
> >>> if anyone is interested in this.
> >> 
> >> Ok, so it seems there's a mild agreement (or at least no strong
> >> disagreement) that this may be the way forward. I will first wait to get
> >> poppler 0.63 out of the door (that is already late several months,
> >> between
> >> freedesktop shutting down ssh access, them forgetting to tell us it was
> >> open again, my disk dying and then me procastinating as usual) and once
> >> it's out I'll come back and reevaluate this.
> > 
> > I was trying to apply the patch for running a quick regtest but running
> > git am over it gave me some warning/error I wasn't sure how to get out of
> > correctly.

regtest results are good, i guess i'll have a second quick look at the patches 
and unless someone disagrees i'll push them.

Cheers,
  Albert

> > 
> > Could you try rebasing the patch on top of current master?
> > 
> > Cheers,
> > 
> >   Albert
> 
> Attached is a version rebased against the current master.
> 
> Best regards,
> Adam
> 
> >> Cheers,
> >> 
> >>   Albert
> >>> 
> >>> Best regards, Adam.
> >>> 
> >>> Am 03.03.2018 um 20:29 schrieb Albert Astals Cid:
> >>>> Ha! didn't realize you had already done all that, will read/answer the
> >>>> other email.
> >>>> 
> >>>> Cheers,
> >>>> 
> >>>>   Albert
> >>>> 
> >>>> El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid
> >>>> va
> >>>> 
> >>>> escriure:
> >>>>> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
> >>>>> 
> >>>>> escriure:
> >>>>>> Hello again,
> >>>>>> 
> >>>>>> Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
> >>>>>>> El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold
> >>>>>>> va
> >>>>>>> 
> >>>>>>> escriure:
> >>>>>>>> Hello again,
> >>>>>>>> 
> >>>>>>>> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
> >>>>>>>>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
> >>>>>>>>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam
> >>>>>>>>>> Reichold
> >>>>>>>>>> va
> >>>>>>>>>> 
> >>>>>>>>>> escriure:
> >>>>>>>>>>> Hello,
> >>>>>>>>>>> 
> >>>>>>>>>>> the attached patch replaced Poppler's internal hash table
> >>>>>>>>>>> implementation
> >>>>>>>>>>> GooHash by std::unordered_map found in the C++ standard library
> >>>>>>>>>>> since
> >>>>>>>>>>> C++11. This continues Poppler's slow drift towards standard
> >>>>>>>>>>> library
> >>>>>>>>>>> containers and removes one of the home-grown data structures
> >>>>>>>>>>> with
> >>>>>>>>>>> the
> >>>>>>>>>>> main goals of reducing code size and improving the long term
> >>>>>>>>>>> maintainability of the code base.
> >>>>>>>>>> 
> >>>>>>>>>> Do you have any benchmarks on "rendering" speed and code size?
> >>>>>>>>> 
> >>>>>>>>> Sorry, not at hand. I will try to prepare them during the week.
> >>>>>>>> 
> >>>>>>>> I did run Splash rendering benchmarks of this branch against master
> >>>>>>>> with
> >>>>>>>> the result of rendering the circa 2400 documents of the TeXLive
> >>>>>>> 
> >>>>>>>> documentation present on my machine being:
> >>>>>>> I'm wondering if those 2400 documents are diverse enough, which they
> >>>>>>> may
> >>>>>>> not be given they are all coming from "the same place".
> >>>>>> 
> >>>>>> They seem pretty diverse w.r.t. content, some being text heavy and
> >>>>>> some
> >>>>>> graphics rich. But I guess they are definitely not diverse
> >>>>>> technically
> >>>>>> as all are prepared using TeX itself.
> >>>>>> 
> >>>>>> The main problem on my side is that I have failed to find my DVD copy
> >>>>>> of
> >>>>>> the Poppler regtest document collection until now. :-\ In any case, I
> >>>>>> am
> >>>>>> reasonably confident on the zero sum result since GooHash does not
> >>>>>> seem
> >>>>>> to live in any performance critical code path.
> >>>>>> 
> >>>>>>>> Cumulative run time:
> >>>>>>>>         Result: 90.95 min ∓ 1.1 %
> >>>>>>>>         Reference: 91.57 min ∓ 1.2 %
> >>>>>>>>         Deviation: -0.0 %
> >>>>>>>> 
> >>>>>>>> Cumulative memory usage:
> >>>>>>>>         Result: 37.2 MB ∓ 0.7 %
> >>>>>>>>         Reference: 37.0 MB ∓ 0.7 %
> >>>>>>>>         Deviation: +0.0 %
> >>>>>>>> 
> >>>>>>>> (Where result is this patch and the reference is master.) (The
> >>>>>>>> measurement was taken using the perftest script which I proposed
> >>>>>>>> here
> >>>>>>>> some time ago for which I'll attach the patch again for
> >>>>>>>> reproduceability.)
> >>>>>>> 
> >>>>>>> Did you really attach this before? i really don't remember it :D
> >>>>>> 
> >>>>>> This was a very long-winded thread ending on 2016-01-01 centered
> >>>>>> around
> >>>>>> the regtest framework. It went from custom driver using QTest's
> >>>>>> benchmark functionality to custom backend in the regtest framework to
> >>>>>> separate Python script. The script still has problems to reliably
> >>>>>> measure really short things like running "pdftotext" for which a
> >>>>>> custom
> >>>>>> C++ driver might be needed, but for long-running stuff like
> >>>>>> "pdftoppm",
> >>>>>> the results seem reasonable.
> >>>>>> 
> >>>>>>> Anyhow it seems the change is mostly a zero-sum runtime wise.
> >>>>>> 
> >>>>>> Indeed. (Although thinking really really long term, I think a Poppler
> >>>>>> that is using std::vector, std::string and friends everywhere, could
> >>>>>> loose a lot of distributed fat in the form of a lot of memory
> >>>>>> indirections and benefit from the highly optimized standard library.
> >>>>>> But
> >>>>>> it just is not a single patch thing to reap any of these benefits.)
> >>>>>> 
> >>>>>>> Which bring us to the code side. First i know it sucks but your
> >>>>>>> spacing
> >>>>>>> is
> >>>>>>> broken, please check the whole patch
> >>>>>>> 
> >>>>>>> -	nameToGID->removeInt(macGlyphNames[j]);
> >>>>>>> -	nameToGID->add(new GooString(macGlyphNames[j]), i);
> >>>>>>> +          nameToGID[macGlyphNames[j]] = i;
> >>>>>>> 
> >>>>>>> i could fix it, but it's better (for me) if you do :D
> >>>>>> 
> >>>>>> Definitely for me to fix. The main problem is that the spacing in
> >>>>>> those
> >>>>>> files was already broken and I am unsure whether I should fix the
> >>>>>> surrounding spacing even if I am not touching it otherwise. Due to
> >>>>>> that,
> >>>>>> there sometimes is no correct way in the current patch. If you do not
> >>>>>> say otherwise, I will try to make at least the touched blocks of code
> >>>>>> consistent.
> >>>>> 
> >>>>> Are you sure the spacing is broken? I'd say it's just xpdf weird
> >>>>> spacing
> >>>>> rules of using 2 soft spaces and then hard tabs at 8.
> >>>>> 
> >>>>>>> Now onto the code,
> >>>>>>> 
> >>>>>>>   const auto emplaceRangeMap = [&](const char* encodingName, GBool
> >>>>>>>   unicodeOut,>
> >>>>>>> 
> >>>>>>> UnicodeMapRange* ranges, int len) {
> >>>>>>> 
> >>>>>>>     residentUnicodeMaps.emplace(
> >>>>>>>     
> >>>>>>>       std::piecewise_construct,
> >>>>>>>       std::forward_as_tuple(encodingName),
> >>>>>>>       std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
> >>>>>>>     
> >>>>>>>     );
> >>>>>>>   
> >>>>>>>   };
> >>>>>>> 
> >>>>>>> It's the first time in my life i see std::piecewise_construct and
> >>>>>>> std::forward_as_tuple, yes that probably makes me a bad C++
> >>>>>>> developer,
> >>>>>>> but
> >>>>>>> given there's lots of us around, it doesn't make me happy now i
> >>>>>>> don't
> >>>>>>> understand what the code does.
> >>>>>> 
> >>>>>> The problem is that most internal Poppler types lack proper
> >>>>>> construction
> >>>>>> and assignment operators, hence I need to work harder to construct
> >>>>>> those
> >>>>>> UnicodeMap instances in-place inside the map (GooHash just stored a
> >>>>>> pointer to it, so there was no problem.)
> >>>>>> 
> >>>>>> The whole piecewise_construct and forward_as_tuple dance just
> >>>>>> ensures,
> >>>>>> that those parameters to emplace are properly grouped before being
> >>>>>> unpacked to become the parameters of std::string and UnicodeMap
> >>>>>> construction again. If UnicodeMap was movable, this would probably
> >>>>>> look
> >>>>>> like
> >>>>>> 
> >>>>>> residentUnicodeMaps.emplace(
> >>>>>> 
> >>>>>>   encodingName,
> >>>>>>   UnicodeMap{encodingName, unicodeOut, ranges, len}
> >>>>>> 
> >>>>>> );
> >>>>>> 
> >>>>>> If you like, I can try to make Unicode a move-only type and simplify
> >>>>>> the
> >>>>>> mentioned helper functions?
> >>>>> 
> >>>>> you can give it a try, not sure how easy that's going to be.
> >>>>> 
> >>>>>>> Then there's the benefit/risk ratio. The code using GooHash is
> >>>>>>> mostly
> >>>>>>> rocksolid and we haven't probably touched any line in it for a long
> >>>>>>> time
> >>>>>>> and we have probably neither written new code that uses GooHash.
> >>>>>>> 
> >>>>>>> In that regard we risk breaking working code.
> >>>>>>> 
> >>>>>>> The benefit is not more speed nor less memory usage as your
> >>>>>>> measurements
> >>>>>>> show.
> >>>>>>> 
> >>>>>>> Mostly the benefit is "removing code from maintainership", which i
> >>>>>>> agree
> >>>>>>> is a good thing, but as mentioned before it's some code "we mostly
> >>>>>>> ignore", so it's not that much of a good thing.
> >>>>>> 
> >>>>>> I very much agree with the risk assessment.
> >>>>>> 
> >>>>>> But I also think the code will ossify (or maybe already is?) due to
> >>>>>> those custom data structures and the less than idiomatic C++ usage.
> >>>>>> Hence I think, Poppler would not just loose code, but the remaining
> >>>>>> code
> >>>>>> should become easier to maintain. (Of course, the piecewise_construct
> >>>>>> fiasco shows that this has intermediate costs. But again, I think
> >>>>>> this
> >>>>>> is just an incentive to provide types with the usual C++ semantics
> >>>>>> which
> >>>>>> should make all code using those types better.)
> >>>>> 
> >>>>> Yeah, i guess you're right.
> >>>>> 
> >>>>> Cheers,
> >>>>> 
> >>>>>   Albert
> >>>>>> 
> >>>>>> Best regards, Adam.
> >>>>>> 
> >>>>>>> So i'm hesitant of what to do. It really sounds like a nice thing to
> >>>>>>> not
> >>>>>>> have custom structures, but on the other hand it's not like they get
> >>>>>>> much
> >>>>>>> in the way of coding.
> >>>>>>> 
> >>>>>>> I'd really appreciate other people's comments on this.
> >>>>>>> 
> >>>>>>> Cheers,
> >>>>>>> 
> >>>>>>>   Albert
> >>>>>>>> 
> >>>>>>>> I'll also attach the detailed comparison, but the gist seems to be
> >>>>>>>> that
> >>>>>>>> if there are significant changes, the run time is reduced but the
> >>>>>>>> memory
> >>>>>>>> usage is increased in the majority of cases. But this does not seem
> >>>>>>>> to
> >>>>>>>> show up in the cumulative results.
> >>>>>>>> 
> >>>>>>>> Best regards, Adam.
> >>>>>>>> 
> >>>>>>>> P.S.: One could try to improve the memory usage by tuning the load
> >>>>>>>> factor or calling shrink_to_fit where appropriate. Would you like
> >>>>>>>> me
> >>>>>>>> to
> >>>>>>>> try to do this?
> >>>>>>>> 
> >>>>>>>> P.P.S.: One obvious area for improvement would be better
> >>>>>>>> interoperability between GooString and std::string, i.e. not
> >>>>>>>> converting
> >>>>>>>> them as C strings so that the length information does not need to
> >>>>>>>> be
> >>>>>>>> recomputed. I will try to prepare this as a separate patch on top
> >>>>>>>> of
> >>>>>>>> this one or should I include this here?
> >>>>>>>> 
> >>>>>>>> Best regards, Adam.
> >>>>>>>> 
> >>>>>>>>> Concerning code size, a release build of libpoppler.so goes from
> >>>>>>>>> 
> >>>>>>>>>    text    data     bss     dec     hex filename
> >>>>>>>>> 
> >>>>>>>>> 2464034  288852     360 2753246  2a02de libpoppler.so.73.0.0
> >>>>>>>>> 
> >>>>>>>>> for the current master to
> >>>>>>>>> 
> >>>>>>>>>    text    data     bss     dec     hex filename
> >>>>>>>>> 
> >>>>>>>>> 2482129  288756     360 2771245  2a492d libpoppler.so.73.0.0
> >>>>>>>>> 
> >>>>>>>>> with the patch applied, i.e. a 0.65% increase in binary size.
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> Please note that in my original message, I was referring only to
> >>>>>>>>> source
> >>>>>>>>> code size, i.e.
> >>>>>>>>> 
> >>>>>>>>> git diff --stat master...remove-goo-hash
> >>>>>>>>> 
> >>>>>>>>>  18 files changed, 168 insertions(+), 803 deletions(-)
> >>>>>>>>>  
> >>>>>>>>>> Cheers,
> >>>>>>>>>> 
> >>>>>>>>>>   Albert
> >>>>>>>>> 
> >>>>>>>>> Best regards, Adam.
> >>>>>>>>> 
> >>>>>>>>>>> Best regards, Adam.
> >>>>>>>>>> 
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> poppler mailing list
> >>>>>>>>>> poppler at lists.freedesktop.org
> >>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>>>>>>> 
> >>>>>>>>> _______________________________________________
> >>>>>>>>> poppler mailing list
> >>>>>>>>> poppler at lists.freedesktop.org
> >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>>>>> 
> >>>>>>> _______________________________________________
> >>>>>>> poppler mailing list
> >>>>>>> poppler at lists.freedesktop.org
> >>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>>> 
> >>>>> _______________________________________________
> >>>>> poppler mailing list
> >>>>> poppler at lists.freedesktop.org
> >>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >>>> 
> >>>> _______________________________________________
> >>>> poppler mailing list
> >>>> poppler at lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> >> 
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/poppler
> > 
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/poppler






More information about the poppler mailing list