[poppler] [RFC] Replace GooHash by std::unordered_map
Albert Astals Cid
aacid at kde.org
Sat Mar 3 19:29:51 UTC 2018
Ha! didn't realize you had already done all that, will read/answer the other
email.
Cheers,
Albert
El dissabte, 3 de març de 2018, a les 20:25:32 CET, Albert Astals Cid va
escriure:
> El dimecres, 21 de febrer de 2018, a les 7:13:57 CET, Adam Reichold va
>
> escriure:
> > Hello again,
> >
> > Am 21.02.2018 um 00:31 schrieb Albert Astals Cid:
> > > El dimarts, 20 de febrer de 2018, a les 8:58:24 CET, Adam Reichold va
> > >
> > > escriure:
> > >> Hello again,
> > >>
> > >> Am 18.02.2018 um 23:23 schrieb Adam Reichold:
> > >>> Am 18.02.2018 um 23:08 schrieb Albert Astals Cid:
> > >>>> El diumenge, 18 de febrer de 2018, a les 16:55:37 CET, Adam Reichold
> > >>>> va
> > >>>>
> > >>>> escriure:
> > >>>>> Hello,
> > >>>>>
> > >>>>> the attached patch replaced Poppler's internal hash table
> > >>>>> implementation
> > >>>>> GooHash by std::unordered_map found in the C++ standard library
> > >>>>> since
> > >>>>> C++11. This continues Poppler's slow drift towards standard library
> > >>>>> containers and removes one of the home-grown data structures with
> > >>>>> the
> > >>>>> main goals of reducing code size and improving the long term
> > >>>>> maintainability of the code base.
> > >>>>
> > >>>> Do you have any benchmarks on "rendering" speed and code size?
> > >>>
> > >>> Sorry, not at hand. I will try to prepare them during the week.
> > >>
> > >> I did run Splash rendering benchmarks of this branch against master
> > >> with
> > >> the result of rendering the circa 2400 documents of the TeXLive
> > >
> > >> documentation present on my machine being:
> > > I'm wondering if those 2400 documents are diverse enough, which they may
> > > not be given they are all coming from "the same place".
> >
> > They seem pretty diverse w.r.t. content, some being text heavy and some
> > graphics rich. But I guess they are definitely not diverse technically
> > as all are prepared using TeX itself.
> >
> > The main problem on my side is that I have failed to find my DVD copy of
> > the Poppler regtest document collection until now. :-\ In any case, I am
> > reasonably confident on the zero sum result since GooHash does not seem
> > to live in any performance critical code path.
> >
> > >> Cumulative run time:
> > >> Result: 90.95 min ∓ 1.1 %
> > >> Reference: 91.57 min ∓ 1.2 %
> > >> Deviation: -0.0 %
> > >>
> > >> Cumulative memory usage:
> > >> Result: 37.2 MB ∓ 0.7 %
> > >> Reference: 37.0 MB ∓ 0.7 %
> > >> Deviation: +0.0 %
> > >>
> > >> (Where result is this patch and the reference is master.) (The
> > >> measurement was taken using the perftest script which I proposed here
> > >> some time ago for which I'll attach the patch again for
> > >> reproduceability.)
> > >
> > > Did you really attach this before? i really don't remember it :D
> >
> > This was a very long-winded thread ending on 2016-01-01 centered around
> > the regtest framework. It went from custom driver using QTest's
> > benchmark functionality to custom backend in the regtest framework to
> > separate Python script. The script still has problems to reliably
> > measure really short things like running "pdftotext" for which a custom
> > C++ driver might be needed, but for long-running stuff like "pdftoppm",
> > the results seem reasonable.
> >
> > > Anyhow it seems the change is mostly a zero-sum runtime wise.
> >
> > Indeed. (Although thinking really really long term, I think a Poppler
> > that is using std::vector, std::string and friends everywhere, could
> > loose a lot of distributed fat in the form of a lot of memory
> > indirections and benefit from the highly optimized standard library. But
> > it just is not a single patch thing to reap any of these benefits.)
> >
> > > Which bring us to the code side. First i know it sucks but your spacing
> > > is
> > > broken, please check the whole patch
> > >
> > > - nameToGID->removeInt(macGlyphNames[j]);
> > > - nameToGID->add(new GooString(macGlyphNames[j]), i);
> > > + nameToGID[macGlyphNames[j]] = i;
> > >
> > > i could fix it, but it's better (for me) if you do :D
> >
> > Definitely for me to fix. The main problem is that the spacing in those
> > files was already broken and I am unsure whether I should fix the
> > surrounding spacing even if I am not touching it otherwise. Due to that,
> > there sometimes is no correct way in the current patch. If you do not
> > say otherwise, I will try to make at least the touched blocks of code
> > consistent.
>
> Are you sure the spacing is broken? I'd say it's just xpdf weird spacing
> rules of using 2 soft spaces and then hard tabs at 8.
>
> > > Now onto the code,
> > >
> > > const auto emplaceRangeMap = [&](const char* encodingName, GBool
> > > unicodeOut,>
> > >
> > > UnicodeMapRange* ranges, int len) {
> > >
> > > residentUnicodeMaps.emplace(
> > >
> > > std::piecewise_construct,
> > > std::forward_as_tuple(encodingName),
> > > std::forward_as_tuple(encodingName, unicodeOut, ranges, len)
> > >
> > > );
> > >
> > > };
> > >
> > > It's the first time in my life i see std::piecewise_construct and
> > > std::forward_as_tuple, yes that probably makes me a bad C++ developer,
> > > but
> > > given there's lots of us around, it doesn't make me happy now i don't
> > > understand what the code does.
> >
> > The problem is that most internal Poppler types lack proper construction
> > and assignment operators, hence I need to work harder to construct those
> > UnicodeMap instances in-place inside the map (GooHash just stored a
> > pointer to it, so there was no problem.)
> >
> > The whole piecewise_construct and forward_as_tuple dance just ensures,
> > that those parameters to emplace are properly grouped before being
> > unpacked to become the parameters of std::string and UnicodeMap
> > construction again. If UnicodeMap was movable, this would probably look
> > like
> >
> > residentUnicodeMaps.emplace(
> >
> > encodingName,
> > UnicodeMap{encodingName, unicodeOut, ranges, len}
> >
> > );
> >
> > If you like, I can try to make Unicode a move-only type and simplify the
> > mentioned helper functions?
>
> you can give it a try, not sure how easy that's going to be.
>
> > > Then there's the benefit/risk ratio. The code using GooHash is mostly
> > > rocksolid and we haven't probably touched any line in it for a long time
> > > and we have probably neither written new code that uses GooHash.
> > >
> > > In that regard we risk breaking working code.
> > >
> > > The benefit is not more speed nor less memory usage as your measurements
> > > show.
> > >
> > > Mostly the benefit is "removing code from maintainership", which i agree
> > > is a good thing, but as mentioned before it's some code "we mostly
> > > ignore", so it's not that much of a good thing.
> >
> > I very much agree with the risk assessment.
> >
> > But I also think the code will ossify (or maybe already is?) due to
> > those custom data structures and the less than idiomatic C++ usage.
> > Hence I think, Poppler would not just loose code, but the remaining code
> > should become easier to maintain. (Of course, the piecewise_construct
> > fiasco shows that this has intermediate costs. But again, I think this
> > is just an incentive to provide types with the usual C++ semantics which
> > should make all code using those types better.)
>
> Yeah, i guess you're right.
>
> Cheers,
> Albert
>
> > Best regards, Adam.
> >
> > > So i'm hesitant of what to do. It really sounds like a nice thing to not
> > > have custom structures, but on the other hand it's not like they get
> > > much
> > > in the way of coding.
> > >
> > > I'd really appreciate other people's comments on this.
> > >
> > > Cheers,
> > >
> > > Albert
> > >>
> > >> I'll also attach the detailed comparison, but the gist seems to be that
> > >> if there are significant changes, the run time is reduced but the
> > >> memory
> > >> usage is increased in the majority of cases. But this does not seem to
> > >> show up in the cumulative results.
> > >>
> > >> Best regards, Adam.
> > >>
> > >> P.S.: One could try to improve the memory usage by tuning the load
> > >> factor or calling shrink_to_fit where appropriate. Would you like me to
> > >> try to do this?
> > >>
> > >> P.P.S.: One obvious area for improvement would be better
> > >> interoperability between GooString and std::string, i.e. not converting
> > >> them as C strings so that the length information does not need to be
> > >> recomputed. I will try to prepare this as a separate patch on top of
> > >> this one or should I include this here?
> > >>
> > >> Best regards, Adam.
> > >>
> > >>> Concerning code size, a release build of libpoppler.so goes from
> > >>>
> > >>> text data bss dec hex filename
> > >>>
> > >>> 2464034 288852 360 2753246 2a02de libpoppler.so.73.0.0
> > >>>
> > >>> for the current master to
> > >>>
> > >>> text data bss dec hex filename
> > >>>
> > >>> 2482129 288756 360 2771245 2a492d libpoppler.so.73.0.0
> > >>>
> > >>> with the patch applied, i.e. a 0.65% increase in binary size.
> > >>>
> > >>>
> > >>> Please note that in my original message, I was referring only to
> > >>> source
> > >>> code size, i.e.
> > >>>
> > >>> git diff --stat master...remove-goo-hash
> > >>>
> > >>> 18 files changed, 168 insertions(+), 803 deletions(-)
> > >>>
> > >>>> Cheers,
> > >>>>
> > >>>> Albert
> > >>>
> > >>> Best regards, Adam.
> > >>>
> > >>>>> Best regards, Adam.
> > >>>>
> > >>>> _______________________________________________
> > >>>> poppler mailing list
> > >>>> poppler at lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/poppler
> > >>>
> > >>> _______________________________________________
> > >>> poppler mailing list
> > >>> poppler at lists.freedesktop.org
> > >>> https://lists.freedesktop.org/mailman/listinfo/poppler
> > >
> > > _______________________________________________
> > > poppler mailing list
> > > poppler at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/poppler
>
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
More information about the poppler
mailing list