[poppler] Find using RegEx on top of xpdf-3.03

wwwacky at free.fr wwwacky at free.fr
Fri Oct 7 05:48:47 PDT 2011


Quoting Albert Astals Cid <aacid at kde.org>:

> On Thursday, 6 October 2011, wwwacky at free.fr wrote:
> > Quoting Albert Astals Cid <aacid at kde.org>:
> > > On Wednesday, 5 October 2011, wwwacky at free.fr wrote:
> > > > Dear all,
> > >
> > > Hi
> > >
> > > > for some months I have needed a regex find to dig into huge
> > > > pdf docs. Please find attached a patch that implements this feature
> > > > on top of xpdf-3.03. It supports ASCII only, with backward and
> > > > case-sensitive searches (the word-only check-box no longer has any
> > > > effect). The xpdf MMI hasn't been modified, so you can only
> > > > perform regex searches with this patch! I saw that xpdf-3.03 is
> > > > being merged into Poppler. Hope that it could help with a review :)
> > > > Let me know if you are interested in this patch so that I can help
> > > > to merge it into Poppler.
> > >
> > > We still have not merged xpdf-3.03 and it will probably still take a
> > > while, but anyway I am not sure ASCII-only is a good idea. Why that
> > > limitation?
> > >
> > > Albert
> >
> > Hi,
> >
> > In fact, this basic implementation relies on the POSIX regex functions regcomp,
> > regexec, regerror and regfree. These functions take char strings, not
> > Unicode strings, as input. Thus, ASCII control chars and ASCII printable
> > chars can be matched. Supporting Unicode-compatible regex search is much
> > heavier to implement and out of my scope for the time being. I would like to
> > support much more, but I foresee a huge effort to gain Unicode. Moreover,
> > ASCII covers 99% of my needs in terms of searching English data-sheets :)
>
> Sure, it might match your needs, but if you contribute it to poppler, people
> will start demanding that it works with non ASCII characters and you will
> probably not be here anymore and the burden will be on our side.
>
> Albert
>
> > I know that this patch has some weaknesses but I think it would be great to
> > get regex search in some applications such as Evince, whose GUI is, in my
> > opinion, smarter than xpdf's.
> >
> > Best regards
> > Jerry
> >
> > PS: Sorry for my poor English and my clumsy proposal :)
> >
> > > > Best regards
> > > > Jerry
> > >
> > > _______________________________________________
> > > poppler mailing list
> > > poppler at lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/poppler
>

Hi Albert,

To be more precise, the patch also supports extended ASCII 0x80-0xFF, as well as
control chars 0x01-0x1F and 0x7F and printable chars 0x20-0x7E. This means that
on my Ubuntu 10.04 I can input and find ASCII and all ISO Latin-1 chars
(ISO-8859-1) such as e acute 'é', a grave 'à' and so on. Other extended ASCII
sets are supported according to your computer configuration and keyboard
settings.

I think that it covers not only my needs but also those of most EMEA users.
RegEx search is a well-known, long-established feature of many editors and
scripting languages. This patch brings this powerful feature to xpdf, and it
would be a totally new one for Poppler. Supporting only 1-byte charset
encodings is more a restriction for APAC users than a bug.

For instance, imagine that you are searching for a sentence beginning with
"The ", followed by any word (or words) and then by " is": you just have to type
the regex "The .* is" in the find dialog box. Only regex offers this
possibility, and the combinations are nearly infinite.
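
As a rough illustration of that search, here is a minimal sketch built on the
POSIX functions regcomp, regexec, regerror and regfree mentioned earlier in the
thread. It is not the code from the patch: the helper name regex_find, its
parameters and its return convention are assumptions made for this example,
and it works on a plain char string rather than on xpdf's page text.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not taken from the patch): look for the first match
 * of the POSIX extended regex `pattern` in the byte string `text`.
 * Returns 1 and stores the match as byte offsets in *start / *end
 * (end is exclusive), 0 if nothing matches, -1 if the pattern is invalid. */
int regex_find(const char *pattern, const char *text, int ignore_case,
               size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int flags = REG_EXTENDED | (ignore_case ? REG_ICASE : 0);

    /* Compile the pattern; report the reason if it is malformed. */
    int rc = regcomp(&re, pattern, flags);
    if (rc != 0) {
        char msg[128];
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "bad regex '%s': %s\n", pattern, msg);
        return -1;
    }

    /* Run the compiled regex once and keep only the overall match. */
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (size_t)m[0].rm_so;
        *end = (size_t)m[0].rm_eo;
    }

    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Called as regex_find("The .* is", text, 0, &start, &end), it gives the byte
range of the first match. Note that ".*" is greedy, so the match runs up to
the last " is" in the string that is searched.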

Maybe I could send you my modified xpdf binary so that you can test it?

With best regards,
Jerry

