[poppler] Find using RegEx on top of xpdf-3.03

Albert Astals Cid aacid at kde.org
Sat Oct 29 07:16:44 PDT 2011


On Friday, 28 October 2011, wwwacky at free.fr wrote:
> Quoting wwwacky at free.fr:
> > Quoting Albert Astals Cid <aacid at kde.org>:
> > > On Thursday, 6 October 2011, wwwacky at free.fr wrote:
> > > > Quoting Albert Astals Cid <aacid at kde.org>:
> > > > > On Wednesday, 5 October 2011, wwwacky at free.fr wrote:
> > > > > > Dear all,
> > > > > 
> > > > > Hi
> > > > > 
> > > > > > for some months I have needed a regex find to dig
> > > > > > into huge PDF docs. Please find attached a patch
> > > > > > that implements this feature on top of xpdf-3.03. It
> > > > > > supports ASCII only, backward and case-sensitive
> > > > > > searches (the word-only check box no longer has any effect).
> > > > > > The xpdf MMI hasn't been modified, so you can only
> > > > > > perform regex searches with this patch! I saw that
> > > > > > xpdf-3.03 is being merged into Poppler. I hope this could
> > > > > > help with the review :) Let me know if you are
> > > > > > interested in this patch so that I can help to merge it
> > > > > > into Poppler.
> > > > > 
> > > > > We still have not merged xpdf-3.03 and it will probably
> > > > > still take a while, but anyway I am not sure ASCII-only is
> > > > > a good idea. Why that limitation?
> > > > > 
> > > > > Albert
> > > > 
> > > > Hi,
> > > > 
> > > > In fact, this basic implementation relies on the POSIX regex
> > > > functions regcomp, regexec, regerror and regfree. These functions
> > > > take char strings, not Unicode strings, as input. Thus, ASCII
> > > > control chars and ASCII printable chars can be matched. Supporting
> > > > Unicode-compatible regex search is much heavier to implement and
> > > > out of my scope for the time being. I would like to support
> > > > much more, but I foresee a huge effort to gain Unicode. Moreover,
> > > > ASCII matches 99% of my needs in terms of searching English
> > > > data sheets :)
> > > 
> > > Sure, it might match your needs, but if you contribute it to
> > > poppler, people will start demanding that it works with non-ASCII
> > > characters and you will probably not be here anymore and the burden
> > > will be on our side.
> > > 
> > > Albert
> > > 
> > > > I know that this patch has some weaknesses, but I think it would be
> > > > great to get regex search in some applications such as Evince,
> > > > whose GUI is, in my opinion, smarter than xpdf's.
> > > > 
> > > > Best regards
> > > > Jerry
> > > > 
> > > > PS: Sorry for my poor English and my clumsy proposal :)
> > > > 
> > > > > > Best regards
> > > > > > Jerry
> > > > > 
> > > > > _______________________________________________
> > > > > poppler mailing list
> > > > > poppler at lists.freedesktop.org
> > > > > http://lists.freedesktop.org/mailman/listinfo/poppler
> > > > 
> > 
> > Hi Albert,
> > 
> > To be more precise, the patch also supports extended ASCII 0x80-0xFF as
> > well as control chars 0x01-0x1F and ASCII chars 0x20-0x7F (printable
> > chars plus DEL). This means that on my Ubuntu 10.04 I can input and find
> > ASCII and all ISO Latin-1 (iso-8859-1) chars, such as e acute 'é',
> > a grave 'à' and so on. All other extended ASCII sets are supported
> > according to your computer configuration and keyboard settings.
> > 
> > I think that it covers not only my needs but also those of most EMEA
> > users. Regex search is a well-known, long-standing feature of many
> > editors and scripting languages.
> > This patch brings this powerful feature to xpdf, and it would be totally
> > new in Poppler. Supporting only 1-byte charset encodings is more a
> > restriction for APAC users than a bug.
> > 
> > For instance, suppose you are searching for a sentence beginning with
> > "The ", followed by any text and then by " is": you just have to type
> > the regex "The .* is" in the find dialog box. Only regex offers this
> > possibility, and the combinations are almost infinite.
> > 
> > Maybe I could push my modified xpdf binary so that you can test it?
> > 
> > With best regards,
> > Jerry
> 
> Hi All,
> 
> I just saw Marc's mail in the Poppler archive:
> "[poppler] whole word search?"
> It seems that he is almost the only one to mention regex in a PDF search
> engine :(
> 
> It is possible to support regex over Unicode in Poppler (which is
> quite difficult with xpdf).
> But I would like to know whether any Poppler contributors are interested
> in it. In that case, spending my time implementing a clean patch that
> supports Unicode would be more reasonable. Otherwise I will keep it to
> myself ...

We are interested in code that makes sense from a library point of view. A
search that only works with Latin characters does not make much sense; one
that works on Unicode does.

Albert

> 
> Best regards
> Jerry
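
The byte-versus-Unicode point raised in this thread can be illustrated with
a minimal sketch. It is not the patch and does not use the POSIX
regcomp/regexec functions mentioned above; it assumes Python's standard re
module as a stand-in, and the helper names (find_unicode, find_bytes) are
hypothetical. Matching on bytes roughly mimics a char-based engine, where
"." means one byte rather than one character.

```python
import re

# Pattern quoted in the thread: "The ", then any text, then " is".
THE_IS = "The .* is"


def find_unicode(pattern: str, text: str) -> bool:
    """Search on code points (str pattern, str text)."""
    return re.search(pattern, text) is not None


def find_bytes(pattern: str, text: str, encoding: str) -> bool:
    """Search on raw bytes, roughly as a char-based engine sees them."""
    return re.search(pattern.encode(encoding), text.encode(encoding)) is not None


# An ASCII-only pattern still finds its match in UTF-8 encoded text.
ascii_hit = find_bytes(THE_IS, "The café is open", "utf-8")   # True

# "^.$" asks for exactly one character. On bytes, "." is one byte.
latin1_single = find_bytes("^.$", "é", "iso-8859-1")  # True: é is 1 byte
utf8_single = find_bytes("^.$", "é", "utf-8")         # False: é is 2 bytes
unicode_single = find_unicode("^.$", "é")             # True: 1 code point
```

With a 1-byte encoding such as iso-8859-1 the byte-level search behaves as
expected, which matches Jerry's observation; with a multi-byte encoding the
same pattern silently stops matching, which is the library-level concern.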

