[poppler] Extract title from pdf file.
Peter A. Kerzum
kerzum at yandex-team.ru
Wed Nov 9 07:50:46 PST 2011
Hi
> Describe your method!
- For every sentence of the text, compute some numeric or boolean properties,
such as font, layout and character distribution.
- Use a machine learning algorithm to build a formula that maps those
properties to a score.
- For every document, select the sentence with the greatest score. Filter out
some sentences based on a dictionary (URLs, etc.).
A machine with 15 properties works reasonably well.
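
A rough Python sketch of the steps above, for illustration only. The
Sentence record, the five properties, the weights and the stop phrases are
all assumptions made up for this example; the real set of 15 properties is
not listed here, and real weights would come out of the training step
rather than being written by hand.

```python
import re
from dataclasses import dataclass


# Hypothetical per-sentence record; field names are assumptions.
@dataclass
class Sentence:
    text: str
    font_size: float   # font property, in points
    y_position: float  # layout property: 0.0 = top of page, 1.0 = bottom
    page: int          # zero-based page index


# Assumed weights, hand-picked for the example (not learned).
WEIGHTS = {
    "font_size": 0.5,
    "near_top": 2.0,
    "first_page": 1.5,
    "upper_ratio": 1.0,
    "digit_ratio": -2.0,
}

# Dictionary-style filter: URLs and a few assumed stop phrases.
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)
STOP_PHRASES = {"abstract", "table of contents"}


def features(s):
    """Map a sentence to numeric properties (font, layout, characters)."""
    letters = [c for c in s.text if c.isalpha()]
    return {
        "font_size": s.font_size,
        "near_top": 1.0 - s.y_position,
        "first_page": 1.0 if s.page == 0 else 0.0,
        "upper_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "digit_ratio": sum(c.isdigit() for c in s.text) / max(len(s.text), 1),
    }


def score(s):
    """Linear formula mapping the properties to a single score."""
    f = features(s)
    return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)


def is_filtered(s):
    """Reject empty sentences, stop phrases and anything containing a URL."""
    t = s.text.strip().lower()
    return not t or t in STOP_PHRASES or URL_RE.search(t) is not None


def pick_title(sentences):
    """Return the text of the best-scoring unfiltered sentence, or None."""
    candidates = [s for s in sentences if not is_filtered(s)]
    if not candidates:
        return None
    return max(candidates, key=score).text
```

The filter runs before scoring so that a URL printed in a large font at the
top of the first page cannot win.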
On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> Hi Peter,
>
>
> Cheers,
>
> Alec Taylor
>
> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <kerzum at yandex-team.ru>
wrote:
> > Hi!
> >
> > We use an approach based on character properties to extract a meaningful
> > title from the document text. Metadata usually stores the filename in the
> > title field.
> >
> > --
> > Peter
> >
> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org> wrote:
> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> >> >> Incorrect, all getDocInfo tells you is what the meta info says, it
> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
> >> >> the metadata with the appropriate info after PDF analysis
> >> >
> >> > Please do not top post, makes reading e-mail incredibly hard.
> >> >
> >> > And no it is not incorrect, if the metadata does not have a title,
> >> > then the document does not have a title as defined per the spec.
> >> >
> >> > Albert
> >>
> >> But maybe the document doesn't have a title, because it was grabbed
> >> from scanning the book, then OCRing it. So what I will facilitate is
> >> the generation of proper metadata (+ more) from a current PDF lacking
> >> such.
> >>
> >> So if the document does have a title, my pdftopdf tool will find it,
> >> and add it to the metadata.
> >>
> >> I will contribute pdftopdf to poppler.
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/poppler
--
Peter Kerzum
Search platform development group
St. Petersburg, tel. 8508