[poppler] Extract title from pdf file.

Peter A. Kerzum kerzum at yandex-team.ru
Wed Nov 9 07:50:46 PST 2011


Hi

> Describe your method!

- For every sentence of the text, compute some numeric or boolean properties,
such as font, layout and character distribution.
- Use a machine learning algorithm to build a formula that maps those
properties to a score.
- For every document, select the sentence with the greatest score. Filter out
some sentences based on a dictionary (URLs, etc.).

A model trained on 15 such properties works reasonably well.
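
To make the three steps concrete, here is a rough Python sketch. It is not our
production code: the Sentence fields, the five properties and the hand-set
WEIGHTS are all made up for illustration (the real model uses about 15
properties and learned coefficients), and the URL regex merely stands in for
the dictionary-based filter.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentence:
    # Hypothetical per-sentence data; a real extractor would fill these
    # in from the PDF text layer.
    text: str
    font_size: float    # in points
    is_bold: bool
    page_offset: float  # 0.0 = top of first page, 1.0 = bottom


# Stand-in for the dictionary filter: reject anything that looks like a URL.
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)

# Illustrative weights only; in practice these come out of training.
WEIGHTS = [3.0, 1.0, 2.0, 0.5, 1.0]


def features(s: Sentence, max_font_size: float) -> List[float]:
    """Map one sentence to numeric/boolean properties."""
    letters = [c for c in s.text if c.isalpha()]
    upper_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    word_count = len(s.text.split())
    return [
        s.font_size / max_font_size,            # relative font size
        1.0 if s.is_bold else 0.0,              # font weight
        1.0 - s.page_offset,                    # layout: closer to top is better
        upper_ratio,                            # character distribution
        1.0 if 2 <= word_count <= 20 else 0.0,  # plausible title length
    ]


def score(s: Sentence, max_font_size: float) -> float:
    """Linear formula mapping properties to a score."""
    return sum(w * f for w, f in zip(WEIGHTS, features(s, max_font_size)))


def is_filtered(s: Sentence) -> bool:
    """Drop URLs and sentences that contain no letters at all."""
    return bool(URL_RE.search(s.text)) or not any(c.isalpha() for c in s.text)


def extract_title(sentences: List[Sentence]) -> Optional[str]:
    """Return the text of the best-scoring sentence, or None."""
    candidates = [s for s in sentences if not is_filtered(s)]
    if not candidates:
        return None
    max_font = max(s.font_size for s in sentences) or 1.0
    return max(candidates, key=lambda s: score(s, max_font)).text


doc = [
    Sentence("http://example.org/paper.pdf", 24.0, True, 0.0),
    Sentence("A Study of Title Extraction", 18.0, True, 0.1),
    Sentence("This is the body text of the document, set in a small font.",
             10.0, False, 0.5),
]
print(extract_title(doc))
```

A linear score is only the simplest possible choice here; any classifier or
ranker that outputs a number per sentence would slot into score() the same way.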

On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> Hi Peter,
> 
> 
> Cheers,
> 
> Alec Taylor
> 
> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
> > Hi!
> > 
> > We use an approach based on character properties to extract a meaningful
> > title from the document text. Metadata usually stores the filename in the
> > title field.
> > 
> > --
> > Peter
> > 
> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org> wrote:
> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> >> >> Incorrect, all getDocInfo tells you is what the meta info says, it
> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
> >> >> the metadata with the appropriate info after PDF analysis
> >> > 
> >> > Please do not top post, makes reading e-mail incredibly hard.
> >> > 
> >> > And no it is not incorrect, if the metadata does not have a title,
> >> > then the document does not have a title as defined per the spec.
> >> > 
> >> > Albert
> >> 
> >> But maybe the document doesn't have a title, because it was grabbed
> >> from scanning the book, then OCRing it. So what I will facilitate is
> >> the generation of proper metadata (+ more) from a current PDF lacking
> >> such.
> >> 
> >> So if the document does have a title, my pdftopdf tool will find it,
> >> and add it to the metadata.
> >> 
> >> I will contribute pdftopdf to poppler.
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/poppler

-- 
Peter Kerzum
Search Platform Development Group
St. Petersburg, tel. 8508
