[poppler] Extract title from pdf file.

Peter A. Kerzum kerzum at yandex-team.ru
Thu Nov 10 07:03:44 PST 2011


Alec,

I'd like to open-source this feature, but I need to arrange that first,
since the code is not mine.

On Thursday 10 November 2011 00:12:09 Alec Taylor wrote:
> What are we looking for?
> 
> Size? - Font? - Position? 

Yes: size, face, bold / italic, UPPER or Title case, and the number of
alpha / non-alpha symbols.
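
To make the list above concrete, here is a minimal sketch of how such per-line properties might be computed. It is an illustration only, not the code discussed in this thread: the `TextSpan` container, its field names, and `span_features` are hypothetical, and the font information is assumed to have been obtained elsewhere.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """Hypothetical container: one sentence or line plus its font info."""
    text: str
    font_size: float
    font_name: str
    bold: bool = False
    italic: bool = False


def span_features(span: TextSpan) -> dict:
    """Turn one span into numeric/boolean properties like those listed above."""
    # Count letters, and count everything that is neither a letter nor whitespace.
    alpha = sum(ch.isalpha() for ch in span.text)
    non_alpha = sum(not ch.isalpha() and not ch.isspace() for ch in span.text)
    return {
        "font_size": span.font_size,
        "font_name": span.font_name,
        "bold": span.bold,
        "italic": span.italic,
        "is_upper": span.text.isupper(),   # UPPER CASE
        "is_title": span.text.istitle(),   # Title Case
        "alpha_count": alpha,
        "non_alpha_count": non_alpha,
    }
```

Note that `str.istitle()` is strict: a line containing an acronym such as "PDF" does not count as title case, so a real feature set would probably need a looser test.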

> - Previous/next page is copyright?
> 2011/11/10 Josh Richardson <jric at chegg.com>:
> > The machine-learning approach seems like a good idea for finding section
> > headings, and maybe the title too.  For finding the document title, you
> > might want to look at only the first, or maybe first few pages, rather
> > than every sentence in the document?
> > 
> > --josh
> > 
> > On 11/9/11 9:18 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
> >>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <kerzum at yandex-team.ru>
> >>wrote:
> >>> Hi
> >>> 
> >>>> Describe your method!
> >>> 
> >>> - for every sentence of text, get some numeric or boolean properties,
> >>>   like font, layout and character distribution.
> >>> - use a machine learning algorithm to build a formula that maps those
> >>>   properties to a score
> >>> - for every document, select the sentence with the greatest score.
> >>>   Filter out some sentences based on a dictionary (like URLs, etc.)
> >>> 
> >>> A machine with 15 properties works reasonably well.
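
The steps quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the weights in `WEIGHTS` are made-up placeholders standing in for whatever formula a learning algorithm would produce, a simple linear score is assumed, and the "dictionary" filter is reduced to a URL pattern. All names here are hypothetical.

```python
import re

# Hypothetical weights; in the described method they would come from training.
WEIGHTS = {
    "font_size": 0.5,
    "bold": 2.0,
    "is_title": 1.0,
    "non_alpha_ratio": -3.0,
}

# Stand-in for the dictionary filter: reject anything that looks like a URL.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def score(features: dict) -> float:
    """Linear formula mapping properties to a score (assumed model shape)."""
    return sum(w * float(features.get(name, 0.0)) for name, w in WEIGHTS.items())


def is_filtered(text: str) -> bool:
    """True for candidates that are empty or contain a URL."""
    return not text.strip() or URL_RE.search(text) is not None


def pick_title(candidates):
    """candidates: iterable of (text, features) pairs.

    Returns the text with the greatest score, or None if nothing survives
    the filter.
    """
    kept = [(score(feats), text) for text, feats in candidates
            if not is_filtered(text)]
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]
```

Following the suggestion elsewhere in this thread, the candidate list could be limited to sentences from the first page or first few pages before calling `pick_title`.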
> >>
> >>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
> >>are your initial estimations?
> >>
> >>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
> >>>> Hi Peter,
> >>>> 
> >>>> 
> >>>> Cheers,
> >>>> 
> >>>> Alec Taylor
> >>>> 
> >>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum
> >>>> <kerzum at yandex-team.ru> wrote:
> >>>> > Hi!
> >>>> > 
> >>>> > We use an approach based on character properties to extract a
> >>>> > meaningful title from the document text. Metadata usually stores
> >>>> > the filename in the title field.
> >>>> > 
> >>>> > --
> >>>> > Peter
> >>>> > 
> >>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
> >>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org>
> >>>> >> wrote:
> >>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
> >>>> >> >> Incorrect, all getDocInfo tells you is what the meta info says, it
> >>>> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
> >>>> >> >> the metadata with the appropriate info after PDF analysis
> >>>> >> > 
> >>>> >> > Please do not top post, makes reading e-mail incredibly hard.
> >>>> >> > 
> >>>> >> > And no it is not incorrect, if the metadata does not have a
> >>>> >> > title, then the document does not have a title as defined per
> >>>> >> > the spec.
> >>>> >> > 
> >>>> >> > Albert
> >>>> >> 
> >>>> >> But maybe the document doesn't have a title, because it was grabbed
> >>>> >> from scanning the book, then OCRing it. So what I will facilitate
> >>>> >> is the generation of proper metadata (+ more) from a current PDF
> >>>> >> lacking such.
> >>>> >> 
> >>>> >> So if the document does have a title, my pdftopdf tool will find
> >>>> >> it, and add it to the metadata.
> >>>> >> 
> >>>> >> I will contribute pdftopdf to poppler.
> >>>> >> _______________________________________________
> >>>> >> poppler mailing list
> >>>> >> poppler at lists.freedesktop.org
> >>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> >>> 
> >>> --
> >>> Peter Kerzum
> >>> Search Platform Development Group
> >>> St. Petersburg, tel. 8508

-- 
Peter Kerzum
Search Platform Development Group
St. Petersburg, tel. 8508

