[poppler] Extract title from pdf file.

Alec Taylor alec.taylor6 at gmail.com
Wed Nov 9 09:18:21 PST 2011


On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
> Hi
>
>> Describe your method!
>
> - For every sentence of the text, compute some numeric or boolean properties,
> such as font, layout and character distribution.
> - Use a machine learning algorithm to build a formula that maps those
> properties to a score.
> - For every document, select the sentence with the greatest score. Filter out
> some sentences based on a dictionary (URLs, etc.).
>
> A machine with 15 properties works reasonably well.
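
The three steps quoted above can be sketched roughly as follows. This is only
an illustration, not Peter's actual implementation: the Sentence fields, the
five features, the reject pattern and the weights are all hypothetical
stand-ins. In the described approach the weights would come out of a machine
learning algorithm trained on about 15 properties, rather than being set by
hand as they are here.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    """One candidate sentence plus hypothetical layout properties."""
    text: str
    font_size: float   # in points
    is_bold: bool
    page: int          # 0-based page index
    y_fraction: float  # 0.0 = top of the page, 1.0 = bottom


# Stand-in for the dictionary filter: drop URLs and e-mail-like strings.
REJECT = re.compile(r"(https?://|www\.|@)", re.IGNORECASE)

# Hand-picked placeholder weights; a real system would learn these.
WEIGHTS = [3.0, 1.0, 2.0, 1.0, 1.0]
BIAS = 0.0


def features(s, max_font):
    """Map a sentence to numeric/boolean properties (all assumptions)."""
    letters = sum(c.isalpha() for c in s.text)
    length = max(len(s.text), 1)
    return [
        s.font_size / max_font,          # relative font size
        1.0 if s.is_bold else 0.0,       # bold text
        1.0 if s.page == 0 else 0.0,     # appears on the first page
        1.0 - s.y_fraction,              # closer to the top scores higher
        letters / length,                # character distribution
    ]


def score(s, max_font):
    """Linear formula mapping the properties to a single score."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features(s, max_font)))


def pick_title(sentences):
    """Return the text of the best-scoring sentence, or None if none remain."""
    candidates = [s for s in sentences if not REJECT.search(s.text)]
    if not candidates:
        return None
    max_font = max(s.font_size for s in candidates)
    return max(candidates, key=lambda s: score(s, max_font)).text
```

Given a list of Sentence objects built from the extracted text,
pick_title(sentences) returns the text of the highest-scoring candidate that
survives the filter.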

Doesn't sound very accurate... I can get ~98.5% accuracy. What
are your initial estimates?

> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
>> Hi Peter,
>>
>>
>> Cheers,
>>
>> Alec Taylor
>>
>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
>> > Hi!
>> >
>> > We use an approach based on character properties to extract a meaningful
>> > title from the document text. The metadata usually stores the filename in
>> > the title field.
>> >
>> > --
>> > Peter
>> >
>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org> wrote:
>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
>> >> >> Incorrect. All getDocInfo tells you is what the meta info says; it
>> >> >> doesn't analyse the actual document, whereas my pdftopdf will update
>> >> >> the metadata with the appropriate info after PDF analysis.
>> >> >
>> >> > Please do not top-post; it makes reading e-mail incredibly hard.
>> >> >
>> >> > And no, it is not incorrect: if the metadata does not have a title,
>> >> > then the document does not have a title as defined by the spec.
>> >> >
>> >> > Albert
>> >>
>> >> But maybe the metadata doesn't have a title because the document was
>> >> produced by scanning a book and then OCRing it. So what I will
>> >> facilitate is the generation of proper metadata (+ more) for an
>> >> existing PDF that lacks it.
>> >>
>> >> So if the document does have a title, my pdftopdf tool will find it,
>> >> and add it to the metadata.
>> >>
>> >> I will contribute pdftopdf to poppler.
>> >> _______________________________________________
>> >> poppler mailing list
>> >> poppler at lists.freedesktop.org
>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
>
> --
> Peter Kerzum
> Search platform development group
> St. Petersburg, tel. 8508
>

