[poppler] Extract title from pdf file.

Alec Taylor alec.taylor6 at gmail.com
Wed Nov 9 13:12:09 PST 2011


Which cues are we looking for?

Font size? Typeface? Position on the page? Whether the previous/next page is a copyright page?
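
To make the question concrete, here is a rough sketch of how such cues might be turned into a score, following the per-sentence scoring idea and the "first few pages only" suggestion quoted further down. It is not Peter's method and it does not use any poppler API: the `Line` record, the feature names, and the hand-picked `WEIGHTS` are all hypothetical stand-ins (a trained model would supply real weights), and the text lines with font size and position are assumed to have been extracted already by some other tool.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """One extracted text line (hypothetical record, not a poppler type)."""
    text: str
    font_size: float    # in points
    page: int           # 0-based page index
    y_frac: float       # 0.0 = top of page, 1.0 = bottom
    bold: bool = False


# Crude URL filter, standing in for the dictionary-based filtering.
URL_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)

# Assumed, hand-picked weights; a learned model would replace these.
WEIGHTS = {"size": 1.0, "top": 0.5, "bold": 0.3, "length": 0.2}


def features(line, max_size):
    """Map a line to a few numeric/boolean properties in [0, 1]."""
    words = len(line.text.split())
    return {
        "size": line.font_size / max_size,
        "top": 1.0 - line.y_frac,
        "bold": 1.0 if line.bold else 0.0,
        "length": 1.0 if 2 <= words <= 15 else 0.0,
    }


def score(line, max_size):
    """Weighted sum of the features above."""
    f = features(line, max_size)
    return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)


def guess_title(lines, max_pages=2):
    """Return the best-scoring line on the first max_pages pages, or None."""
    candidates = [
        l for l in lines
        if l.page < max_pages and l.text.strip() and not URL_RE.search(l.text)
    ]
    if not candidates:
        return None
    # Fall back to 1.0 so an all-zero font size cannot divide by zero.
    max_size = max(l.font_size for l in candidates) or 1.0
    return max(candidates, key=lambda l: score(l, max_size)).text.strip()


lines = [
    Line("http://example.org/report.pdf", 30.0, 0, 0.05),
    Line("Extracting Titles from Scanned Books", 24.0, 0, 0.20, bold=True),
    Line("Copyright 2011 Example Press", 9.0, 1, 0.90),
    Line("Chapter 1", 18.0, 3, 0.10, bold=True),
]
print(guess_title(lines))  # prints "Extracting Titles from Scanned Books"
```

The copyright-page cue is left out here; it could be added as one more boolean feature once there is a way to detect such pages.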

2011/11/10 Josh Richardson <jric at chegg.com>:
> The machine-learning approach seems like a good idea for finding section
> headings, and maybe the title too.  For finding the document title, you
> might want to look at only the first, or maybe first few pages, rather
> than every sentence in the document?
>
> --josh
>
> On 11/9/11 9:18 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>
>>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
>>> Hi
>>>
>>>> Describe your method!
>>>
>>> - for every sentence of text, get some numeric or boolean properties,
>>> like font, layout and character distribution.
>>> - use a machine learning algorithm to build a formula that maps those
>>> properties to a score
>>> - for every document, select the sentence with the greatest score.
>>> Filter out some sentences based on a dictionary (like URLs, etc.)
>>>
>>> machine with 15 properties works reasonably well
>>
>>Doesn't sound very accurate... I can get ~98.5% accuracy. What
>>are your initial estimates?
>>
>>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
>>>> Hi Peter,
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Alec Taylor
>>>>
>>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
>>>> > Hi!
>>>> >
>>>> > We use an approach based on character properties to extract a
>>>> > meaningful title from the document text. The metadata usually just
>>>> > stores the filename in the title field.
>>>> >
>>>> > --
>>>> > Peter
>>>> >
>>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
>>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org> wrote:
>>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
>>>> >> >> Incorrect, all getDocInfo tells you is what the meta info says;
>>>> >> >> it doesn't analyse the actual document, whereas my pdftopdf will
>>>> >> >> update the metadata with the appropriate info after PDF analysis
>>>> >> >
>>>> >> > Please do not top-post; it makes reading e-mail incredibly hard.
>>>> >> >
>>>> >> > And no, it is not incorrect: if the metadata does not have a title,
>>>> >> > then the document does not have a title as defined by the spec.
>>>> >> >
>>>> >> > Albert
>>>> >>
>>>> >> But maybe the document doesn't have a title because it was produced
>>>> >> by scanning a book and then OCRing it. So what I will facilitate is
>>>> >> the generation of proper metadata (+ more) for an existing PDF
>>>> >> lacking it.
>>>> >>
>>>> >> So if the document does have a title, my pdftopdf tool will find it,
>>>> >> and add it to the metadata.
>>>> >>
>>>> >> I will contribute pdftopdf to poppler.
>>>> >> _______________________________________________
>>>> >> poppler mailing list
>>>> >> poppler at lists.freedesktop.org
>>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>>> --
>>> Peter Kerzum
>>> Search Platform Development Group
>>> St. Petersburg, tel. 8508
>>>
>
>

