[poppler] Extract title from pdf file.
Josh Richardson
jric at chegg.com
Wed Nov 9 09:49:29 PST 2011
The machine-learning approach seems like a good idea for finding section
headings, and maybe the title too. For finding the document title, though,
you might want to look at only the first page, or maybe the first few
pages, rather than every sentence in the document?
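
A minimal sketch of that idea, not Peter's actual implementation: score
each text line with a few numeric properties and keep the best one, but
only consider lines from the first couple of pages. The TextLine record,
the feature names, and the hand-picked weights are all assumptions made
for illustration; a real system would learn the weights from labelled
documents and would get font and layout data from its own PDF text
extraction step.

```python
import re
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical record for one extracted line of text."""
    text: str
    page: int          # 0-based page index
    font_size: float   # in points
    y_frac: float      # vertical position: 0.0 = top of page, 1.0 = bottom


# Crude "dictionary" filter: drop lines that contain a URL.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

# Placeholder weights chosen by hand; stand-ins for a learned model.
WEIGHTS = {
    "font_size": 1.0,
    "near_top": 5.0,
    "alpha_ratio": 3.0,
    "length_ok": 2.0,
}


def features(line):
    """Map a line to a small dict of numeric properties."""
    text = line.text.strip()
    alpha = sum(c.isalpha() for c in text)
    return {
        "font_size": line.font_size,
        "near_top": 1.0 - line.y_frac,
        "alpha_ratio": alpha / len(text) if text else 0.0,
        "length_ok": 1.0 if 5 <= len(text) <= 150 else 0.0,
    }


def score(line):
    """Weighted sum of the line's properties."""
    f = features(line)
    return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)


def guess_title(lines, max_pages=2):
    """Return the best-scoring line from the first max_pages pages, or None."""
    candidates = [
        line for line in lines
        if line.page < max_pages
        and line.text.strip()
        and not URL_RE.search(line.text)
    ]
    if not candidates:
        return None
    return max(candidates, key=score).text.strip()
```

Restricting candidates to the first pages also keeps a large heading
deep in the document from outscoring the real title.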
--josh
On 11/9/11 9:18 AM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>On Thu, Nov 10, 2011 at 2:50 AM, Peter A. Kerzum <kerzum at yandex-team.ru>
>wrote:
>> Hi
>>
>>> Describe your method!
>>
>> - for every sentence of text get some numeric or boolean properties, like
>>   font, layout and character distribution.
>> - use machine learning algorithm to build formula that maps those
>>   properties to score
>> - for every document select the sentence with the greatest score. Filter
>>   out some sentences, based on dictionary (like urls, etc)
>>
>> machine with 15 properties works reasonably well
>
>Doesn't sound very accurate... I can bring out ~98.5% accuracy. What
>are your initial estimations?
>
>> On Wednesday 09 November 2011 17:53:27 Alec Taylor wrote:
>>> Hi Peter,
>>>
>>>
>>> Cheers,
>>>
>>> Alec Taylor
>>>
>>> On Thu, Nov 10, 2011 at 1:51 AM, Peter A. Kerzum <kerzum at yandex-team.ru> wrote:
>>> > Hi!
>>> >
>>> > We use some approach based on character properties to extract
>>> > meaningful title from document text. Metadata usually stores filename
>>> > in title field.
>>> >
>>> > --
>>> > Peter
>>> >
>>> > On Wednesday 09 November 2011 16:16:14 Alec Taylor wrote:
>>> >> On Wed, Nov 9, 2011 at 10:37 PM, Albert Astals Cid <aacid at kde.org>
>>>wrote:
>>> >> > On Wednesday, 9 November 2011, Alec Taylor wrote:
>>> >> >> Incorrect, all getDocInfo tells you is what the meta info says,
>>>it
>>> >> >> doesn't analyse the actual document, whereas my pdftopdf will
>>>update
>>> >> >> the metadata with the appropriate info after PDF analysis
>>> >> >
>>> >> > Please do not top post, makes reading e-mail incredibly hard.
>>> >> >
>>> >> > And no it is not incorrect, if the metadata does not have a title,
>>> >> > then the document does not have a title as defined per the spec.
>>> >> >
>>> >> > Albert
>>> >>
>>> >> But maybe the document doesn't have a title, because it was grabbed
>>> >> from scanning the book, then OCRing it. So what I will facilitate is
>>> >> the generation of proper metadata (+ more) from a current PDF
>>>lacking
>>> >> such.
>>> >>
>>> >> So if the document does have a title, my pdftopdf tool will find it,
>>> >> and add it to the metadata.
>>> >>
>>> >> I will contribute pdftopdf to poppler.
>>> >> _______________________________________________
>>> >> poppler mailing list
>>> >> poppler at lists.freedesktop.org
>>> >> http://lists.freedesktop.org/mailman/listinfo/poppler
>>
>> --
>> Peter Kerzum
>> Search Platform Development Group
>> St. Petersburg, tel. 8508
>>