[poppler] approaches used for language detection on images ...

Tue Feb 4 21:38:39 UTC 2020

Tesseract can handle multiple languages in one file. Try "-l eng+ita", for example.
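
As a rough sketch of how that option could be scripted, the helper below only builds the argument list and does not run anything. The function name and its parameters are hypothetical; the argument order follows the "tesseract -l eng ..." call quoted further down, and it assumes the trained data for every listed language is installed.

```python
import shlex

def tesseract_command(image, out_base, languages, output_format=None):
    """Build the argument list for a multi-language Tesseract run.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes several languages joined with "+", e.g. "eng+ita".
    cmd = ["tesseract", "-l", "+".join(languages), image, out_base]
    if output_format:
        cmd.append(output_format)  # e.g. "hocr", as in the quoted example
    return cmd

cmd = tesseract_command("im-000.png", "im-000", ["eng", "ita"])
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the language packs are known to be present.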

John Muccigrosso

> On 4 Feb 2020, at 19:21, William Bader <williambader at hotmail.com> wrote:
> 
> 
> > Which tools could be used to extract the text on the images?
> 
> $ pdfimages -png 20020122exam.pdf im
> $ tesseract im-000.png im-000
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> $ cat im-000.txt 
> ‘The Vultures’ Roost
> 'Sasca Ga ay, Te Goa Rapa: Poa
> $ tesseract -l eng im-000.png im-000 hocr
> Tesseract Open Source OCR Engine v4.1.0 with Leptonica
> $ grep word im-000.hocr | grep -v '> <' | head -10
>   <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
>       <span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span>
>       <span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span>
>       <span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span>
>       <span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span>
>       <span class='ocrx_word' id='word_1_10' title='bbox 35 274 52 280; x_wconf 58'>Ga</span>
>       <span class='ocrx_word' id='word_1_11' title='bbox 57 274 77 282; x_wconf 71'>ay,</span>
>       <span class='ocrx_word' id='word_1_12' title='bbox 83 274 95 280; x_wconf 52'>Te</span>
>       <span class='ocrx_word' id='word_1_13' title='bbox 98 274 127 280; x_wconf 47'>Goa</span>
>       <span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span>
> 
> You could post-process this or maybe write a more powerful class using CSS.
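
One possible form of that post-processing, as a minimal sketch: read the ocrx_word spans from the hOCR output and keep only the words whose x_wconf is above a cutoff. The class and function names are hypothetical, the cutoff of 60 is an arbitrary assumption, and the sample is a subset of the spans shown above.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
CONF_RE = re.compile(r"x_wconf (\d+)")

class HocrWordParser(HTMLParser):
    """Collect (text, bbox, confidence) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._title = None  # title attribute of the span being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            self._title = attrs.get("title") or ""
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._title is None:
            return
        bbox = BBOX_RE.search(self._title)
        conf = CONF_RE.search(self._title)
        if bbox and conf:
            self.words.append((
                "".join(self._text),
                tuple(int(n) for n in bbox.groups()),
                int(conf.group(1)),
            ))
        self._title = None

def confident_words(hocr, min_conf=60):
    """Return the words whose x_wconf is at least min_conf (assumed cutoff)."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return [w for w in parser.words if w[2] >= min_conf]

sample = """
<span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span>
<span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span>
<span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span>
<span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span>
<span class='ocrx_word' id='word_1_10' title='bbox 35 274 52 280; x_wconf 58'>Ga</span>
<span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span>
"""

for text, bbox, conf in confident_words(sample):
    print(text, bbox, conf)
```

With this sample, only the three words of the title line pass the cutoff; the garbled second line is dropped.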
> 
> I don't know of any open source OCR that supports multiple languages in the same file. Supporting a single language is hard enough.
> 
> > Why don't poppler utils:
> > a) underline text segments since they know their exact X,Y offsets;
> 
> You could add an option for that or maybe write CSS.
> 
> $ pdftotext -bbox 20020122exam.pdf 
> $ grep xMin 20020122exam.html | head -10
>     <word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word>
>     <word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word>
>     <word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word>
>     <word xMin="296.470000" yMin="48.999400" xMax="312.523400" yMax="60.395400">the</word>
>     <word xMin="315.638600" yMin="48.999400" xMax="340.617400" yMax="60.395400">State</word>
>     <word xMin="343.732600" yMin="48.999400" xMax="353.596300" yMax="60.395400">of</word>
>     <word xMin="356.711500" yMin="48.999400" xMax="379.078900" yMax="60.395400">New</word>
>     <word xMin="382.194100" yMin="48.999400" xMax="404.647300" yMax="60.395400">York</word>
>     <word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word>
>     <word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word>
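
As a sketch of what could be done with these boxes, the code below reads the word elements and merges consecutive words that share a yMin into one box per text line, which is roughly what an underlining option or a CSS rule would need. The function names are hypothetical, and the regular expression assumes the attribute order shown in the output above.

```python
import html
import re
from itertools import groupby

# Assumes the attribute order seen in the pdftotext -bbox output above.
WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="([\d.]+)" xMax="([\d.]+)" yMax="([\d.]+)">(.*?)</word>'
)

def parse_words(bbox_html):
    """Return (xMin, yMin, xMax, yMax, text) for every word element."""
    return [
        (float(x0), float(y0), float(x1), float(y1), html.unescape(text))
        for x0, y0, x1, y1, text in WORD_RE.findall(bbox_html)
    ]

def group_lines(words):
    """Merge consecutive words with the same yMin into one box per line."""
    lines = []
    for _, group in groupby(words, key=lambda w: w[1]):
        group = list(group)
        lines.append((
            min(w[0] for w in group),      # left edge of the line
            group[0][1],                   # shared yMin
            max(w[2] for w in group),      # right edge of the line
            max(w[3] for w in group),      # largest yMax in the line
            " ".join(w[4] for w in group),
        ))
    return lines

sample = """
    <word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word>
    <word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word>
    <word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word>
    <word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word>
    <word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word>
"""

for line in group_lines(parse_words(sample)):
    print(line)
```

Grouping on an exact yMin match is the simplest choice; text with mixed font sizes on one line would probably need a tolerance instead.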
> 
> Regards, William
> 
> 
> From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of Albretch Mueller <lbrtchx at gmail.com>
> Sent: Tuesday, February 4, 2020 7:37 AM
> To: poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
> Subject: [poppler] approaches used for language detection on images ...
>  
> Hi *:
> 
>  I work on PDF files, some of which might be image-based (with or
> without the text included), or searchable PDFs which include images of
> varying quality and with text embedded in various ways. This would be
> the typical text I would be dealing with:
> 
>  https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf
> 
>  Which tools could be used to extract the text on the images?
> 
>  As Liam on the gimpusers forum pointed out to me, you need:
> 
>  (1) feature extraction, finding the writing,
>  (2) OCR of some sort, to turn pictures of letters into letters, and then
>  (3) the linguistic analysis.
> 
>  Which tools and/or strategies could be used for steps 1-3?
> 
>  Another example of a textual file I work with would be:
> 
>  https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf
> 
>  On that searchable file, pdftohtml produces one background file
> per page, but when you group the files by content (simply using hash
> signatures) you find that most of them are of the same kind: blank
> background images, or files containing a single line (for example,
> underlining a title) or a frame around a blocked message. Then there
> are full-page blank images with segments of Greek text, ...
> 
>  Why don't poppler utils:
> 
>  a) underline text segments since they know their exact X,Y offsets;
> 
>  b) encode blocked text using HTML blocks;
> 
>  c) include the images of textual characters in foreign languages as
> character sequences;
> 
>  instead of creating a background image for each page for such purposes?
> 
>  Maybe there is a way to work around such hurdles that I don't know
> of, and/or someone has already written code to take care of that.
> 
>   Do you know of any such code?
> 
>  Thank you,
>   lbrtchx
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler