[poppler] approaches used for language detection on images ...

Tue Feb 4 18:21:47 UTC 2020

> Which tools could be used to extract the text on the images?

$ pdfimages -png 20020122exam.pdf im
$ tesseract im-000.png im-000
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
$ cat im-000.txt
‘The Vultures’ Roost
'Sasca Ga ay, Te Goa Rapa: Poa
$ tesseract -l eng im-000.png im-000 hocr
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
$ grep word im-000.hocr | grep -v '> <' | head -10
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
      <span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span>
      <span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span>
      <span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span>
      <span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span>
      <span class='ocrx_word' id='word_1_10' title='bbox 35 274 52 280; x_wconf 58'>Ga</span>
      <span class='ocrx_word' id='word_1_11' title='bbox 57 274 77 282; x_wconf 71'>ay,</span>
      <span class='ocrx_word' id='word_1_12' title='bbox 83 274 95 280; x_wconf 52'>Te</span>
      <span class='ocrx_word' id='word_1_13' title='bbox 98 274 127 280; x_wconf 47'>Goa</span>
      <span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span>

You could post-process this hOCR output (each word carries a bounding box and an x_wconf confidence value), or maybe write a more powerful class using CSS.
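
As a rough sketch of such post-processing (my own illustration, not something Tesseract ships): the Python below pulls the ocrx_word spans out of the hOCR text and keeps only the words above a confidence threshold. The names Word, parse_words and confident_words, and the threshold of 80, are assumptions made up for this example. The regular expression only matches spans shaped like the ones printed above, so words that contain nested tags are skipped.

```python
import html
import re
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    text: str
    bbox: Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels
    conf: int                        # x_wconf as reported by Tesseract


# Matches spans shaped like the Tesseract 4.1 output shown above.
# Spans whose content holds nested tags (<strong>, <em>) do not match.
HOCR_WORD_RE = re.compile(
    r"<span class='ocrx_word'[^>]*"
    r"title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'>"
    r"([^<]*)</span>"
)


def parse_words(hocr: str) -> List[Word]:
    """Return every plain ocrx_word span found in the hOCR text."""
    words = []
    for x0, y0, x1, y1, conf, text in HOCR_WORD_RE.findall(hocr):
        bbox = (int(x0), int(y0), int(x1), int(y1))
        words.append(Word(html.unescape(text), bbox, int(conf)))
    return words


def confident_words(hocr: str, min_conf: int = 80) -> List[Word]:
    """Keep only the words whose x_wconf is at least min_conf."""
    return [w for w in parse_words(hocr) if w.conf >= min_conf]


# A few lines copied from the grep output above.
SAMPLE_HOCR = """
<span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span>
<span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span>
<span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span>
<span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span>
<span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span>
"""

for word in confident_words(SAMPLE_HOCR, min_conf=80):
    print(word.conf, word.bbox, word.text)
```

For the real file you would pass in the contents of im-000.hocr instead of SAMPLE_HOCR. A fixed threshold is crude: short garbage words sometimes score fairly high (see "ay," at 71 above), so treat the cut-off as something to tune per document.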

I don't know of any open source OCR that supports multiple languages in the same file. Supporting a single language is hard enough.

>Why don't poppler utils:
>a) underline text segments since they know their exact X,Y offsets;

You could add an option for that or maybe write CSS.

$ pdftotext -bbox 20020122exam.pdf
$ grep xMin 20020122exam.html | head -10
    <word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word>
    <word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word>
    <word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word>
    <word xMin="296.470000" yMin="48.999400" xMax="312.523400" yMax="60.395400">the</word>
    <word xMin="315.638600" yMin="48.999400" xMax="340.617400" yMax="60.395400">State</word>
    <word xMin="343.732600" yMin="48.999400" xMax="353.596300" yMax="60.395400">of</word>
    <word xMin="356.711500" yMin="48.999400" xMax="379.078900" yMax="60.395400">New</word>
    <word xMin="382.194100" yMin="48.999400" xMax="404.647300" yMax="60.395400">York</word>
    <word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word>
    <word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word>
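
As a sketch of what could be done with these coordinates (again my own illustration, not a poppler feature): the Python below groups the words that share the same yMax and returns one underline segment per text line, running from the leftmost xMin to the rightmost xMax. The name underline_segments is made up for this example. Grouping on an exactly equal yMax is a simplifying assumption that holds for the output above but would break on lines mixing font sizes, and the sketch ignores page boundaries.

```python
import html
import re
from collections import defaultdict
from typing import List, Tuple

# Matches <word> elements shaped like the pdftotext -bbox output above.
BBOX_WORD_RE = re.compile(
    r'<word xMin="([\d.]+)" yMin="([\d.]+)" '
    r'xMax="([\d.]+)" yMax="([\d.]+)">([^<]*)</word>'
)

Segment = Tuple[float, float, float, str]  # x_start, x_end, y, line text


def underline_segments(bbox_html: str) -> List[Segment]:
    """Return one underline segment per group of words sharing a yMax.

    Assumes a single page and that words on one text line have
    identical yMax values.
    """
    lines = defaultdict(list)
    for x_min, _y_min, x_max, y_max, text in BBOX_WORD_RE.findall(bbox_html):
        lines[float(y_max)].append(
            (float(x_min), float(x_max), html.unescape(text))
        )

    segments = []
    for y in sorted(lines):           # top of the page first
        words = sorted(lines[y])      # left to right by xMin
        x_start = words[0][0]
        x_end = max(w[1] for w in words)
        line_text = " ".join(w[2] for w in words)
        segments.append((x_start, x_end, y, line_text))
    return segments


# A few lines copied from the grep output above.
SAMPLE_BBOX = """
<word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word>
<word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word>
<word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word>
<word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word>
<word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word>
"""

for x_start, x_end, y, line_text in underline_segments(SAMPLE_BBOX):
    print(f"underline from x={x_start} to x={x_end} at y={y}: {line_text}")
```

Each segment could then be drawn as a CSS border or an absolutely positioned rule in the generated HTML.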

Regards, William

________________________________
From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of Albretch Mueller <lbrtchx at gmail.com>
Sent: Tuesday, February 4, 2020 7:37 AM
To: poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
Subject: [poppler] approaches used for language detection on images ...

Hi *:

 I work on PDF files, some of which might be image-based (with or
without the text included), or searchable PDFs that include images of
varying quality and with text embedded in various ways. This would be
the typical text I would be dealing with:

 https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

 Which tools could be used to extract the text on the images?

 As Liam on the gimpusers forum pointed out to me, you need:

 (1) feature extraction, finding the writing,
 (2) OCR of some sort, to turn pictures of letters into letters, and then
 (3) the linguistic analysis.

 Which tools and/or strategies could be used for steps 1-3?

 Another example of a text file I work with would be:

 https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf

 On that searchable file, pdftohtml produces one background file
per page, but when you stratify the content (simply using hash
signatures) you realize that most files are of the same kind: just blank
background images, or files containing a single line (for example,
underlining a title) or a frame around a blocked message. Then there are
full-page blank images with segments of Greek text, ...

 Why don't poppler utils:

 a) underline text segments since they know their exact X,Y offsets;

 b) encode blocked text using html blocks;

 c) include the image of textual characters in foreign languages as
character sequences;

 instead of creating for such purposes a background image for each page?

 Maybe there is a way to work around such hurdles that I don't know of,
and/or someone has already written code to take care of that.

 Do you know of such code?

 Thank you,
  lbrtchx