<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Tesseract can handle multiple languages in one file. Try “-l eng+ita”, for example. <br><br><div dir="ltr"><div>John Muccigrosso</div></div><div dir="ltr"><br><blockquote type="cite">On 4 Feb 2020, at 19:21, William Bader <williambader@hotmail.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"> <meta http-equiv="Content-Type" content="text/html; charset=Windows-1252"> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> ><span> which tools could be used to extract the text on the images?</span></div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <span><br> </span></div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> $ pdfimages -png 20020122exam.pdf im<br> </div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> $ tesseract im-000.png im-000<br> </div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> Tesseract Open Source OCR Engine v4.1.0 with Leptonica<br> </div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <span>$ cat im-000.txt <br> </span> <div>‘The Vultures’ Roost</div> <div>'Sasca Ga ay, Te Goa Rapa: Poa</div> <div>$ tesseract -l eng im-000.png im-000 hocr</div> <div>Tesseract Open Source OCR Engine v4.1.0 with Leptonica<br> </div> <div><span>$ grep word im-000.hocr | grep -v '> <' | head -10<br> </span> <div> <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/><br> </div> <div> <span class='ocrx_word' id='word_1_5' title='bbox 60 254 80 262; x_wconf 83'>‘The</span><br> </div> <div> <span class='ocrx_word' id='word_1_6' title='bbox 84 254 129 262; x_wconf 89'>Vultures’</span><br> </div> 
<div> <span class='ocrx_word' id='word_1_7' title='bbox 133 254 163 262; x_wconf 91'>Roost</span><br> </div> <div> <span class='ocrx_word' id='word_1_9' title='bbox 5 274 30 280; x_wconf 0'>‘Sasca</span><br> </div> <div> <span class='ocrx_word' id='word_1_10' title='bbox 35 274 52 280; x_wconf 58'>Ga</span><br> </div> <div> <span class='ocrx_word' id='word_1_11' title='bbox 57 274 77 282; x_wconf 71'>ay,</span><br> </div> <div> <span class='ocrx_word' id='word_1_12' title='bbox 83 274 95 280; x_wconf 52'>Te</span><br> </div> <div> <span class='ocrx_word' id='word_1_13' title='bbox 98 274 127 280; x_wconf 47'>Goa</span><br> </div> <div> <span class='ocrx_word' id='word_1_14' title='bbox 130 274 160 281; x_wconf 18'>Rapa:</span><br> </div> <span></span></div> <div><br> </div> <div>You could post-process this or maybe write a more powerful class using CSS.</div> <div><br> </div> <div>I don't know of any open source OCR that supports multiple languages in the same file. Supporting a single language is hard enough.</div> <div><br> </div> <div><span>>Why don't poppler utils:<br> </span><span>>a) underline text segments since they know their exact X,Y offsets;</span><br> </div> <div><span><br> </span></div> <div><span>You could add an option for that or maybe write CSS.</span></div> <div><br> </div> <div><span>$ pdftotext -bbox 20020122exam.pdf <br> </span> <div>$ grep xMin 20020122exam.html | head -10<br> </div> <div> <word xMin="207.337000" yMin="48.999400" xMax="226.855400" yMax="60.395400">The</word><br> </div> <div> <word xMin="229.970600" yMin="48.999400" xMax="280.375900" yMax="60.395400">University</word><br> </div> <div> <word xMin="283.491100" yMin="48.999400" xMax="293.354800" yMax="60.395400">of</word><br> </div> <div> <word xMin="296.470000" yMin="48.999400" xMax="312.523400" yMax="60.395400">the</word><br> </div> <div> <word xMin="315.638600" yMin="48.999400" xMax="340.617400" yMax="60.395400">State</word><br> </div> <div> <word xMin="343.732600" 
yMin="48.999400" xMax="353.596300" yMax="60.395400">of</word><br> </div> <div> <word xMin="356.711500" yMin="48.999400" xMax="379.078900" yMax="60.395400">New</word><br> </div> <div> <word xMin="382.194100" yMin="48.999400" xMax="404.647300" yMax="60.395400">York</word><br> </div> <div> <word xMin="187.461100" yMin="71.999300" xMax="242.047500" yMax="83.395300">REGENTS</word><br> </div> <div> <word xMin="248.771800" yMin="71.999300" xMax="280.536500" yMax="83.395300">HIGH</word><br> </div> <span></span><br> </div> <span></span>Regards, William</div> <div style="font-family: Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);"> <br> </div> <div> <div id="appendonsend"></div> <div style="font-family:Calibri,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)"> <br> </div> <hr tabindex="-1" style="display:inline-block; width:98%"> <div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> poppler <poppler-bounces@lists.freedesktop.org> on behalf of Albretch Mueller <lbrtchx@gmail.com><br> <b>Sent:</b> Tuesday, February 4, 2020 7:37 AM<br> <b>To:</b> poppler@lists.freedesktop.org <poppler@lists.freedesktop.org><br> <b>Subject:</b> [poppler] approches used for language detection on images ...</font> <div> </div> </div> <div class="BodyFragment"><font size="2"><span style="font-size:11pt"> <div class="PlainText">Hi *:<br> <br> I work on pdf files some of which might be image-based (with or<br> without the text included), or searchable pdf which include images of<br> varying quality and with text embedded in various ways. 
This would be<br> the typical text I would be dealing with:<br> <br> <a href="https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf">https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf</a><br> <br> Which tools could be used to extract the text on the images?<br> <br> As Liam on the gimpusers forum pointed out to me, you need:<br> <br> (1) feature extraction, finding the writing,<br> (2) OCR of some sort, to turn pictures of letters into letters, and then<br> (3) the linguistic analysis.<br> <br> Which tools and/or strategies could be used for steps 1-3?<br> <br> Another example of a textual file I work with would be:<br> <br> <a href="https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes">https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes</a><br> and Texts.pdf<br> <br> On that searchable file, pdftohtml produces one background file<br> per page, but when you stratify the content (simply using hash<br> signatures) you realize most files are of the same kind (just blank<br> background images or files containing a single line (for example,<br> underlining a title) or framing a blocked message); then there are<br> full-page blank images with segments of Greek text, ...<br> <br> Why don't poppler utils:<br> <br> a) underline text segments since they know their exact X,Y offsets;<br> <br> b) encode blocked text using html blocks;<br> <br> c) include the image of textual characters in foreign languages as<br> character sequences;<br> <br> instead of creating a background image for each page for such purposes?<br> <br> Maybe there is a way to work around such hurdles that I don't know of, and/or<br> someone has already written code to take care of that.<br> <br> Do you know of such code?<br> <br> Thank you,<br> lbrtchx<br> _______________________________________________<br> poppler mailing list<br> poppler@lists.freedesktop.org<br> <a 
href="https://lists.freedesktop.org/mailman/listinfo/poppler">https://lists.freedesktop.org/mailman/listinfo/poppler</a><br> </div> </span></font></div> </div> </div></blockquote></body></html>