[poppler] approaches used for language detection on images ...

Tue Feb 4 12:37:50 UTC 2020

 Hi *:

 I work on PDF files, some of which are image-based (with or
without a text layer), while others are searchable PDFs that include
images of varying quality, with text embedded in various ways. This
would be the typical text I would be dealing with:

 https://www.nysedregents.org/USHistoryGov/Archive/20020122exam.pdf

 Which tools could be used to extract the text in the images?

 As Liam on the gimpusers forum pointed out to me, you need:

 (1) feature extraction, finding the writing,
 (2) OCR of some sort, to turn pictures of letters into letters, and then
 (3) the linguistic analysis.

 Which tools and/or strategies could be used for steps 1-3?
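
 To make the three steps concrete, here is a minimal sketch of one
pipeline I could imagine, not a recommendation. It assumes the
pdftoppm (poppler) and tesseract command-line tools are installed;
as far as I understand, tesseract does its own page segmentation, so
steps 1 and 2 both happen inside it. The stopword lists and function
names are made up for illustration, and the language guess is
deliberately naive.

```python
import subprocess
from pathlib import Path

# Tiny, hand-picked stopword lists (hypothetical, far from complete).
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "von"},
    "fr": {"le", "la", "et", "les", "des", "est", "que", "pour"},
}


def rasterize(pdf, out_prefix, dpi=300):
    """Render every page of the PDF to a PNG file with pdftoppm."""
    subprocess.run(
        ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)],
        check=True,
    )
    prefix = Path(out_prefix)
    # pdftoppm appends "-<page number>.png" to the prefix.
    return sorted(prefix.parent.glob(prefix.name + "-*.png"))


def ocr(image, lang="eng"):
    """Steps 1 and 2: let tesseract find and recognise the text."""
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", lang],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def guess_language(text):
    """Step 3, very naively: the language whose stopwords occur most."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

 Usage would be along the lines of calling rasterize() on the PDF,
then ocr() on each page image and guess_language() on the resulting
text. A real setup would presumably need a proper language
identification library instead of stopword counting.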

 Another example of the kind of file I work with would be:

 https://scholarworks.iu.edu/dspace/bitstream/handle/2022/18961/Notes and Texts.pdf

 On that searchable file, pdftohtml produces one background image
file per page, but when you stratify the content (simply using hash
signatures) you realize most files are of the same kind: blank
background images, or images containing a single line (for example,
underlining a title) or framing a block of text. Then there are
full-page, otherwise blank images with segments of Greek text, ...
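
 By "stratify using hash signatures" I mean something like the
following minimal sketch. It groups files whose bytes are identical
by their SHA-256 digest, so byte-for-byte duplicates (for instance the
blank backgrounds) end up in one group. The function names and the
"*.png" pattern are my own assumptions, not anything pdftohtml
defines.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_digest(named_blobs):
    """Group (name, bytes) pairs by the SHA-256 digest of the bytes.

    Returns a list of name lists, largest group first.
    """
    groups = defaultdict(list)
    for name, data in named_blobs:
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return sorted(groups.values(), key=len, reverse=True)


def read_images(directory, pattern="*.png"):
    """Yield (file name, file contents) for each matching file."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path.name, path.read_bytes()
```

 Calling group_by_digest(read_images("some_output_dir")) would then
list the duplicate backgrounds together. Note this only catches exact
duplicates; images that merely look alike hash differently.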

 Why don't the poppler utils:

 a) underline text segments, since they know their exact X,Y offsets;

 b) encode blocks of text using HTML block elements;

 c) include the images of characters in foreign languages as
character sequences;

 instead of creating a background image for each page for such
purposes?
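
 On point a), the offsets do seem to be exposed already: as far as I
can tell, "pdftohtml -xml" writes one text element per segment with
top, left, width and height attributes. A minimal sketch of reading
them follows. SAMPLE is a hand-written stand-in in that layout, not
real output from the file above, and text_boxes is a name I made up.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for "pdftohtml -xml" output (assumed layout).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="210" height="16" font="0"><b>A Title</b></text>
<text top="140" left="72" width="400" height="14" font="1">Body text on the page.</text>
</page>
</pdf2xml>"""


def text_boxes(xml_source):
    """Return one dict per text segment with its page and bounding box."""
    root = ET.fromstring(xml_source)
    boxes = []
    for page in root.iter("page"):
        for segment in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(segment.get("left")),
                "top": int(segment.get("top")),
                "width": int(segment.get("width")),
                "height": int(segment.get("height")),
                # itertext() also collects text inside <b> or <i> children.
                "text": "".join(segment.itertext()),
            })
    return boxes
```

 With boxes like these, an underline could in principle be placed
from the left, top, width and height values rather than baked into a
page-sized image, which is what prompted my question.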

 Maybe there is a way to work around such hurdles that I don't know
of, and/or someone has already written code to take care of that.

  Do you know of any such code?

 Thank you,
  lbrtchx