[poppler] Extra spaces in text when using Poppler pdftotext

William Bader williambader at hotmail.com
Mon May 27 09:03:48 PDT 2013


I have pasted a section of the PDF's content stream below. Tc sets character spacing, Tf sets the font and size, Tm sets the text matrix, Tw sets word spacing, Tz sets horizontal scaling, and Tj shows a string. Maybe the Tc or Tz is throwing off a calculation. The words are mostly in whole strings instead of being placed letter by letter, so pdftotext should be able to figure it out.
If you have time to experiment, you could try tuning maxCharSpacing and maxWideCharSpacing in TextOutputDev.cc. Look for the comment "// merge words".
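To show why such thresholds matter, here is a minimal sketch of a gap-based word splitter. It is not Poppler's code: the function name, the glyph boxes, and the two ratio values are made up for illustration, and it only assumes that a gap is judged relative to the font size.

```python
def split_words(glyphs, font_size, max_gap_ratio):
    """Split glyphs into words wherever the gap to the previous glyph
    exceeds max_gap_ratio * font_size.

    glyphs: list of (char, x_start, x_end), sorted left to right.
    """
    words, current, prev_end = [], "", None
    for ch, x0, x1 in glyphs:
        if prev_end is not None and (x0 - prev_end) > max_gap_ratio * font_size:
            words.append(current)
            current = ""
        current += ch
        prev_end = x1
    if current:
        words.append(current)
    return words


# Hypothetical glyph boxes: 4-unit-wide letters with 1-unit gaps, 9 pt font.
glyphs = [(c, i * 5.0, i * 5.0 + 4.0) for i, c in enumerate("Wasser")]

print(split_words(glyphs, 9.0, 0.05))  # tight threshold: every letter is split off
print(split_words(glyphs, 9.0, 0.20))  # looser threshold: letters stay together
```

With the same glyph positions, the result depends only on the threshold, which is why adjusting the spacing constants is worth trying.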
William

q
595.200 0 0 841.920 0 0 cm
/im1 Do
Q
BT
92.345 Tz
/F4 9 Tf
3 Tr
0.706 Tc
1 0 0 1 47.040 713.760 Tm
(1234) Tj
0 Tc
(5) Tj
1.762 Tw
98 Tz
( -) Tj
0.743 Tw
81.042 Tz
1.542 Tc
( Fra) Tj
0 Tc
(u) Tj
2.632 Tw
86 Tz
1.021 Tc
( Jeann) Tj
0 Tc
(e) Tj
3.066 Tw
69.979 Tz
2.717 Tc
( Larivier) Tj
0 Tc
(e) Tj
3.395 Tw
98 Tz
( -) Tj
-0.238 Tw
86 Tz
1.243 Tc
( Wasserber) Tj
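To see how much extra space Tc adds inside each string, the fragment can be walked with a small script. This is a rough sketch, not a PDF parser: the tokenizer is simplified, the function name is made up, and relating the gap to the font size is an assumption about how a word-break heuristic might judge it. The only rule it relies on is the one in the PDF specification, where the character spacing is added after each glyph and scaled by the horizontal scaling (Tz / 100).

```python
import re

# Text portion of the fragment above, with whitespace between tokens.
STREAM = """
BT 92.345 Tz /F4 9 Tf 3 Tr 0.706 Tc 1 0 0 1 47.040 713.760 Tm (1234) Tj
0 Tc (5) Tj 1.762 Tw 98 Tz ( -) Tj 0.743 Tw 81.042 Tz 1.542 Tc ( Fra) Tj
0 Tc (u) Tj 2.632 Tw 86 Tz 1.021 Tc ( Jeann) Tj 0 Tc (e) Tj
3.066 Tw 69.979 Tz 2.717 Tc ( Larivier) Tj 0 Tc (e) Tj
3.395 Tw 98 Tz ( -) Tj -0.238 Tw 86 Tz 1.243 Tc ( Wasserber) Tj
"""

# Simplified tokens: literal strings, names, numbers, then operators.
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|[-+]?\d*\.?\d+|[A-Za-z*]+")


def text_runs(stream):
    """Return (text, gap, gap / font_size) for every Tj in the stream.

    gap is the extra advance after each glyph caused by Tc, i.e. Tc * Tz / 100.
    Escape sequences inside strings are not decoded.
    """
    tc, tz, size = 0.0, 100.0, 0.0
    operands, runs = [], []
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/+-." or tok[0].isdigit():
            operands.append(tok)  # operand: collect until an operator arrives
            continue
        if tok == "Tc":
            tc = float(operands[-1])
        elif tok == "Tz":
            tz = float(operands[-1])
        elif tok == "Tf":
            size = float(operands[-1])
        elif tok == "Tj":
            gap = tc * tz / 100.0
            runs.append((operands[-1][1:-1], gap, gap / size))
        operands = []  # any operator consumes the pending operands
    return runs


for text, gap, ratio in text_runs(STREAM):
    print(f"{text!r:14} gap={gap:.3f} ratio={ratio:.3f}")
```

Each word is drawn with a nonzero Tc, and only its last letter is drawn with Tc reset to 0, so the gaps between the letters of a word all come from Tc.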

> Date: Mon, 27 May 2013 14:34:39 +0200
> From: runarb at gmail.com
> To: poppler at lists.freedesktop.org
> Subject: [poppler] Extra spaces in text when using Poppler pdftotext
> 
> Hi
> 
> I am using your pdftotext program to extract text from a large number
> of PDF files. Unfortunately some words get extra spaces between the
> characters. For example in one PDF file the word “Wasserberg” appears
> as “W a s s e r b e r g”.
> 
> However if I cut and paste this text from Acrobat Reader the text is
> more correctly formatted.
> 
> The following image explains this better:
> http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the
> yellow text has extra spaces in it in the Poppler version).
> 
> Any thoughts on how I can extract a more correct text?
> 
> 
> 
> An example PDF that gets converted like this is available at
> http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t
> perfect because we had to do some manual modification to remove
> confidential information, but it shows the symptoms correctly.
> 
> The text is from a scan and then it is OCRed using ABBYY. I am using
> Poppler 0.22.4.
> 
> 
> Best regards
> Runar Buvik
> CTO Searchdaimon As
> http://www.searchdaimon.com/
 		 	   		  