<html> <head> <style></style></head> <body class='hmmessage'><div dir='ltr'>Below is a section of the PDF's content stream. Tc sets character spacing, Tf sets the font and size, Tm sets the text matrix, Tr sets the text rendering mode (3 is invisible text, which is typical for an OCR layer), Tw sets word spacing, Tz sets horizontal scaling, and Tj draws a string. Maybe the Tc or Tz is throwing off the word-spacing calculation. The words are mostly in whole strings rather than placed letter by letter, so pdftotext should be able to figure it out.<div><br><div>If you have time to experiment, you could try tuning maxCharSpacing and maxWideCharSpacing in TextOutputDev.cc. <span style="font-size: 12pt;">Look for the comment "// merge words". </span></div><div><br></div><div>William<br><div><div><br></div><div><div>q</div><div>595.200 0 0 841.920 0 0 cm</div><div>/im1 Do</div><div>Q</div><div>BT</div><div>92.345 Tz</div><div>/F4 9 Tf</div><div>3 Tr</div><div>0.706 Tc</div><div>1 0 0 1 47.040 713.760 Tm</div><div>(1234) Tj</div><div>0 Tc</div><div>(5) Tj</div><div>1.762 Tw</div><div>98 Tz</div><div>( -) Tj</div><div>0.743 Tw</div><div>81.042 Tz</div><div>1.542 Tc</div><div>( Fra) Tj</div><div>0 Tc</div><div>(u) Tj</div><div>2.632 Tw</div><div>86 Tz</div><div>1.021 Tc</div><div>( Jeann) Tj</div><div>0 Tc</div><div>(e) Tj</div><div>3.066 Tw</div><div>69.979 Tz</div><div>2.717 Tc</div><div>( Larivier) Tj</div><div>0 Tc</div><div>(e) Tj</div><div>3.395 Tw</div><div>98 Tz</div><div>( -) Tj</div><div>-0.238 Tw</div><div>86 Tz</div><div>1.243 Tc</div><div>( Wasserber) Tj</div><div><br></div><br><div>> Date: Mon, 27 May 2013 14:34:39 +0200<br>> From: runarb@gmail.com<br>> To: poppler@lists.freedesktop.org<br>> Subject: [poppler] Extra spaces in text when using Poppler pdftotext<br>> <br>> Hi<br>> <br>> I am using your pdftotext program to extract text from a large number<br>> of PDF files. Unfortunately some words get extra spaces between the<br>> characters. 
For example, in one PDF file the word “Wasserberg” appears<br>> as “W a s s e r b e r g”.<br>> <br>> However, if I cut and paste this text from Acrobat Reader the text is<br>> more correctly formatted.<br>> <br>> The following image explains this better:<br>> http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the<br>> yellow text has extra spaces in it in the Poppler version).<br>> <br>> Any thoughts on how I can extract more correct text?<br>> <br>> <br>> <br>> An example PDF that gets converted like this is available at<br>> http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t<br>> perfect because we had to do some manual modification to remove<br>> confidential information, but it shows the symptoms correctly.<br>> <br>> The text is from a scan that was then OCRed using ABBYY. I am using<br>> Poppler 0.22.4.<br>> <br>> <br>> Best regards<br>> Runar Buvik<br>> CTO Searchdaimon As<br>> http://www.searchdaimon.com/<br>> _______________________________________________<br>> poppler mailing list<br>> poppler@lists.freedesktop.org<br>> http://lists.freedesktop.org/mailman/listinfo/poppler<br></div></div></div></div></div> </div></body> </html>