[poppler] pdftotext

MS poppler at 4n4.de
Sat Dec 23 06:32:05 PST 2006


Hi all!
First of all, thanks for the great work that has gone into poppler.
I have a question about extracting text from a PDF.
Basically, I want to extract the departures from a timetable stored in
a PDF (e.g. http://4n4.de/vvs/501.pdf). For this I use "pdftotext
-layout", which works well but not perfectly.
The extracted text is the input for another program of mine, so it
needs to be in a regular shape that is easy to import.
With the pdftotext provided by poppler (xpdf), the problem is that the
columns of the PDF no longer line up as columns in the exported text
file.

An example:
Station A     08.45 10.45 12.38 14.38
Station B     08.53 10.53 12.46 14.46
Station C     08.56    10.56   12.56    14.56  
Station D     08.57    10.57   12.57    14.57  
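
On the import side, rows like the ones above could be read by splitting
on whitespace instead of relying on column positions. This is only a
minimal sketch in Python: parse_row is a hypothetical helper, and it
assumes that every row is a station name followed by times in HH.MM
form with no empty cells. If the timetable has gaps (a departure that
skips a stop), this approach cannot tell which column a time belongs
to, so the column alignment still matters.

```python
import re

# Assumption: a departure time looks like HH.MM, as in the example rows.
TIME = re.compile(r"^\d{2}\.\d{2}$")


def parse_row(line):
    """Split one timetable row into (station name, list of times)."""
    tokens = line.split()
    # Walk back from the end of the row while the tokens look like times;
    # everything before them is treated as the station name.
    first_time = len(tokens)
    while first_time > 0 and TIME.match(tokens[first_time - 1]):
        first_time -= 1
    return " ".join(tokens[:first_time]), tokens[first_time:]


rows = [
    "Station A     08.45 10.45 12.38 14.38",
    "Station C     08.56    10.56   12.56    14.56  ",
]
for row in rows:
    print(parse_row(row))
```

Both rows yield four times each, regardless of how many spaces
separate the columns.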

I already improved the output of pdftotext by changing the following value

// Minimum spacing between columns, as a fraction of the font size.
#define minColSpacing2 0.5  // originally 0.3

in TextOutputDev.cc. But I still have some problems getting a good
output file.

Does anyone have a good idea how I could get the data out of the PDF?
Or are there any other switches or options I could change to get
better results? I already tried a couple, but the only real improvement
came from the change mentioned above.

Here are a couple of links:

source pdf:
http://4n4.de/vvs/501.pdf

original pdftotext-output:
http://4n4.de/vvs/501.txt.oldversion

modified pdftotext-output:
http://4n4.de/vvs/501.txt

All right. I would be glad to get some hints from you.
  Michael

