[poppler] pdftotext
MS
poppler at 4n4.de
Sat Dec 23 06:32:05 PST 2006
Hi all!
First of all, thanks for the great work that has gone into poppler.
I have a question about extracting text from a PDF.
Basically, I want to extract the departures from a timetable stored in
a PDF (e.g. http://4n4.de/vvs/501.pdf). For this I use "pdftotext
-layout", which works well but not perfectly.
The extracted text is the input data for another program of mine, so it
needs to be in a shape that is easy to import.
With the pdftotext provided by poppler (originally from xpdf), the problem
is that the columns of the PDF no longer line up as columns in the
exported text file.
An example:
Station A 08.45 10.45 12.38 14.38
Station B 08.53 10.53 12.46 14.46
Station C 08.56 10.56 12.56 14.56
Station D 08.57 10.57 12.57 14.57
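To illustrate what the import side needs, here is a minimal sketch of how such lines could be parsed. It is not my actual program; the names TIME_RE and parse_row are made up for this example, and it assumes every time has the form HH.MM as in the lines above. It also only works when no cell in a row is empty, because a plain token split loses the column position, which is why I need the columns to line up.

```python
import re

# Assumption: departure times look like "08.45" (HH.MM), as in the example.
TIME_RE = re.compile(r"\b\d{2}\.\d{2}\b")


def parse_row(line):
    """Split one text line into (station name, list of time strings).

    Hypothetical helper: everything before the first time is treated
    as the station name. Empty cells are not detected.
    """
    times = TIME_RE.findall(line)
    first = TIME_RE.search(line)
    name = line[:first.start()] if first else line
    return name.strip(), times


if __name__ == "__main__":
    sample = [
        "Station A 08.45 10.45 12.38 14.38",
        "Station B 08.53 10.53 12.46 14.46",
    ]
    for row in sample:
        print(parse_row(row))
```

If a departure is missing in the middle of a row, this sketch silently shifts the later times to the left, so the column position in the text output really matters.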
I already improved the output of pdftotext by changing the following
define in TextOutputDev.cc:
// Minimum spacing between columns, as a fraction of the font size.
#define minColSpacing2 0.5 (<-- originally 0.3)
But I still have some problems getting a good output file.
Does anyone have a good idea how I could get the data out of the PDF?
Or are there any other switches/options I could change to get better
results? (I have already tried a couple, but the only real improvement
was the change mentioned above.)
Here are a couple of links:
source pdf:
http://4n4.de/vvs/501.pdf
original pdftotext-output:
http://4n4.de/vvs/501.txt.oldversion
modified pdftotext-output:
http://4n4.de/vvs/501.txt
I would be glad to get some hints from you.
Michael