[poppler] pdftotext
Jauco Noordzij
jauco at jauco.nl
Sat Dec 23 08:40:52 PST 2006
That's rather tough, I think; tables are quite hard to handle.
Looking at the txt output, though, it seems it can be parsed into a CSV
file quite easily using sed:
sed 's/^\(- \)\?\(\([^ ]\+ \)\+\) \+\(a[bn]\)\{0,1\} \+/"\2";"\4";/' 501.txt \
  | sed 's/\(\([0-9]\{2\}.[0-9]\{2\}\)\|\( \)\) \{1,3\}/"\1";/g' \
  | sed "s/[0-9]\{2\}.[0-9]\{2\}/'&/g" > 501.csv
This should produce fairly useful output, but it could use some polishing.
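
For comparison, the same idea (pull the station name and the HH.MM times
out of each line and write them as semicolon-separated fields) can be
sketched in Python. This is only a rough sketch under assumptions not
confirmed by the thread: that every time is two digits, a dot, and two
digits, and that everything before the first time on a line is the station
name. The names parse_line and to_csv are made up for illustration. It does
not handle the "ab"/"an" markers that the sed expression looks for.

```python
import csv
import io
import re

# Assumption: a time is two digits, a dot, two digits (e.g. 08.45).
TIME = re.compile(r"\b\d{2}\.\d{2}\b")


def parse_line(line):
    """Split one timetable line into (station, [times]); None if no times."""
    first = TIME.search(line)
    if first is None:
        return None
    # Assumption: everything before the first time is the station name.
    # Runs of whitespace inside the name are collapsed to single spaces.
    station = " ".join(line[:first.start()].split())
    return station, TIME.findall(line)


def to_csv(lines):
    """Return the parsed lines as semicolon-separated CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            station, times = parsed
            writer.writerow([station, *times])
    return out.getvalue()


sample = [
    "Station A 08.45 10.45 12.38 14.38",
    "Station B   08.53   10.53 12.46 14.46",
]
print(to_csv(sample), end="")
```

Note that this throws away the column positions: if a bus skips a station
and the cell is empty, the remaining times shift left. Keeping the columns
would require looking at the character offsets in the -layout output.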
The second output you sent looks nice by itself, though. What's wrong
with it?
On 12/23/06, MS <poppler at 4n4.de> wrote:
> Hi all!
> First of all thanks for the great work which is achieved with poppler.
> I have a question about extracting text out of a pdf.
> Basically, it's about extracting the departures from a timetable stored in
> a PDF (e.g. http://4n4.de/vvs/501.pdf). For that I use "pdftotext
> -layout", which works well but not perfectly.
> The extracted data is the input for another program of mine, which is
> why the text should be in good shape so it can be imported easily.
> If I use the pdftotext provided by poppler (xpdf), the problem is that
> the columns of the PDF are no longer aligned as columns in the exported
> text file.
>
> An example:
> Station A 08.45 10.45 12.38 14.38
> Station B 08.53 10.53 12.46 14.46
> Station C 08.56 10.56 12.56 14.56
> Station D 08.57 10.57 12.57 14.57
>
> I already improved the output of pdftotext by increasing the following
> constant in TextOutputDev.cc:
>
> // Minimum spacing between columns, as a fraction of the font size.
> #define minColSpacing2 0.5  // originally 0.3
>
> But I still have some problems getting a good output file.
>
> Does anyone have a good idea how to get the data out of the PDF?
> Are there any other switches or options I could change to get
> better results? I already tried a couple, but the only real improvement
> was the change mentioned above.
>
> Here are a couple of links:
>
> source pdf:
> http://4n4.de/vvs/501.pdf
>
> original pdftotext-output:
> http://4n4.de/vvs/501.txt.oldversion
>
> modified pdftotext-output:
> http://4n4.de/vvs/501.txt
>
> All right. I would be glad to get some hints from you.
> Michael
--
groeten,
Jauco Noordzij