[poppler] pdftotext

Jauco Noordzij jauco at jauco.nl
Sat Dec 23 08:40:52 PST 2006


That's rather tough, I think. Tables are quite hard to handle.
Looking at the txt output, it seems it can be parsed into a CSV
quite easily using sed:

sed 's/^\(- \)\?\(\([^ ]\+ \)\+\) \+\(a[bn]\)\{0,1\} \+/"\2";"\4";/' 501.txt |
  sed 's/\(\([0-9]\{2\}.[0-9]\{2\}\)\|\(      \)\) \{1,3\}/"\1";/g' |
  sed "s/[0-9]\{2\}.[0-9]\{2\}/'&/g" > 501.csv

This should give fairly useful output, but it could use some polishing.
The second output you sent looks nice by itself, though. What's wrong
with it?
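
If the sed pipeline gets too hairy to polish, an awk version might be
easier to tweak. The following is only a rough sketch, not tested against
the real 501.txt. It assumes that every time looks like NN.NN and that the
station name is everything before the first time on the line. The function
name layout_to_csv is made up for this example. Blank cells in a row are
not preserved, and an "ab"/"an" marker would end up as part of the name.

```shell
#!/bin/sh
# Sketch: convert "Station A     08.45 10.45" style lines to ;-separated CSV.
# Assumptions: times look like NN.NN; the name is the text before the first time.
layout_to_csv() {
  awk '
    {
      # Find the first NN.NN time; lines without one (headers etc.) are skipped.
      if (!match($0, /[0-9][0-9]\.[0-9][0-9]/)) next

      # Everything before the first time is taken as the station name.
      name = substr($0, 1, RSTART - 1)
      sub(/^[ \t]+/, "", name)
      sub(/[ \t]+$/, "", name)

      # Split the remainder on runs of whitespace and keep only the times.
      n = split(substr($0, RSTART), f, /[ \t]+/)
      line = "\"" name "\""
      for (i = 1; i <= n; i++)
        if (f[i] ~ /^[0-9][0-9]\.[0-9][0-9]$/)
          line = line ";\"" f[i] "\""
      print line
    }
  '
}
```

It reads from stdin, so usage would be something like
"layout_to_csv < 501.txt > 501.csv". Unlike the sed one-liner, which relies
on the GNU extensions \+, \? and \|, this only needs a POSIX awk.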

On 12/23/06, MS <poppler at 4n4.de> wrote:
> Hi all!
> First of all, thanks for the great work that has gone into poppler.
> I have a question about extracting text from a PDF.
> Basically, it's about extracting the departures from a timetable stored in
> a PDF (e.g. http://4n4.de/vvs/501.pdf). For that I use "pdftotext
> -layout", and it works well but not perfectly.
> The extracted text is the input for another program of mine, which is
> why it should be in good shape so that it can be imported easily.
> If I use the pdftotext provided by poppler (xpdf), I have the problem
> that the columns of the PDF are no longer aligned as columns in the
> exported text file.
>
> an example:
> Station A     08.45 10.45 12.38 14.38
> Station B     08.53 10.53 12.46 14.46
> Station C     08.56    10.56   12.56    14.56
> Station D     08.57    10.57   12.57    14.57
>
> I already improved the output of pdftotext by changing
>
> // Minimum spacing between columns, as a fraction of the font size.
> #define minColSpacing2 0.5 (<-- originally 0.3)
>
> in TextOutputDev.cc. But I still have some problems getting a good
> output file.
>
> Does anyone have a good idea how to get the data out of the PDF?
> Are there any other switches/options I could change to get better
> results? I already tried a couple, but the only real improvement was
> the change mentioned above.
>
> Here are a couple of links:
>
> source pdf:
> http://4n4.de/vvs/501.pdf
>
> original pdftotext-output:
> http://4n4.de/vvs/501.txt.oldversion
>
> modified pdftotext-output:
> http://4n4.de/vvs/501.txt
>
> All right. I would be glad to get some hints from you...
>   Michael


-- 
groeten,
     Jauco Noordzij

