[poppler] pdftotext

Sun Dec 24 07:31:06 PST 2006

Thanks for the nice regex.
Please disregard the part where I mentioned problems with the second
output. It was late the day before I wrote that email :)
Changing 'minColSpacing2' gave me everything I needed.

Jauco Noordzij wrote:
> That's rather tough I think. Tables are quite hard to handle.
> Looking at the txt output, it seems it can be converted to CSV
> quite easily using sed:
>
> sed 's/^\(- \)\?\(\([^ ]\+ \)\+\) \+\(a[bn]\)\{0,1\} \+/"\2";"\4";/' 501.txt |
>   sed 's/\(\([0-9]\{2\}.[0-9]\{2\}\)\|\(      \)\) \{1,3\}/"\1";/g' |
>   sed "s/[0-9]\{2\}.[0-9]\{2\}/'&/g" > 501.csv
>
> That should produce fairly useful output, but it could use some
> polishing. The second output you sent looks nice by itself, though.
> What's wrong with it?
>
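For readers who would rather not decode the sed pipeline, here is a minimal Python sketch of the same idea: strip an optional leading "- ", take the station name, pick up an optional "ab"/"an" marker, and write the times as quoted, semicolon-separated fields. The names `ROW`, `TIME` and `rows_to_csv` are made up for this illustration, and the row layout is assumed from the sed patterns rather than checked against 501.txt. Unlike the sed version, it does not emit empty fields for blank cells and does not prefix times with an apostrophe.

```python
import csv
import io
import re

# Assumed row shape: optional "- ", station name (words separated by single
# spaces), a run of 2+ spaces, optional "ab"/"an" marker, then the times.
ROW = re.compile(
    r"^(?:- )?(?P<name>\S+(?: \S+)*?)\s{2,}(?:(?P<marker>a[bn])\s+)?(?P<rest>.*)$"
)
TIME = re.compile(r"\b\d{2}\.\d{2}\b")  # times written as HH.MM


def rows_to_csv(lines):
    """Return semicolon-separated CSV text for the timetable rows in lines."""
    out = io.StringIO()
    writer = csv.writer(
        out, delimiter=";", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    for line in lines:
        match = ROW.match(line.rstrip())
        if not match:
            continue  # not shaped like a timetable row
        times = TIME.findall(match.group("rest"))
        if not times:
            continue  # headings and other lines without any time
        writer.writerow([match.group("name"), match.group("marker") or "", *times])
    return out.getvalue()
```

Calling `rows_to_csv(open("501.txt"))` would give the CSV text for the whole file. Blank cells are the weak spot: because only the times that are present get collected, a missing departure shifts the later ones one column to the left.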
> On 12/23/06, MS <poppler at 4n4.de> wrote:
>> Hi all!
>> First of all, thanks for the great work that has gone into poppler.
>> I have a question about extracting text from a PDF.
>> Basically, I want to extract the departures from a timetable stored in
>> a PDF (e.g. http://4n4.de/vvs/501.pdf). For this I use "pdftotext
>> -layout", and it works well but not perfectly.
>> The extracted data is the input for another program of mine, which is
>> why the text should be in good shape so that it can be imported
>> easily.
>> If I use the pdftotext provided by poppler (xpdf), the problem is
>> that the columns of the PDF are no longer aligned as columns in the
>> exported text file.
>>
>> An example:
>> Station A     08.45 10.45 12.38 14.38
>> Station B     08.53 10.53 12.46 14.46
>> Station C     08.56    10.56   12.56    14.56
>> Station D     08.57    10.57   12.57    14.57
>>
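If the downstream program only needs the values and not the visual alignment, one option is to split each line on whitespace instead of relying on character positions. The sketch below is an illustration only (`split_row` is a hypothetical helper, not part of pdftotext). It assumes the station name is separated from the first time by at least two spaces and that no time cell is empty.

```python
import re


def split_row(line):
    """Split one layout line into (station, list of times)."""
    # The first run of 2+ whitespace characters ends the station name.
    parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
    station = parts[0]
    # The remaining cells are separated by any amount of whitespace.
    times = parts[1].split() if len(parts) > 1 else []
    return station, times
```

With this, the "Station A" and "Station C" lines from the example both yield a station name plus four times, even though their spacing differs. It breaks down as soon as a cell is blank, because the position of the gap is thrown away.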
>> I already improved the output of pdftotext by increasing this constant
>>
>> // Minimum spacing between columns, as a fraction of the font size.
>> #define minColSpacing2 0.5  // originally 0.3
>>
>> in TextOutputDev.cc. But I still have some problems getting a good
>> output file.
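To make "a fraction of the font size" concrete, here is a toy Python illustration of such a threshold. It is not poppler's code: `is_column_gap` and its parameters are hypothetical, and how TextOutputDev.cc actually applies minColSpacing2 is not shown in this thread.

```python
def is_column_gap(gap, font_size, min_col_spacing):
    """Treat a horizontal gap as a column break if it reaches the threshold.

    gap and font_size are in the same units (e.g. points); min_col_spacing
    is the threshold expressed as a fraction of the font size.
    """
    return gap >= min_col_spacing * font_size
```

Under this reading, a 4 pt gap in 10 pt text counts as a column break with a threshold of 0.3 (3 pt) but not with 0.5 (5 pt), so raising the value makes narrow gaps count as ordinary word spacing.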
>>
>> Does anyone have a good idea how to get the data out of the PDF?
>> Or are there other switches/options that I could change to get
>> better results? I already tried a couple, but the only real improvement
>> was the change I mentioned above.
>>
>> Here are a couple of links:
>>
>> source pdf:
>> http://4n4.de/vvs/501.pdf
>>
>> original pdftotext-output:
>> http://4n4.de/vvs/501.txt.oldversion
>>
>> modified pdftotext-output:
>> http://4n4.de/vvs/501.txt
>>
>> All right. I would be glad to get some hints from you...
>>   Michael