[poppler] Question on windows version??

Michael D. Setzer II mikes at guam.net
Fri Jan 15 19:29:46 UTC 2021


Have set up a system that uses the latest poppler pdftohtml on Linux, and it 
works fine. Wanted to make the system available to others who use 
Windows. Was able to find poppler-0.68.0 for Windows, and for my 
needs, it works fine. The output has some differences from the Linux 
version, but not in the data that I extract.

Found a number of pages that state this is an outdated version, but no 
real explanation of why the Windows version is no longer being updated.

Did find a page that seemed to have a much newer Windows version, 
but when I tried to run it, it reported a number of missing required 
DLL files, and I was unable to locate them.

My other question: pdftohtml works great to extract the data in a raw 
format that basically goes line by line through all the data from the 5 pages 
in the pdf file. When I try pdftotext, it puts out the data column 
by column, which is impossible to process.
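
For comparison, pdftotext has two options that change how text is ordered: -layout (maintain the original physical layout) and -raw (keep strings in content stream order). Whether either gives usable line-by-line output for this particular file is untested here. A minimal sketch, where the function name and the output file names are only illustrative:

```shell
#!/bin/bash
# Hypothetical helper (not part of poppler): write two alternative
# text extractions of one PDF so they can be compared side by side.
pdf_text_variants() {
    local pdf=$1
    [[ -f $pdf ]] || { echo "no such file: $pdf" >&2; return 1; }
    command -v pdftotext >/dev/null || { echo "pdftotext not found" >&2; return 2; }
    local base=${pdf%.*}
    # -layout keeps the physical layout, so table rows tend to stay on one line
    pdftotext -layout "$pdf" "$base-layout.txt"
    # -raw keeps strings in the order they appear in the content stream
    pdftotext -raw "$pdf" "$base-raw.txt"
}
```

Calling pdf_text_variants with the PDF name would leave a -layout.txt and a -raw.txt file next to it.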

Use the pdftohtml to extract raw info, then use a cpp program to convert 
it to a csv file. 
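
The source of the conversion program is not included here. As a rough stand-in only, the sketch below assumes that each data line in the pdftohtml output ends in <br/> and that every record is a fixed number of consecutive lines; both are assumptions, and the function name is made up. It writes semicolon-separated, double-quoted fields, which matches the 59,34 separator and quote codes in the libreoffice filter line of the script further down.

```shell
#!/bin/bash
# Hypothetical stand-in for the real converter: reads pdftohtml -noframes
# output on stdin and groups every N text lines into one CSV row.
html_lines_to_csv() {
    local n=${1:-3}   # fields per record (assumed fixed; default is arbitrary)
    # keep only lines ending in <br/>, then strip that and any other tags
    sed -n '/<br\/>$/{s/<br\/>$//;s/<[^>]*>//g;p;}' |
    awk -v n="$n" '{
        gsub(/"/, "\"\"")   # double any embedded quotes
        printf "%s\"%s\"", ((NR - 1) % n == 0 ? "" : ";"), $0
        if (NR % n == 0) printf "\n"
    } END { if (NR % n != 0) printf "\n" }'
}
```

Usage would be along the lines of: html_lines_to_csv 4 < file.html > file.csv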

So, it does what I need, but I am wondering why the two outputs differ.
The PDF file has staffing pattern data from a spreadsheet converted to pdf.
No access to the spreadsheet, and trying to copy the data directly doesn't work.

#!/bin/bash
if [[ $# -eq 0 ]] ; then
    echo 'Need to provide name of staffing pattern pdf saved from firefox, with extension'
    ls -1 *.pdf
    exit 1
fi
f=${1%.*}
# Convert data from pdf file. Trim lines before data start. Eliminate useless lines, fix &nbsp; issue.
# Change non-break space to regular space, and change 3 byte - to regular - ; space before ;
time pdftohtml -nomerge -noframes -q "$f".pdf
# Fix 3 typos - Thought they were fixed?? Moved it into program to fix both linux and windows process.
#sed -i 's/Accomodative/Accommodative/g;s/Telecomunications/Telecommunications/g;s/Administative/Administrative/g' "$f".html
time ./fixf2b4 "$f".html
#libreoffice --infilter=CSV:59,34,76,1 "$f".csv

real	0m0.064s
user	0m0.059s
sys	0m0.004s

real	0m0.005s
user	0m0.003s
sys	0m0.002s
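
The character cleanup mentioned in the script comments now lives inside the compiled program, which is not shown. A shell sketch of the same idea follows. It assumes the "3 byte -" is a UTF-8 en dash (U+2013) or hyphen (U+2010), which is a guess, and the function name is made up; the three spelling fixes are taken from the commented-out sed line.

```shell
#!/bin/bash
# Hypothetical filter: normalize pdftohtml output read from stdin.
normalize_html_text() {
    local nbsp=$'\xc2\xa0'         # U+00A0 non-breaking space as UTF-8 bytes
    local endash=$'\xe2\x80\x93'   # U+2013 en dash (assumed "3 byte -")
    local hyphen=$'\xe2\x80\x90'   # U+2010 hyphen (assumed alternative)
    # LC_ALL=C makes sed match the multi-byte sequences byte for byte
    LC_ALL=C sed \
        -e "s/&#160;/ /g" \
        -e "s/&nbsp;/ /g" \
        -e "s/${nbsp}/ /g" \
        -e "s/${endash}/-/g" \
        -e "s/${hyphen}/-/g" \
        -e 's/Accomodative/Accommodative/g' \
        -e 's/Telecomunications/Telecommunications/g' \
        -e 's/Administative/Administrative/g'
}
```

It could sit between the two steps of the script, for example: normalize_html_text < "$f".html > "$f".clean.html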

Have told them of the typos, but they haven't fixed them yet. 
Love that the pdftohtml works great to extract the raw data.
Sure it must take a good deal of coding.
Thanks and be Safe.