[poppler] Question on windows version??
Michael D. Setzer II
mikes at guam.net
Fri Jan 15 19:29:46 UTC 2021
Have set up a system that uses the latest poppler pdftohtml on Linux, and it
works fine. Wanted to make the system available to others who use
Windows. Was able to find poppler-0.68.0 for Windows, and for my
needs, it does work fine. The output has some differences from the Linux
version, but not for the data that I extract.
Found a number of pages that state this is an outdated version, but no
real explanation of why the Windows version is no longer being updated.
Did find a page that seemed to have a much newer Windows version,
but when I tried to run it, it asked for a number of
required DLL files, and I was unable to locate them.
My other question: pdftohtml works great to extract the data in a raw
format that basically goes line by line with all the data from the 5 pages
in the pdf file. When I try pdftotext, it puts out the data column
by column, which is impossible to process.
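For anyone comparing the two tools: pdftotext has two options that change its reading order, -layout (keep the physical layout as best it can) and -raw (keep content stream order). The sketch below just runs both so the results can be compared side by side. The function name compare_text_modes and the output file names are made up for illustration, and I have not checked how either mode behaves on this particular PDF.

```shell
#!/bin/bash
# Sketch: run pdftotext in two alternative modes for comparison.
# compare_text_modes is a hypothetical helper name, not part of poppler.
compare_text_modes() {
    local pdf="$1"
    local base="${pdf%.*}"

    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found in PATH" >&2
        return 127
    fi

    # -layout: maintain the original physical layout as best as possible,
    # which tends to keep each table row on one output line.
    pdftotext -layout "$pdf" "$base.layout.txt" || return

    # -raw: keep the text in content stream order.
    pdftotext -raw "$pdf" "$base.raw.txt" || return
}
```

Usage would be something like `compare_text_modes staffing.pdf`, then looking at staffing.layout.txt and staffing.raw.txt to see which one, if either, comes out row by row.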
Use the pdftohtml to extract raw info, then use a cpp program to convert
it to a csv file.
So, does what I need, but wondering why the differences in output.
PDF file has staffing pattern data from a spreadsheet converted to pdf.
No access to spreadsheet, and trying to copy it directly doesn't work.
#!/bin/bash
if [[ $# -eq 0 ]] ; then
    echo 'Need to provide name staffing pattern pdf saved from firefox with extension'
    ls -1 *.pdf
    exit 1
fi
f=${1%.*}
#Convert data from pdf file. Trim lines before data start. Eliminate useless lines fix issue
#Change non-break space to regular space, and change 3 byte - to regular - ; space before ;
time pdftohtml -nomerge -noframes -q "$f".pdf
# Fix 3 typos - Thought they were fixed?? Moved it into program to fix both linux and windows process.
#sed -i 's/Accomodative/Accommodative/g;s/Telecomunications/Telecommunications/g;s/Administative/Administrative/g' "$f".html
time ./fixf2b4 "$f".html
#libreoffice --infilter=CSV:59,34,76,1 "$f".csv
real 0m0.064s
user 0m0.059s
sys 0m0.004s
real 0m0.005s
user 0m0.003s
sys 0m0.002s
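The clean-up that the script comments describe (non-break space to regular space, 3 byte dash to regular dash, plus the three typo fixes) could be sketched as a small shell filter like the one below. This is not the fixf2b4 program, only an illustration. The name normalize_text is made up, the 3 byte dash is assumed to be the en dash (U+2013), and the &#160; entity is assumed to be how the non-break space shows up in the HTML.

```shell
#!/bin/bash
# Sketch of the text clean-up described in the script comments.
# normalize_text is a hypothetical name; it reads stdin and writes stdout.
normalize_text() {
    local nbsp dash
    nbsp=$(printf '\302\240')        # U+00A0 non-break space as UTF-8 bytes
    dash=$(printf '\342\200\223')    # U+2013 en dash (assumed 3 byte dash)

    sed -e "s/${nbsp}/ /g" \
        -e 's/&#160;/ /g' \
        -e "s/${dash}/-/g" \
        -e 's/Accomodative/Accommodative/g' \
        -e 's/Telecomunications/Telecommunications/g' \
        -e 's/Administative/Administrative/g'
}
```

It could be used as `normalize_text < "$f".html > "$f".clean.html`, where the .clean.html name is again just an example.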
Have told them of the typos, but they have not fixed them yet.
Love that the pdftohtml works great to extract the raw data.
Sure it must take a good deal of coding.
Thanks and be Safe.