[poppler] Issue with converting pdf files to text to then import to spreadsheet.
Michael D. Setzer II
mikes at guam.net
Mon Sep 14 11:17:44 UTC 2020
This is long, but thought I would ask. Part of the problem is with the
PDF file that I have no control over how it is created, but perhaps I'm
missing something simple. Have a solution that works about 95%, but
does need manual tweaking. Thanks for the time. Just joined list, don't
know how active it is, but thought I would ask. Just sitting at home with
no much to do.. Be Safe.
Have been pulling a staffing pattern pdf file for some time from my
college. The data seems to be converted from an excell spreadsheet, but
the data can not be copied directly??
https://guamcc.edu/sites/default/files/2020_07_31_staffing_pattern_final.
pdf
Is the latest version of file.
If I copy and paste from Firefox to libreoffice calc, it puts everything in
cell A1 with no delimeters between fields?
Same thing if I paste to a text editor (gedit).
Using google chrome get the same if I paste to spreadsheet, but pasting
to the text editor it shows the data line by line, but no delimeter between
fields.
Example line of data
B00010286 Guam Community College FA 00 2 PRE005 1010 Office of
the President President Okada, Mary A. 06/16/07 R-16-a 01/01/21
185,378 48,717 0 2,688 186 4,273 1,301 26 57,165 242,543
Some fields are single items, while other can have variable number of
words. Have used a complex sed script to add ; delimeters, and correctly
distinguish fields.
Saving the pdf file to disk, and trying utilities.
using pdftotext it creates a txt file, but it is rather strange in the output.
Seems to show two columns of first page, then it shows the next column,
and then mixed number of columns so is useless. File 5687 lines long,
but main data should be 252 lines plus headers..
pdf2txt.py-2.7 does the same but is 5527 lines long?
Have much better luck using pdftohtml.
With the following script it converts it to an html file that has all the
fields listed basically one by one. Have to go thru a number of steps to
get it fixed. Has some errors with empty fields that require manual
corrections.
pdftohtml -nomerge -noframes $1.pdf
sed -e 's/;<br/<br/g' $1.html |sed -e 's_<br/>_;_g' |tr '\n' ' ' | sed -e 's/; /;/g'
|sed -e 's/B0/\nB0/g' | sed -e 's/<b>/\n<b>/g' |sed -e 's/<hr/\n<hr/g' | grep
"B0" | sed -e 's/\&/\& /g' | sed -e 's/\ //g' >$1.csv
sed -i 's/Accomodative/Accommodative/g' $1.csv
sed -i 's/Telecomunications/Telecommunications/g' $1.csv
sed -i 's/Administative/Administrative/g' $1.csv
gedit $1.csv
gawk -i inplace '$5=++i' FS=\; OFS=\; $1.csv
libreoffice $1.csv
The first sed line does most of the cleaning.
There are a bunch of that seem to be randomly in file. The first
one needs leave the ; or it throws off columns. Then change <br/> to ; to
be field separators; then convert all lines to a long single line by replace
newlines with spaces. Then clean the spaces after ; to make it cleaner.
Then to make the lines separate, the first field of data lines start with a
B0 (all but 3, that have the fields blank?). Other things I don't need start
with <b> or <hr, so make those separate lines as well.
Then use grep to only get the B0 lines. File has some & and  
in it. For some reason, fields that have this are combined with the next
field, so have to leave the ;??
The next sed lines just fix spelling errors I've found in file.
I then edit the file, since the 3 lines that don't have the B0 are added to
the previous line, so I have to insert a newline, and add blank; to be that
field. There is one other field on one record that is blank, so it also needs
to be fixed.
Save the changes.
The gawk line has nother to do with conversion, but that column use to
have the record numbers, but in this version of file, the first 200 or so do,
but then the number jump to higher values, and even have records with
same numbers, so I just replace the field with the number 1 to 252.
File is the imported to libreoffice calc with ; set as only delimeter.
Final fix is that 4 records have the field in column P as empty, so data is
all shifted over by one, so just need to insert the blank cell and shift
right.
Don't know if there is an easier way, or why the other options didn't
seem to work for me. Thanks sorry for the length.
+------------------------------------------------------------+
Michael D. Setzer II - Computer Science Instructor (Retired)
mailto:mikes at guam.net
mailto:msetzerii at gmail.com
Guam - Where America's Day Begins
G4L Disk Imaging Project maintainer
http://sourceforge.net/projects/g4l/
+------------------------------------------------------------+
More information about the poppler
mailing list