[poppler] Issue with converting pdf files to text to then import to spreadsheet.

Mon Sep 14 11:17:44 UTC 2020

This is long, but thought I would ask. Part of the problem is with the 
PDF file that I have no control over how it is created, but perhaps I'm 
missing something simple. Have a solution that works about 95%, but 
does need manual tweaking.  Thanks for the time. Just joined list, don't 
know how active it is, but thought I would ask. Just sitting at home with 
no much to do.. Be Safe.

Have been pulling a staffing pattern pdf file for some time from my 
college. The data seems to be converted from an excell spreadsheet, but 
the data can not be copied directly??

https://guamcc.edu/sites/default/files/2020_07_31_staffing_pattern_final.
pdf

Is the latest version of file.
If I copy and paste from Firefox to libreoffice calc, it puts everything in 
cell A1 with no delimeters between fields?

Same thing if I paste to a text editor (gedit).

Using google chrome get the same if I paste to spreadsheet, but pasting 
to the text editor it shows the data line by line, but no delimeter between 
fields.

Example line of data
B00010286 Guam Community College FA 00 2 PRE005 1010 Office of 
the President President Okada, Mary A. 06/16/07 R-16-a 01/01/21 
185,378 48,717 0 2,688 186 4,273 1,301 26 57,165 242,543

Some fields are single items, while other can have variable number of 
words. Have used a complex sed script to add ; delimeters, and correctly 
distinguish fields. 

Saving the pdf file to disk, and trying utilities.
using pdftotext it creates a txt file, but it is rather strange in the output.
Seems to show two columns of first page, then it shows the next column, 
and then mixed number of columns so is useless. File 5687 lines long, 
but main data should be 252 lines plus headers..

pdf2txt.py-2.7 does the same but is 5527 lines long?

Have much better luck using pdftohtml.
With the following script it converts it to an html file that has all the 
fields listed basically one by one. Have to go thru a number of steps to 
get it fixed. Has some errors with empty fields that require manual 
corrections.

pdftohtml -nomerge -noframes $1.pdf
sed -e 's/;<br/<br/g' $1.html |sed -e 's_<br/>_;_g' |tr '\n' ' ' | sed -e 's/; /;/g' 
|sed -e 's/B0/\nB0/g' | sed -e 's/<b>/\n<b>/g' |sed -e 's/<hr/\n<hr/g' | grep 
"B0" | sed -e 's/\&/\& /g' | sed -e 's/\&#160//g' >$1.csv
sed -i 's/Accomodative/Accommodative/g' $1.csv
sed -i 's/Telecomunications/Telecommunications/g' $1.csv
sed -i 's/Administative/Administrative/g' $1.csv
gedit $1.csv
gawk -i inplace '$5=++i' FS=\; OFS=\; $1.csv
libreoffice $1.csv

The first sed line does most of the cleaning.
There are a bunch of   that seem to be randomly in file. The first  
one needs leave the ; or it throws off columns.  Then change <br/> to ; to 
be field separators; then convert all lines to a long single line by replace 
newlines with spaces. Then clean the spaces after ; to make it cleaner.
Then to make the lines separate, the first field of data lines start with a 
B0 (all but 3, that have the fields blank?). Other things I don't need start 
with <b> or <hr, so make those separate lines as well. 
Then use grep to only get the B0 lines. File has some & and &#160 
in it. For some reason, fields that have this are combined with the next 
field, so have to leave the ;??
The next sed lines just fix spelling errors I've found in file.
I then edit the file, since the 3 lines that don't have the B0 are added to 
the previous line, so I have to insert a newline, and add blank; to be that 
field. There is one other field on one record that is blank, so it also needs 
to be fixed.
Save the changes.
The gawk line has nother to do with conversion, but that column use to 
have the record numbers, but in this version of file, the first 200 or so do, 
but then the number jump to higher values, and even have records with 
same numbers, so I just replace the field with the number 1 to 252.
File is the imported to libreoffice calc with ; set as only delimeter.
Final fix is that 4 records have the field in column P as empty, so data is 
all shifted over by one, so just need to insert the blank cell and shift 
right. 

Don't know if there is an easier way, or why the other options didn't 
seem to work for me. Thanks sorry for the length.

+------------------------------------------------------------+
 Michael D. Setzer II - Computer Science Instructor (Retired)     
 mailto:mikes at guam.net                            
 mailto:msetzerii at gmail.com
 Guam - Where America's Day Begins                        
 G4L Disk Imaging Project maintainer 
 http://sourceforge.net/projects/g4l/
+------------------------------------------------------------+