[poppler] Issue with converting pdf files to text to then import to spreadsheet.

Mon Sep 14 14:14:18 UTC 2020

Thanks for the quick reply.
I manually add the headers to my spreadsheet and didn't include the sub 
and final totals, since I could always recreate them once the data is in the 
spreadsheet.

I did just try creating a libreoffice pdf file from the spreadsheet, and 
unfortunately see the same result with the extrations.

The libreoffice pdf file does open in calc with no problem with the 
hybrid format, while the original pdf file opens in libreoffice draw..

Have sent email to head of HR department, will see what I get back..
Thanks again.

On 14 Sep 2020 at 12:23, Ross Moore wrote:

From:	Ross Moore <ross.moore at mq.edu.au>
To:	"Michael D. Setzer II" <mikes at guam.net>
Copies to:	"poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject:	Re: [poppler] Issue with converting pdf files to text to then import
	to spreadsheet.
Date sent:	Mon, 14 Sep 2020 12:23:15 +0000

> Hello Michael,
> 
>     On 14 Sep 2020, at 9:17 pm, Michael D. Setzer II <mikes at guam.net> 
>     wrote:
> 
>     This is long, but thought I would ask. Part of the problem is with the 
>     PDF file that I have no control over how it is created, but perhaps I'm 
>     missing something simple. Have a solution that works about 95%, but 
>     does need manual tweaking. Thanks for the time. Just joined list, don't 
>     know how active it is, but thought I would ask. Just sitting at home with 
>     no much to do.. Be Safe.
>     
>     Have been pulling a staffing pattern pdf file for some time from my 
>     college. The data seems to be converted from an excell spreadsheet, but 
>     the data can not be copied directly??
>     
>     https://guamcc.edu/sites/default/files/2020_07_31_staffing_pattern_final
>     .
>     pdf
>     
>     Is the latest version of file.
>     If I copy and paste from Firefox to libreoffice calc, it puts everything in 
>     cell A1 with no delimeters between fields?
> 
> 
> There are times when you just need to use the proper software.
> Attached is the result from Exporting to Excel, using Acrobat Pro DC.
> (263 lines in the resulting spreadsheet table.
>  incl. 2 header rows + 8 subhead rows + 1 “total” at the bottom)
> 
> 
> 
> 
> 
> Note that Ghostscript v9.06 was used to create the PDF.
> But Adobe’s software remains the gold standard for *extraction* of content from PDFs,
> especially those with consistent recognisable structure.
> (This one is *not* Tagged PDF, as seen in the image above.)
> 
>     
>     Same thing if I paste to a text editor (gedit).
>     
>     Using google chrome get the same if I paste to spreadsheet, but pasting 
>     to the text editor it shows the data line by line, but no delimeter between 
>     fields.
>     
>     Example line of data
>     B00010286 Guam Community College FA 00 2 PRE005 1010 Office of 
>     the President President Okada, Mary A. 06/16/07 R-16-a 01/01/21 
>     185,378 48,717 0 2,688 186 4,273 1,301 26 57,165 242,543
>     
>     Some fields are single items, while other can have variable number of 
>     words. Have used a complex sed script to add ; delimeters, and correctly 
>     distinguish fields. 
>     
>     Saving the pdf file to disk, and trying utilities.
>     using pdftotext it creates a txt file, but it is rather strange in the output.
>     Seems to show two columns of first page, then it shows the next 
>     column, 
>     and then mixed number of columns so is useless. File 5687 lines long, 
>     but main data should be 252 lines plus headers..
>     
>     pdf2txt.py-2.7 does the same but is 5527 lines long?
>     
>     Have much better luck using pdftohtml.
>     With the following script it converts it to an html file that has all the 
>     fields listed basically one by one. Have to go thru a number of steps to 
>     get it fixed. Has some errors with empty fields that require manual 
>     corrections.
>     
>     pdftohtml -nomerge -noframes $1.pdf
>     sed -e 's/;<br/<br/g' $1.html |sed -e 's_<br/>_;_g' |tr '\n' ' ' | sed -e 's/; /;/g' 
>     |sed -e 's/B0/\nB0/g' | sed -e 's/<b>/\n<b>/g' |sed -e 's/<hr/\n<hr/g' | grep 
>     "B0" | sed -e 's/\&/\& /g' | sed -e 's/\&#160//g' >$1.csv
>     sed -i 's/Accomodative/Accommodative/g' $1.csv
>     sed -i 's/Telecomunications/Telecommunications/g' $1.csv
>     sed -i 's/Administative/Administrative/g' $1.csv
>     gedit $1.csv
>     gawk -i inplace '$5=++i' FS=\; OFS=\; $1.csv
>     libreoffice $1.csv
>     
>     The first sed line does most of the cleaning.
>     There are a bunch of   that seem to be randomly in file. The first 
>     one needs leave the ; or it throws off columns. Then change <br/> to ; to 
>     be field separators; then convert all lines to a long single line by replace 
>     newlines with spaces. Then clean the spaces after ; to make it cleaner.
>     Then to make the lines separate, the first field of data lines start with a 
>     B0 (all but 3, that have the fields blank?). Other things I don't need start 
>     with <b> or <hr, so make those separate lines as well. 
>     Then use grep to only get the B0 lines. File has some & and &#160 
>     in it. For some reason, fields that have this are combined with the next 
>     field, so have to leave the ;??
>     The next sed lines just fix spelling errors I've found in file.
>     I then edit the file, since the 3 lines that don't have the B0 are added to 
>     the previous line, so I have to insert a newline, and add blank; to be that 
>     field. There is one other field on one record that is blank, so it also 
>     needs 
>     to be fixed.
>     Save the changes.
>     The gawk line has nother to do with conversion, but that column use to 
>     have the record numbers, but in this version of file, the first 200 or so 
>     do, 
>     but then the number jump to higher values, and even have records with 
>     same numbers, so I just replace the field with the number 1 to 252.
>     File is the imported to libreoffice calc with ; set as only delimeter.
>     Final fix is that 4 records have the field in column P as empty, so data is 
>     all shifted over by one, so just need to insert the blank cell and shift 
>     right. 
>     
>     Don't know if there is an easier way, or why the other options didn't 
>     seem to work for me. Thanks sorry for the length.
> 
> 
> Acrobat does it in an instant.
> Sorry it is not Open Source — but neither are a lot of other things in life.
> 
> 
>     
>     +------------------------------------------------------------+
>     Michael D. Setzer II - Computer Science Instructor (Retired) 
>     mailto:mikes at guam.net 
>     mailto:msetzerii at gmail.com
>     Guam - Where America's Day Begins 
>     G4L Disk Imaging Project maintainer 
>     http://sourceforge.net/projects/g4l/
>     +------------------------------------------------------------+
>     
>     
>     
>     _______________________________________________
>     poppler mailing list
>     poppler at lists.freedesktop.org
>     https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> 
> Hope this helps. 
> 
> Ross
> 
> 
> Dr Ross Moore
> Department of Mathematics and Statistics
> 12 Wally’s Walk, Level 7, Room 734
> Macquarie University, NSW 2109,Australia
> T:+61 2 98508955 | F:+61 2 98508114
> M:+61 407 288 255 | E:ross.moore at mq.edu.au
> http://www.maths.mq.edu.au
> 
> 
> CRICOS Provider Number 00002J. Think before youprint.
> Please consider the environment before printing thisemail.
> 
> This message is intended for the addressee namedand may
> contain confidential information. If you are not theintended
> recipient, please delete it and notify the sender. Viewsexpressed
> in this message are those of the individual sender, andare not
> necessarily the views of Macquarie University. 
> 
> 

+------------------------------------------------------------+
 Michael D. Setzer II - Computer Science Instructor (Retired)     
 mailto:mikes at guam.net                            
 mailto:msetzerii at gmail.com
 Guam - Where America's Day Begins                        
 G4L Disk Imaging Project maintainer 
 http://sourceforge.net/projects/g4l/
+------------------------------------------------------------+