[Poppler-bugs] [Bug 28282] pdftohtml is unable to extract the text in some PDF files

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Apr 3 15:25:38 PDT 2012


https://bugs.freedesktop.org/show_bug.cgi?id=28282

skierpage <info at skierpage.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |info at skierpage.com

--- Comment #1 from skierpage <info at skierpage.com> 2012-04-03 15:25:38 PDT ---
I reproduced this with pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64.

pdftohtml extracts the text overlaying the header image on every page:
  GOBIERNO
  de
  CANTABRIA
  B O L E T Í N O F I C I A L D E C A N TA B R I A
but the rest of the page text (e.g. "sumario 1. DISPOCIONES GENERALES") is
missing, and you can't copy it as text in Okular either. (BTW, turning on all
Okular flags in kdebugsettings doesn't seem to output any relevant warnings.)

There are a few exceptions, like the table starting on page 43, where the
column heading text (EXPEDIENTE SANCIONADO/A ...) appears OK but the column
entries come out as garbled text like:
6$081$6+9,/,  =85$%
6$081$6+9,/,  =85$%
;

<
;

<
%(1,'250
%(1,'250

Then on page 115 most of the text, starting at "ANEXO", appears, and so on.
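
A note on the garbled entries above, offered as an inference from the samples
and not checked against the PDF's fonts or Poppler's code: at least one of
them reads as plain text if every character code is raised by a constant 29.
That pattern is what you would expect if raw glyph indices from a subset font
without a usable ToUnicode mapping were emitted as characters. The offset and
the helper name unshift below are assumptions made for illustration only.

```python
# Assumption: each garbled character is the intended character's code minus 29.
# The value 29 was inferred from the samples in this report, nothing more.
OFFSET = 29


def unshift(garbled: str, offset: int = OFFSET) -> str:
    """Raise every non-space character code by a fixed offset."""
    return "".join(c if c == " " else chr(ord(c) + offset) for c in garbled)


# One of the garbled column entries quoted from page 43.
print(unshift("%(1,'250"))  # prints BENIDORM
```

If the same offset holds for the other entries, the text is present in the
file and only the mapping from glyph codes to Unicode is missing.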

Also, I noticed that the image on page 147 turns into 1,347 1-pixel-high PNGs,
but pdftohtml doesn't force them to stack, e.g. by using <div style="line-height:
1px; font-size: 1px;">, so there's whitespace between each scanline.  There's
crazy stuff in them pdfs ;-)
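
The stacking idea can be sketched as a small post-processing step. This is only
an illustration: the function name stack_scanlines, the file names, and the
<img> plus <br> layout are assumptions, and pdftohtml's real output may differ.
Whether the wrapper fully removes the gaps also depends on the browser.

```python
from html import escape


def stack_scanlines(image_files):
    """Wrap scanline images in a div with 1px line-height and font-size."""
    imgs = "".join(
        f'<img src="{escape(name, quote=True)}"><br>\n' for name in image_files
    )
    return f'<div style="line-height: 1px; font-size: 1px;">\n{imgs}</div>'


# Hypothetical file names standing in for the per-scanline PNGs.
print(stack_scanlines([f"page147-{i}.png" for i in range(1, 4)]))
```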


