[Poppler-bugs] [Bug 28282] pdftohtml is unable to extract the text in some PDF files
bugzilla-daemon at freedesktop.org
Tue Apr 3 15:25:38 PDT 2012
https://bugs.freedesktop.org/show_bug.cgi?id=28282
skierpage <info at skierpage.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |info at skierpage.com
--- Comment #1 from skierpage <info at skierpage.com> 2012-04-03 15:25:38 PDT ---
I reproduced this with pdftohtml version 0.18.4 from Kubuntu 12.04 beta amd64.
pdftohtml extracts the text overlaying the header image on every page:
GOBIERNO
de
CANTABRIA
B O L E T Í N O F I C I A L D E C A N TA B R I A
but the rest of the page text (e.g. "sumario 1. DISPOSICIONES GENERALES") is
missing, and it can't be copied as text in Okular either. (BTW, turning on all
Okular flags in kdebugsettings doesn't seem to output any relevant warnings.)
There are a few exceptions, like the table starting on page 43, where the
column heading text (EXPEDIENTE SANCIONADO/A ...) appears OK but the column
entries are garbled text like
6$081$6+9,/, =85$%
6$081$6+9,/, =85$%
;
<
;
<
%(1,'250
%(1,'250
Then on page 115 most of the text, starting at "ANEXO", appears.
etc.
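The garbled column entries above look like a constant shift rather than random
noise: adding 29 to each character code turns "%(1,'250" into "BENIDORM", and
";" and "<" into "X" and "Y". That would be consistent with glyph indices being
emitted in place of character codes (in many TrueType fonts the space glyph is
index 3, i.e. 32 - 29), but this is only a guess from the samples, not something
confirmed against the PDF or poppler's code. A minimal Python sketch of the
check, where the function name and the shift constant are my own assumptions:

```python
# Assumption: each garbled character is the intended character's code minus 29.
# The value 29 is inferred from the samples in this comment, not from poppler.
SHIFT = 29


def unshift(garbled: str, shift: int = SHIFT) -> str:
    """Add `shift` to every character code, leaving spaces untouched."""
    return "".join(
        ch if ch == " " else chr(ord(ch) + shift)
        for ch in garbled
    )


if __name__ == "__main__":
    # Sample copied from the garbled table entries on page 43.
    print(unshift("%(1,'250"))  # BENIDORM
```

If the same shift holds for the rest of the table, the text is recoverable in
principle, and the missing piece is probably a usable character mapping for that
font.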
Also, I noticed that the image on page 147 turns into 1,347 1-pixel-high PNGs,
but pdftohtml doesn't force them to stack, e.g. using <div style="line-height:
1px; font-size: 1px;">, so there's whitespace between each scanline. There's
crazy stuff in them pdfs ;-)
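To make the stacking suggestion concrete, here is a rough Python sketch of the
kind of wrapper I mean. It is not what pdftohtml emits today; the function name
and the image file names are hypothetical, and I have not checked how every
browser lays this out.

```python
from html import escape


def stack_scanlines(image_names):
    """Wrap 1-pixel-high image slices in a div that suppresses line spacing.

    The style attribute is the one suggested in this comment; the file names
    passed in are placeholders, not pdftohtml's real naming scheme.
    """
    rows = "\n".join(
        f'<img src="{escape(name, quote=True)}"><br>'
        for name in image_names
    )
    return f'<div style="line-height: 1px; font-size: 1px;">\n{rows}\n</div>'


if __name__ == "__main__":
    # Hypothetical slice names for the image on page 147.
    names = [f"page147-{i}.png" for i in range(1, 4)]
    print(stack_scanlines(names))
```

The intent is that each scanline image sits on its own 1px-high line, so the
slices render as one continuous picture.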