[poppler] something like an "image_list" API for cpp frontend

Fri Oct 5 21:01:40 UTC 2018

On Monday, 2 April 2018, at 10:22:51 CEST, suzuki toshiya wrote:
> Hi,

Hi 6 months later :/

> 
> I'm now thinking about the possibility of adding an "image_list"
> API, similar to the text_list API of the cpp frontend,
> that returns a list of structures, each holding the rectangle
> and a pointer to the image data stream.
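
For readers following the thread, here is a minimal sketch of the shape such a list might take. Every name here (rect, image_stream, image_box, image_list_t, count_distinct_streams) is hypothetical and invented for illustration; none of it is existing poppler-cpp API. The one assumption it makes is that a single image stream can be drawn more than once, so the data pointer is shared.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

// Hypothetical types for illustration only; not part of poppler-cpp.
struct rect {
    double x, y, width, height;
};

struct image_stream {
    std::vector<unsigned char> bytes;  // encoded image data as stored in the PDF
};

struct image_box {
    rect bbox;                                 // where the image is drawn on the page
    std::shared_ptr<const image_stream> data;  // shared: one stream may be drawn many times
};

using image_list_t = std::vector<image_box>;

// Count how many different image streams the list refers to,
// regardless of how many times each one is drawn.
std::size_t count_distinct_streams(const image_list_t &images) {
    std::set<const image_stream *> seen;
    for (const image_box &box : images) {
        seen.insert(box.data.get());
    }
    return seen.size();
}
```

With this shape, a caller can tell "three images drawn" apart from "one image drawn three times", which is the distinction that tiled drawing makes awkward.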
> 
> The easiest approach would be to incorporate ImageOutputDev
> into the cpp frontend. However, there is a known issue in
> ImageOutputDev: images drawn by tiling operations are
> not counted.
> 
> https://bugs.freedesktop.org/show_bug.cgi?id=91734
> 
> I should emphasize that this is not such a marginal case. When I
> make a PDF from an HTML page with many small images, via Firefox
> on GNU/Linux, the resulting PDF often draws the images with
> the tiling operation, although the images are never repeated X-o.

You mean the images are in the pdf as a tile repeat of 1?

> I'm not sure whether the fix in the above Bugzilla entry is right or
> not (it seems that nobody has reviewed the quick-fix patch), but
> this fix only makes it possible to list the images (with their
> original metrics) and extract the image data; the metrics in the
> drawn result are not available. So it is not a perfect solution
> for the "image_list" API discussion.
> 
> There may be a rationale for the original author writing
> such a simple patch. The tiling operations are executed
> as follows:
> 
> 1) create a new output (e.g. splash bitmap, cairo surface,
> etc.) to draw a single image as a pattern
> 
> 2) transfer the drawn image to the original output
> 
> To calculate the positions & metrics in the resulting image,
> the chain of temporary outputs has to be kept.
> 
> The difficulties in handling images drawn by tiling would
> be:
> 
> * it is not easy to count how many times the image is
> repeated.
> 
> * to obtain the position & metrics, the chain of tiling
> operations has to be preserved. We cannot assume that the
> rendering of the image for the tile does not invoke yet
> another tiling operation.
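
To make "preserving the chain" concrete, here is a minimal sketch under stated assumptions. It treats each level of the chain as a PDF-style affine matrix [a b c d e f] and composes them from the innermost output to the page. The helper names (apply, then, flatten, scale, translate) are hypothetical, and this is not how poppler's OutputDev classes are structured; it only illustrates that the final position of an image depends on every level it was drawn through, including the offset of the particular tile.

```cpp
#include <array>
#include <vector>

// PDF-style affine matrix [a b c d e f]:
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
using matrix = std::array<double, 6>;

struct point {
    double x, y;
};

point apply(const matrix &m, point p) {
    return {m[0] * p.x + m[2] * p.y + m[4],
            m[1] * p.x + m[3] * p.y + m[5]};
}

// Matrix equivalent to applying `first`, then `second`.
matrix then(const matrix &first, const matrix &second) {
    return {first[0] * second[0] + first[1] * second[2],
            first[0] * second[1] + first[1] * second[3],
            first[2] * second[0] + first[3] * second[2],
            first[2] * second[1] + first[3] * second[3],
            first[4] * second[0] + first[5] * second[2] + second[4],
            first[4] * second[1] + first[5] * second[3] + second[5]};
}

matrix scale(double sx, double sy) { return {sx, 0, 0, sy, 0, 0}; }
matrix translate(double tx, double ty) { return {1, 0, 0, 1, tx, ty}; }

// Collapse a chain of transforms, ordered innermost (image space)
// to outermost (page space), into a single matrix.
matrix flatten(const std::vector<matrix> &chain) {
    matrix result = {1, 0, 0, 1, 0, 0};  // identity
    for (const matrix &m : chain) {
        result = then(result, m);
    }
    return result;
}
```

For example, an image scaled to 10x20 inside a pattern cell, drawn as tile (2, 1) with steps of 30 and 40, with the pattern placed at (100, 200), has the chain { scale(10, 20), translate(60, 40), translate(100, 200) }. If the pattern cell itself is drawn through another tiling operation, that level adds further entries to the chain, which is why the chain cannot be assumed to have a fixed depth.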
> 
> As an alternative, one possibility would be
> parsing the SVG (or XML, or CairoScript) generated by
> CairoOutputDev. It seems that the SVG generated by Cairo has
> a flat structure (no grouped coordinate transforms), so all
> position & metric information could be retrieved from
> the neighboring XML elements.
> 
> However, there are 3 concerns.
> 
> --
> 
> a) nobody guarantees the forward compatibility of the
> flat structure of the SVG (or CairoScript, XML surface).
> 
> b) poppler has no dependency on any XML parsing library,
> except in the case where fontconfig depends on libexpat.
> 
> c) tiling onto an SVG or XML surface can cause some
> rasterization.
> 
> When I convert the pattern-tiling example at
> 
> https://developer.mozilla.org/en-US/docs/Web/SVG/Tutorial/Patterns
> 
> to PDF with librsvg, the result includes no raster data
> (pattern.pdf.xz), but if I convert it back from PDF to SVG
> with pdftocairo (pattern.re.svgz), the result includes
> raster data X-o.
> 
> Therefore, there is a possibility that nonexistent images
> are counted with this method.
> 
> --
> 
> So, what is the right way? 

I'd say keep ignoring tiles for the time being, and if you find lots of cases where a tile is "wrongly" used, ask the people that generate it to "fix" the pdf, since obviously it's not what they wanted.

> if it is not the time to put "image_list" into cpp frontend

It is OK; actually, I know someone else who wanted to do that.

> , is it acceptable to add similar features to pdftimage or pdftocairo?

pdftoppm and pdftocairo have a different purpose, they just render a given page, what would you do with tiled images for them?

Cheers,
  Albert

> 
> Regards,
> mpsuzuki
>