[poppler] patch for a progress callback

lists@ds.com lists at damianstewart.com
Mon Jan 7 02:39:17 PST 2013


On 06 Jan 2013, at 21:01, poppler-request at lists.freedesktop.org wrote:
> From: Albert Astals Cid <aacid at kde.org>
> Subject: Re: [poppler] patch for a progress callback
> 
> On Friday, 4 January 2013, at 09:16:25, Damian Stewart wrote:
> Some comments:
> * The cpp facing api seems quite "un-C++-ish", i think that a class with a 
> pure virtual progressDone() function and a Page::setProgressHandler method 
> would make it more C++ish

In the core Poppler source, the existing callbacks follow the same pattern:

void display(OutputDev *out, double hDPI, double vDPI,
             ...
             GBool (*abortCheckCbk)(void *data) = NULL,
             void *abortCheckCbkData = NULL,
             GBool (*annotDisplayDecideCbk)(Annot *annot, void *user_data) = NULL,
             void *annotDisplayDecideCbkData = NULL,

With this precedent, I would have assumed

             void (*progressCbk)(int pageNum, float progressPct, void *user_data) = NULL,
             void *progressCbkData = NULL);

makes the most sense... or?
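For what it's worth, the client side would then look something like this. The callback below compiles as-is; the call site is only shown in a comment with the middle arguments elided, mirroring the snippet above, and onProgress is just an example name, not anything from the patch:

    #include <cstdio>

    // example client callback: print a crude percentage to stderr
    static void onProgress(int pageNum, float progressPct, void *user_data)
    {
        (void)user_data;
        fprintf(stderr, "page %d: %5.1f%%\r", pageNum, progressPct * 100.0f);
    }

    // ...and later, passed as the two hypothetical trailing arguments:
    //   page->displaySlice(out, hDPI, vDPI, ..., onProgress, NULL);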


> * How useful is this? I.e. i predict that there are operations that are 
> really fast and others really slow, does it really help?

With UX improvements, 'useful' is often not the right adjective; try 'less frustrating'.

Yes, some operations are slow and some are fast, but having a progress indicator that gradually increases from 0 to 1 over 2 hours is much, much less frustrating than a blinking cursor for the same 2 hours, especially if you don't know up front whether it's going to take 1 minute, 20 minutes or 2 hours.


> * The DummyOutputDev should take the real outputdev and return the values of 
> the real outputdev for the "querying" functions

Good point, yes, it should probably do that.
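Something along these lines, I suppose -- a counting device that wraps the real one and forwards the query functions to it. The class and member names here are just illustrative, not the exact patch code:

    #include "OutputDev.h"

    // counts operations while delegating the "querying" functions to the real
    // output device, so the counting pass sees the same operations as the real pass
    class CountingOutputDev: public OutputDev {
    public:
        CountingOutputDev(OutputDev *realOutA): realOut(realOutA), nOps(0) {}

        virtual GBool upsideDown() { return realOut->upsideDown(); }
        virtual GBool useDrawChar() { return realOut->useDrawChar(); }
        virtual GBool interpretType3Chars() { return realOut->interpretType3Chars(); }
        virtual GBool useTilingPatternFill() { return realOut->useTilingPatternFill(); }
        virtual GBool useShadedFills(int type) { return realOut->useShadedFills(type); }

        // ...the drawing functions stay as no-ops that just bump the counter, e.g.:
        virtual void stroke(GfxState *) { nOps++; }

        int operationCount() { return nOps; }

    private:
        OutputDev *realOut;
        int nOps;
    };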


> * Have you measured the performance hit?

I have not, however:
0. Any progress indicator naturally involves some reduction in 'performance' (because of GUI redraw calls, mutex grab/release, etc.). However, from a UX perspective, 10 seconds with no progress indicator is far more frustrating for the user than 10.1 seconds with one.
1. If progressCbk==NULL in Page::displaySlice then there is no performance hit, as the progress code is never called (see the sketch after this list).
2. If the page has few operations then the cost of pre-calculating the operation count will be minuscule.
3. If the page has many operations then the time to perform those operations 'for real' will dwarf the time to count them, probably by several orders of magnitude.
4. The per-operation cost of calling progressCbk depends entirely on the third party's implementation of the callback; i.e. it's not Poppler's responsibility.
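To illustrate point 1, the shape of the guard is simply this (names hypothetical, not the literal patch code): with progressCbk == NULL the per-operation cost is one pointer comparison and nothing else.

    void executeOps(int pageNum, int totalOps,
                    void (*progressCbk)(int, float, void *), void *progressCbkData)
    {
        for (int i = 0; i < totalOps; i++) {
            // ...execute the i-th content stream operation...
            if (progressCbk)
                progressCbk(pageNum, (i + 1) / (float)totalOps, progressCbkData);
        }
    }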




On 06/01/13 18:35, Adrian Johnson <ajohnson at redneon.com> wrote:

> This is the biggest problem I see with this patch. There are PDFs with
> one huge image. In this case the progress will sit on 0% for a long time
> then jump 100% when rendering is finished.

Granted. However, it's unlikely that a PDF with embedded images will be rendered at 300DPI, as that would mean the embedded images would also need to be 300DPI; and if someone is emailing you a PDF with 300DPI images, you can email them back and say 'please just send me the images'.

Also: Have you seen how other progress indicators out there function? This is a universal issue with progress indicators.

> It also doesn't take into
> account XObjects and patterns which may also take a long time to render
> with no progress available. There are PDFs with a single XObject painted
> in the content stream. All the work is done in the XObject.
> 
> If we merge this feature what do we do when users start filing bugs
> about these cases that don't provide useful progress feedback?

I don't know. I'm also not sure what you're saying. So no-one should have a progress bar, because the maintainers don't want to deal with users complaining about the edge cases?

>> * The DummyOutputDev should take the real outputdev and return the values of 
>> the real outputdev for the "querying" functions, e.g. your dummy outpudev 
>> returns false for useTilingPatternFill, but SplashOutputDev returns true, this 
>> probably means that you never get to 100% on files that have patterns, no?
>> * Have you measured the performance hit?
> 
> I would also like to know the performance cost of rendering to
> DummyOutputDev.

This point was addressed above.

> Have you got a sample PDF we can test with? It would be interesting to
> see if there is anything that can be done to speed up the rendering.

https://dl.dropbox.com/u/16106653/pdfToImage-progress-example.zip

Note: the PDFs in this zip will probably take a very, very long time to render with standard desktop PDF viewers. The first one 2012_12_28_13_6_22.pdf (generated by Cairo) is reasonably well behaved. The second one 2012_12_28_13_6_22_0,5stroke-1.pdf (generated by loading 2012_12_28_13_6_22.pdf into Adobe Illustrator and modifying the stroke width for all of the lines) seems to have multiple embedded streams, but is typical of the documents I was working with in making this patch.

After compiling pdfToImage-progress.cpp, try:
 ./a.out -i 2012_12_28_13_6_22.pdf -o test.png --width 12000 
(12000 pixels is the image size necessary to produce a ~100cm wide print at 300DPI)
(This is a direct output from an openFrameworks application that draws several thousand line segments every frame, running at 60fps for a minute or two)

For a poor performance example, try
 ./a.out -i 2012_12_28_13_6_22_0,5stroke-1.pdf -o test.png --width 3508
(3508 pixels wide is a DIN-A4 print at 300DPI)
(This is generated from the same data file as above, after loading it into Adobe Illustrator, selecting all the lines and setting their stroke width to 0.5 -- yes, there are better ways to achieve this than via a GUI application, but this is the use case I am working from.)


> I don't think calling the progress callback on every operation is
> efficient or necessary. It should be sufficient to call the callback
> every n operations. I suggest making n = TotalOperations/100.

Try the 2012_12_28_13_6_22_0,5stroke-1.pdf example at 300DPI, as in the paragraph above. With this one, some single operations can take several seconds on my (2GHz Intel) machine. If n = TotalOperations/100, the progress bar would stall on this file for tens of seconds to a minute, which is annoying. IMO the implementer of the callback function should be able to decide how frequently they want to respond to events.
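For example, a client that only wants an update every ~100ms can throttle itself; this is just a sketch using standard C++ and the callback signature proposed above:

    #include <cstdio>
    #include <ctime>

    // client-side throttling: report at most ~10 updates per second, regardless
    // of how many (or how few) operations the page turns out to contain
    static void throttledProgress(int pageNum, float progressPct, void *user_data)
    {
        (void)user_data;
        static clock_t last = 0;
        clock_t now = clock();
        if (last == 0 || now - last > CLOCKS_PER_SEC / 10 || progressPct >= 1.0f) {
            last = now;
            fprintf(stderr, "page %d: %5.1f%%\r", pageNum, progressPct * 100.0f);
        }
    }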

> Instead of parsing the content twice to get the operation count you
> could use the current position in the content stream to report progress.
> You would have to use the compressed stream position
> (getBaseStream()->getLength()) since the uncompressed length of a stream
> is not stored in the pdf file.

I initially attempted that. With the 2012_12_28_13_6_22_0,5stroke-1.pdf from the zip linked above, the stream position pointer keeps jumping back to zero, so it seems there are multiple streams embedded in the PDF. That means you'd need a preprocessing step that loops through the PDF adding up all the stream lengths, and only then renders for real... which is exactly what I've done with this patch, except counting operations rather than stream lengths.

Cheers
Damian

--
damian stewart . interdisciplinary creative engineer . vienna, austria
http://damianstewart.com . twitter @damian0815





