[poppler] Loading documents from GInputStream?

Mon May 12 05:12:13 PDT 2008

On May 12, 2008, at 7:14 AM, Tommi Komulainen wrote:
>> Then there are those "web optimized" PDF files that have multiple XRefs
>> that form independent parts inside the PDF file, so you can load a part
>> of it as soon as you find the first XRef. poppler does not have any sort
>> of intelligent algorithm to work with partially downloaded streams.
>
> Ah, evil. Forgive my ignorance about PDF format, but I take it that
> you really need the XRef to be able to display anything? It's not like
> you'd only lose images or so?
>

	Correct.  The XRef is the catalog of where all the objects live.
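
To make that concrete: in a conventional (non-linearized) PDF, the pointer to the XRef sits at the very end of the file, after the `startxref` keyword, so nothing can be located until the last bytes have arrived. A minimal Python sketch of that lookup follows; `find_startxref` is a hypothetical helper for illustration, not a poppler function, and it assumes the whole file is already in memory.

```python
import re

def find_startxref(pdf_bytes: bytes) -> int:
    """Return the XRef byte offset recorded at the tail of a PDF.

    Hypothetical helper, not part of poppler. Only the last 1024 bytes
    are searched, since the trailer is expected at the end of the file.
    """
    tail = pdf_bytes[-1024:]
    # A file with incremental updates may carry several trailers;
    # the last one is the one a reader starts from.
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found; file may be truncated")
    return int(matches[-1])
```

On a partially downloaded file the tail is simply not there yet, so the lookup fails and no object can be resolved.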


> And I'd guess there's also no way of telling beforehand whether a file
> is 'web optimized' or not?
>
	Only by looking at the first 1024 bytes or so - that will identify  
if there is a "linearization table" present or not.
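
A rough sketch of that check in Python, assuming the first chunk of the download is already in hand. `looks_linearized` is a hypothetical name, and the substring test is a heuristic rather than a real parse of the first object's dictionary.

```python
def looks_linearized(head: bytes) -> bool:
    """Heuristic: does the start of the file mention /Linearized?

    Hypothetical helper, not part of poppler. Only the first 1024 bytes
    are inspected, and the dictionary itself is not parsed or validated.
    """
    return b"/Linearized" in head[:1024]
```

A real reader would go on to parse the dictionary and sanity-check its entries before trusting it.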


>> I think it would be a nice addition to have, just not sure what kind of
>> API we'd need.
>
> Given the need for random access I guess you'd need to store the whole
> file in memory anyway. And the getChar() implementation could just
> block reading the stream when necessary. Would be simple, but far from
> optimal.
>
	You could do that, but that would sort of remove the whole point, yes?
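
For reference, the blocking approach being discussed could look roughly like the Python sketch below: an in-memory buffer that grows as data arrives, with a reader that waits whenever it asks for a byte that has not been downloaded yet. `GrowingBuffer`, `feed`, `close` and `get_char` are hypothetical names chosen for illustration; this is not how poppler's streams are implemented.

```python
import threading

class GrowingBuffer:
    """Whole-file buffer filled by a downloader, read with blocking.

    Hypothetical sketch, not poppler code.
    """

    def __init__(self) -> None:
        self._data = bytearray()
        self._closed = False
        self._cond = threading.Condition()

    def feed(self, chunk: bytes) -> None:
        """Called by the downloader when more bytes arrive."""
        with self._cond:
            self._data.extend(chunk)
            self._cond.notify_all()

    def close(self) -> None:
        """Called by the downloader when the stream has ended."""
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def get_char(self, pos: int) -> int:
        """Return the byte at pos, blocking until it is available.

        Returns -1 if the stream ended before reaching pos.
        """
        with self._cond:
            while pos >= len(self._data) and not self._closed:
                self._cond.wait()
            return self._data[pos] if pos < len(self._data) else -1
```

Note that a reader asking for the XRef at the end of a non-linearized file would block until the entire download finishes, which is the "removes the whole point" problem.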


> An alternative to blocking on the stream could signal 'try again
> later' but I'm not sure what poppler could do in such case. Skip to
> processing some other part of the file?
>
	Potentially - depending on what is going on.  For example, perhaps  
there are three images on a page and one is large and the other two  
are small - you could fork off other threads to read the smaller ones  
at the same time the larger is being read (esp. if they've already  
been downloaded).  But this and many other such techniques would  
require MAJOR refactoring of the poppler/Xpdf architecture...
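
As a toy illustration of the "process what has already arrived" idea, the Python sketch below splits a page's objects into those whose bytes are fully downloaded and those that must wait. Everything here is hypothetical (`ObjectSpan`, `partition_by_availability`, the byte offsets), and it assumes the download arrives as one contiguous prefix of the file; it says nothing about how poppler would actually have to be restructured.

```python
from typing import List, NamedTuple, Tuple

class ObjectSpan(NamedTuple):
    """Byte range of one object in the file (hypothetical layout)."""
    name: str
    start: int
    end: int  # exclusive

def partition_by_availability(
    spans: List[ObjectSpan], bytes_available: int
) -> Tuple[List[ObjectSpan], List[ObjectSpan]]:
    """Split spans into (ready, deferred) given a downloaded prefix length."""
    ready = [s for s in spans if s.end <= bytes_available]
    deferred = [s for s in spans if s.end > bytes_available]
    return ready, deferred

# Two small images precede one large image in the file.
page_images = [
    ObjectSpan("small_a", 100, 400),
    ObjectSpan("small_b", 400, 900),
    ObjectSpan("large", 900, 90_000),
]
ready, deferred = partition_by_availability(page_images, bytes_available=5_000)
```

With 5,000 bytes downloaded, the two small images can be decoded right away while the large one stays deferred until more data arrives.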


Leonard