[poppler] [PATCH] Experimental HTTP Streaming Support

Stefan Thomas thomas at eload24.com
Fri Oct 16 10:22:07 PDT 2009


Hey popplers and popplerettes,

I don't remember much from last night, except that when I woke up at 3pm 
today, I found this patch in my git log and tons of scribbled notes 
about a "HTTP block cache" on my desk.

So some words on what this is and what it isn't.

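It does stream any PDF file (linearized, non-linearized, object streams, 
whatever) from any HTTP 1.1 server that honors byte-range requests, i.e. 
one that answers a Range header with 206 and Content-Range. (Range 
support is optional in HTTP 1.1, though common servers such as Apache 
support it for static files.)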
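
For reference, a request for a single block boils down to one Range header. The sketch below is illustrative only: the helper name rangeHeaderForBlock and its parameters are made up here and are not taken from the patch. The only thing it relies on is the HTTP syntax "Range: bytes=first-last", where both offsets are inclusive.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): builds the Range header for
// block number `index`, given a fixed block size in bytes.
// HTTP byte ranges are inclusive on both ends, hence the "- 1".
std::string rangeHeaderForBlock(std::size_t index, std::size_t blockSize)
{
    const std::size_t first = index * blockSize;
    const std::size_t last = first + blockSize - 1;
    return "Range: bytes=" + std::to_string(first) + "-" + std::to_string(last);
}
```

A server that supports ranges should answer such a request with status 206 and a Content-Range header describing the bytes it actually sent.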

On localhost it is about 2-5% slower than reading directly from disk. 
Over gigabit Ethernet (1000BASE-T) it is about 10-15% slower than reading from the 
local disk. Over the Internet, depending on your connection it is 
unusably slow (around 20-30 seconds to render the first page, which 
means it's 1200% slower in my example).

It separates the PDF into 2KB blocks that it caches. As in: It won't 
download any given block twice. It doesn't care about memory usage at 
this point.
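
Roughly, the caching idea can be sketched like this. It is a minimal stand-in, not the code in CurlCache.cc: the class name BlockCache, the fetch callback and the fetchCount() counter are all assumptions made for the example, and the callback takes the place of the actual HTTP request.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Sketch only: names are illustrative, not the ones used in the patch.
class BlockCache
{
public:
    // fetch(index) returns the bytes of block `index`; in a real client
    // this would be one HTTP range request. An empty result means EOF.
    using Fetch = std::function<std::string(std::size_t)>;

    BlockCache(std::size_t blockSize, Fetch fetch)
        : blockSize_(blockSize), fetch_(std::move(fetch)) {}

    // Returns up to `len` bytes starting at `offset`, pulling in missing
    // blocks one at a time. Stops early at the end of the file.
    std::string read(std::size_t offset, std::size_t len)
    {
        std::string out;
        while (len > 0) {
            const std::size_t index = offset / blockSize_;
            const std::size_t within = offset % blockSize_;
            const std::string &block = getBlock(index);
            if (within >= block.size())
                break; // nothing left in this block: end of file
            const std::size_t n = std::min(len, block.size() - within);
            out.append(block, within, n);
            offset += n;
            len -= n;
        }
        return out;
    }

    // Number of blocks actually fetched so far.
    std::size_t fetchCount() const { return fetchCount_; }

private:
    // Fetches a block on first use and keeps it forever afterwards.
    const std::string &getBlock(std::size_t index)
    {
        auto it = blocks_.find(index);
        if (it == blocks_.end()) {
            ++fetchCount_;
            it = blocks_.emplace(index, fetch_(index)).first;
        }
        return it->second;
    }

    std::size_t blockSize_;
    Fetch fetch_;
    std::unordered_map<std::size_t, std::string> blocks_; // never evicted
    std::size_t fetchCount_ = 0;
};
```

Because blocks are never evicted, memory use grows with every distinct block touched, which matches the "doesn't care about memory usage" caveat above.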

It's dumb, meaning it will fetch one block at a time, resulting in over 
1000 requests for my test document (just to render page 1). That's 
actually surprisingly fast as long as the latency is low, which is 
probably why it slows down so badly over the net. I imagine that even 
just a quick call to precache a whole range, for example when loading an 
image, would make the Internet example something like 60-80% faster.
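
One possible shape for such a precache call, as a sketch: given the byte range about to be read and the set of blocks already cached, work out the contiguous runs of missing blocks, so that each run can go out as a single range request instead of one request per block. The function name missingRuns and its signature are hypothetical and do not exist in the patch.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper (not from the patch). Returns inclusive
// [firstBlock, lastBlock] runs of blocks that are not yet cached and that
// cover the bytes [offset, offset + len).
std::vector<std::pair<std::size_t, std::size_t>>
missingRuns(const std::set<std::size_t> &cached,
            std::size_t offset, std::size_t len, std::size_t blockSize)
{
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    if (len == 0)
        return runs;

    const std::size_t first = offset / blockSize;
    const std::size_t last = (offset + len - 1) / blockSize;

    for (std::size_t b = first; b <= last; ++b) {
        if (cached.count(b))
            continue; // already have it, this ends any current run
        if (!runs.empty() && runs.back().second + 1 == b)
            runs.back().second = b; // extend the current run
        else
            runs.emplace_back(b, b); // start a new run
    }
    return runs;
}
```

With 2KB blocks and an empty cache, a 1MB read covers 512 blocks but yields a single run, so it would cost one round trip rather than 512.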

It's definitely *NOT* meant for inclusion in mainline poppler, but it's 
fun to play with and an interesting start.

It does not change Poppler's outside API at all. All it does is allow 
you to provide filenames starting with http:// - I tested pdfinfo and 
pdftoppm, and both worked perfectly without any changes. 
(Alternatively, one could also create an HttpStream manually and pass 
that to PDFDoc's constructor, although I didn't test that method.)

All tests took place using a 35MB real-world PDF. (I work at a 
publishing company.) It is linearized, but I don't currently parse that 
information.

If you wanna play around with this, have fun! I'd definitely love to see 
anybody try to compile a version of Evince with this and stream PDFs off 
of their intranet. :)

I am very much looking for suggestions and ideas on how to improve 
performance. If you'd like to look at that aspect, please look at 
CurlCache.cc. I left most of the debugging statements in there; just 
activate them and you can watch as it's transferring the data. Running 
pdftoppm to render a single page works great as a fairly realistic test 
case.

Cheers,

Stefan Thomas
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-Added-libcurl-to-build-system.patch
Url: http://lists.freedesktop.org/archives/poppler/attachments/20091016/7371d24d/attachment.ksh 

