[poppler] [PATCH] Experimental HTTP Streaming Support
thomas at eload24.com
Fri Oct 16 10:22:07 PDT 2009
Hey popplers and popplerettes,
I don't remember much from last night, except that when I woke up at 3pm
today, I found this patch in my git log and tons of scribbled notes
about an "HTTP block cache" on my desk.
So some words on what this is and what it isn't.
It does stream any PDF file (linearized, non-linearized, object streams,
whatever) from any HTTP 1.1 server that supports byte-range requests
(Range / Content-Range). (Strictly speaking, range support is optional
in HTTP 1.1, but common web servers provide it for static files.)
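To make the range mechanism concrete, here is a minimal sketch of the request header for one fixed-size block. The helper name `rangeHeaderForBlock` is hypothetical and is not taken from the patch; the 2048-byte default simply mirrors the 2KB block size described in this mail.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the patch): build the Range header that
// asks for exactly one block. HTTP byte ranges are inclusive on both ends,
// so block 0 with a 2048-byte block size is "bytes=0-2047".
std::string rangeHeaderForBlock(std::size_t blockIndex, std::size_t blockSize = 2048)
{
    const std::size_t first = blockIndex * blockSize;
    const std::size_t last = first + blockSize - 1;
    return "Range: bytes=" + std::to_string(first) + "-" + std::to_string(last);
}
```

A server that honors the header answers with 206 Partial Content and a Content-Range header; a server that ignores it answers 200 with the whole file, so a client should check the status code before trusting the body.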
On localhost it is about 2-5% slower than reading directly from disk.
Over gigabit Ethernet (1000BASE-T) it is about 10-15% slower than reading
from the local disk. Over the Internet, depending on your connection, it
is unusably slow (around 20-30 seconds to render the first page, which
means it's 1200% slower in my example).
It separates the PDF into 2KB blocks that it caches. As in: It won't
download any given block twice. It doesn't care about memory usage at
all.
It's dumb, meaning it will fetch one block at a time, resulting in over
1000 requests for my test document (to render page 1). That's actually
surprisingly fast as long as the latency is low, which is probably why it
slows down so badly over the net. I imagine that even just a quick call
to precache a whole range, for example when loading an image, would make
the Internet example something like 60-80% faster.
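The block cache and the range precaching idea can be sketched together as follows. This is an illustration under assumptions, not the code in CurlCache.cc: the class name `BlockCache`, its methods, and the injected `fetch` callback are all hypothetical, and the callback stands in for the real HTTP request so the sketch can be tried without a network.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch, not the patch's CurlCache.
class BlockCache {
public:
    // fetch(first, last) returns bytes [first, last] inclusive, or fewer
    // (possibly none) when the range runs past the end of the file.
    using Fetch = std::function<std::string(std::size_t, std::size_t)>;

    explicit BlockCache(Fetch fetch, std::size_t blockSize = 2048)
        : fetch_(std::move(fetch)), blockSize_(blockSize) {}

    // Ensure every block overlapping [offset, offset + length) is cached.
    // Each run of consecutive missing blocks costs exactly one request.
    void prefetch(std::size_t offset, std::size_t length)
    {
        if (length == 0) return;
        const std::size_t lastBlock = (offset + length - 1) / blockSize_;
        std::size_t b = offset / blockSize_;
        while (b <= lastBlock) {
            if (blocks_.count(b)) { ++b; continue; }  // never download twice
            std::size_t runEnd = b;
            while (runEnd < lastBlock && !blocks_.count(runEnd + 1)) ++runEnd;
            const std::string data =
                fetch_(b * blockSize_, (runEnd + 1) * blockSize_ - 1);
            ++requests_;
            // Split the response back into fixed-size blocks.
            for (std::size_t i = b; i <= runEnd; ++i) {
                const std::size_t pos = (i - b) * blockSize_;
                blocks_[i] = pos < data.size() ? data.substr(pos, blockSize_)
                                               : std::string();
            }
            b = runEnd + 1;
        }
    }

    // Read up to `length` bytes starting at `offset` through the cache.
    std::string read(std::size_t offset, std::size_t length)
    {
        prefetch(offset, length);
        std::string out;
        while (out.size() < length) {
            const std::size_t pos = offset + out.size();
            const std::string &block = blocks_[pos / blockSize_];
            const std::size_t within = pos % blockSize_;
            if (within >= block.size()) break;  // past the end of the file
            out += block.substr(within, length - out.size());
        }
        return out;
    }

    std::size_t requests() const { return requests_; }

private:
    Fetch fetch_;
    std::size_t blockSize_;
    std::size_t requests_ = 0;
    std::map<std::size_t, std::string> blocks_;  // block index -> bytes
};
```

Calling `read()` in block-sized steps reproduces the one-request-per-block behavior described above, while a single `prefetch()` over the byte range of an image before reading it collapses those requests into one. As a rough illustration with an assumed 25 ms round trip, 1000 sequential requests cost about 25 seconds in latency alone, regardless of bandwidth.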
It's definitely *NOT* meant for inclusion in mainline poppler, but it's
fun to play with and an interesting start.
It does not change Poppler's outside API at all. The only change is that
you can now provide filenames starting with http://. I tested pdfinfo
and pdftoppm; both worked perfectly without any changes.
(Alternatively, one could also create an HttpStream manually and pass
that to PDFDoc's constructor, although I didn't test that method.)
All tests took place using a 35MB real-world PDF. (I work at a
publishing company.) It is linearized, but I don't currently parse that.
If you wanna play around with this, have fun! I'd definitely love to see
anybody try to compile a version of Evince with this and stream PDFs off
of their intranet. :)
I am very much looking for suggestions and ideas on how to improve
performance. If you'd like to look at that aspect, please look at
CurlCache.cc. I left most of the debugging statements in there; just
activate them and you can watch as it transfers the data. Rendering a
single page with pdftoppm works great as a fairly realistic test case.