[poppler] [PATCH] Experimental HTTP Streaming Support

Sun Oct 18 14:44:01 PDT 2009

On Friday, 16 October 2009, Stefan Thomas wrote:
> Hey popplers and popplerettes,
> 
> I don't remember much from last night, except that when I woke up at 3pm
> today, I found this patch in my git log and tons of scribbled notes
> about an "HTTP block cache" on my desk.
> 
> So some words on what this is and what it isn't.
> 
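> It does stream any PDF file (linearized, non-linearized, object streams,
> whatever) from any HTTP 1.1 server that honors byte-range requests.
> (Range support is optional in HTTP 1.1 iirc, but servers commonly offer
> it for static files; partial responses carry a Content-Range header.)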
> 
> On localhost it is about 2-5% slower than reading directly from disk.
> Over Gigabit Ethernet (1000BASE-T) it is about 10-15% slower than
> reading from the local disk. Over the Internet, depending on your
> connection, it is unusably slow (around 20-30 seconds to render the
> first page, which means it's 1200% slower in my example).
> 
> It separates the PDF into 2KB blocks that it caches. As in: It won't
> download any given block twice. It doesn't care about memory usage at
> this point.
> 
> It's dumb, meaning it will fetch one block at a time, resulting in over
> 1000 requests for my test document (to render page 1). That's actually
> surprisingly fast as long as the latency is low. The per-block round
> trips are probably why it slows down so badly over the net. I imagine
> that even just a quick call to precache a whole range, for example when
> loading an image, would improve performance vastly, maybe making it
> 60-80% faster in the Internet example.
> 
> It's definitely *NOT* meant for inclusion in mainline poppler, but it's
> fun to play with and an interesting start.
> 
> It does not change Poppler's outside API at all. All it does is let you
> provide filenames starting with http:// - I tested pdfinfo and pdftoppm,
> both worked perfectly without any changes. (Alternatively, one could
> also create an HttpStream manually and pass that to PDFDoc's
> constructor, although I didn't test that method.)
> 
> All tests took place using a 35MB real-world PDF. (I work at a
> publishing company.) It is linearized, but I don't currently parse that
> information.
> 
> If you wanna play around with this, have fun! I'd definitely love to see
> anybody try to compile a version of Evince with this and stream PDFs off
> of their intranet. :)
> 
> I am very much looking for suggestions and ideas on how to improve
> performance. If you'd like to look at that aspect, please look at
> CurlCache.cc. I left most of the debugging statements in there; just
> activate them and you can watch as it's transferring the data. Running
> pdftoppm to render a single page works great as a fairly realistic test
> case.
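
To make the trade-off described above concrete (one request per 2KB block
versus precaching a whole range), here is a minimal sketch in Python. It is
not the code from the patch: the names BlockCache, fetch_range, read and
prefetch are hypothetical, and fetch_range stands in for whatever performs
the actual HTTP range request.

```python
BLOCK_SIZE = 2048  # 2KB blocks, as described in the mail


class BlockCache:
    """Hypothetical sketch of a block cache; not the patch's CurlCache."""

    def __init__(self, fetch_range, size):
        self.fetch_range = fetch_range  # callable(start, end_inclusive) -> bytes
        self.size = size                # total length of the remote file
        self.blocks = {}                # block index -> bytes
        self.requests = 0               # round trips made so far

    def _fetch_blocks(self, first, last):
        """Fetch blocks first..last (inclusive) in one request and store them."""
        start = first * BLOCK_SIZE
        end = min((last + 1) * BLOCK_SIZE, self.size) - 1
        data = self.fetch_range(start, end)
        self.requests += 1
        for i in range(first, last + 1):
            off = (i - first) * BLOCK_SIZE
            self.blocks[i] = data[off:off + BLOCK_SIZE]

    def prefetch(self, offset, length):
        """Fetch all missing blocks in the range, one request per run of gaps."""
        end = min(offset + length, self.size)
        if end <= offset:
            return
        first, last = offset // BLOCK_SIZE, (end - 1) // BLOCK_SIZE
        run_start = None
        for i in range(first, last + 2):  # one past the end flushes the last run
            missing = i <= last and i not in self.blocks
            if missing and run_start is None:
                run_start = i
            elif not missing and run_start is not None:
                self._fetch_blocks(run_start, i - 1)
                run_start = None

    def read(self, offset, length):
        """Return the bytes, fetching missing blocks one at a time."""
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            i = pos // BLOCK_SIZE
            if i not in self.blocks:
                self._fetch_blocks(i, i)  # the "dumb" path: one block per request
            lo = pos - i * BLOCK_SIZE
            hi = min(end - i * BLOCK_SIZE, BLOCK_SIZE)
            out += self.blocks[i][lo:hi]
            pos = i * BLOCK_SIZE + hi
        return bytes(out)
```

Reading a 10000-byte span through read() alone costs one request per block
touched, while calling prefetch() on the same span first collapses that into
a single request. No block is ever downloaded twice in either case. Like the
description above, the sketch ignores memory usage entirely.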

Nice proof of concept.

If you ever get to implement that, I'd ask you to try to keep the "network" code 
as abstracted/independent as possible from the caching code, because, for 
example, in the Qt frontend we'd probably prefer to use some Qt code to do the 
downloading rather than curl, which adds a new dependency.
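
A rough sketch of that kind of separation, in Python for brevity. All names
here (RangeFetcher, UrllibFetcher, InMemoryFetcher, CachedReader) are made up
for illustration and are not part of poppler or the patch; the cache shown is
deliberately trivial so the interface boundary stays visible.

```python
import urllib.request
from typing import Dict, Protocol, Tuple


def range_header(start: int, end: int) -> str:
    """Value of the HTTP Range header for an inclusive byte range."""
    return f"bytes={start}-{end}"


class RangeFetcher(Protocol):
    """The only thing the caching layer knows about the network."""

    def fetch_range(self, start: int, end: int) -> bytes: ...


class UrllibFetcher:
    """One possible backend; a Qt or curl based class could take its place."""

    def __init__(self, url: str) -> None:
        self.url = url

    def fetch_range(self, start: int, end: int) -> bytes:
        req = urllib.request.Request(
            self.url, headers={"Range": range_header(start, end)})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # 206 Partial Content
                raise IOError("server did not honor the Range request")
            return resp.read()


class InMemoryFetcher:
    """Test backend: serves ranges from a bytes object and counts calls."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.calls = 0

    def fetch_range(self, start: int, end: int) -> bytes:
        self.calls += 1
        return self.data[start:end + 1]


class CachedReader:
    """Caching layer: depends on RangeFetcher only, never on a backend."""

    def __init__(self, fetcher: RangeFetcher) -> None:
        self.fetcher = fetcher
        self.cache: Dict[Tuple[int, int], bytes] = {}

    def read(self, start: int, end: int) -> bytes:
        key = (start, end)
        if key not in self.cache:
            self.cache[key] = self.fetcher.fetch_range(start, end)
        return self.cache[key]
```

With this split, CachedReader(InMemoryFetcher(data)) and
CachedReader(UrllibFetcher(url)) behave the same from the cache's point of
view, so a frontend can supply its own downloader without touching the
caching code.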

Albert

> 
> Cheers,
> 
> Stefan Thomas
>