Re: [pdf-devel] Comments about the HTTP filesystem implementation

From:	Aleksander Morgado
Subject:	Re: [pdf-devel] Comments about the HTTP filesystem implementation
Date:	Wed, 01 Jun 2011 10:58:23 +0200

Hi William,

> Some comments:
> 
> >  * file_open() should just do an HTTP-HEAD request, and store the
> > information we get back, especially the Accept-Ranges and
> > Content-Length values (when available). Note that it is ok if we
> > don't get a Content-Length value, not a big deal.
> 
> I agree entirely.
> 
> I could imagine, if ranges are allowed, we might want to immediately
> download X bytes (a small number) since the front-end of the PDF will
> almost always be read.  But I agree that downloading the entire file
> immediately is undesirable.
> 

Exactly, yes.

At least for file_open() we shouldn't download any real content.
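
To make the file_open() idea concrete, here is a rough sketch (in Python
for brevity, not the libgnupdf C API) of what we could keep from the HEAD
response. HttpFileInfo and info_from_head() are hypothetical names used
only for illustration:

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class HttpFileInfo:
    """What file_open() would remember from the HEAD response (hypothetical type)."""
    accepts_byte_ranges: bool
    content_length: Optional[int]  # None when the server did not send one


def info_from_head(headers: Mapping[str, str]) -> HttpFileInfo:
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    # Accept-Ranges may list several units; we only care about "bytes".
    units = [u.strip().lower() for u in lowered.get("accept-ranges", "").split(",")]
    length = lowered.get("content-length")
    return HttpFileInfo(
        accepts_byte_ranges="bytes" in units,
        # A missing Content-Length is fine; we just record that it is unknown.
        content_length=int(length) if length and length.isdigit() else None,
    )
```

With Python's http.client, the mapping could come from
dict(response.getheaders()) after a HEAD request; no body is downloaded
at this point.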

> >  * file_read() should:
> >    a) if the HTTP server replied "Accept-Ranges: bytes" in response to
> > the HTTP-HEAD request of the open(), we should set up an HTTP-GET
> > request asking *only* for the byte range we want (a chunk of bytes
> > starting at the current file offset and with the size of the read()
> > operation). See [1].
> >    b) if the HTTP server doesn't support range requests, we should
> > probably return an error. We really don't want to try to read 100 bytes
> > from a file and then end up fully downloading a 4GB file. Thus, it makes
> > sense to support only HTTP 1.1 servers with byte serving capabilities.
> 
> This has some merit.  I guess most modern web servers would support
> ranges for static files (although some PDFs might be generated
> on-the-fly).
> 

That is a very fair point. I'm not completely sure how files generated
on-the-fly behave w.r.t. byte-serving requests. But if they do not
allow byte-serving because they are not static, the application (not
libgnupdf) should download the whole file to a temp file, and just use
the disk filesystem implementation to read from it. I'm not sure why we
would want to get into that ourselves. I think that for a first HTTP
filesystem implementation these byte-serving reads in chunks should be
enough (trying to keep it as simple as possible, and as close as
possible to how the Disk filesystem works, to get something stable
working).
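
As a rough illustration of the file_read() behaviour discussed above
(again in Python, with hypothetical names range_header_for_read() and
RangesNotSupported), the request header for one read could be built like
this, refusing to proceed when the server did not advertise byte ranges:

```python
from typing import Dict


class RangesNotSupported(Exception):
    """Raised instead of silently downloading the whole file (hypothetical error)."""


def range_header_for_read(offset: int, size: int,
                          accepts_byte_ranges: bool) -> Dict[str, str]:
    if not accepts_byte_ranges:
        raise RangesNotSupported("server did not advertise 'Accept-Ranges: bytes'")
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    last = offset + size - 1
    return {"Range": f"bytes={offset}-{last}"}
```

A server that honours the range answers with 206 Partial Content; a
plain 200 would mean the range was ignored and the full body is coming,
which is exactly the case we want to treat as an error.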

> Also, if a bunch of small file_read() calls are made (ex. 200 bytes,
> then 400 bytes, then 300 bytes...), a lot of connections would be
> opened, with an undesirable increase in overhead, unless we are smart
> with recycling connections where possible.  Even if we recycle
> connections, too many small requests would likely cause a bit of a
> slowdown.
> 

That's a problem for whoever uses the library, I would say, not our
problem. If the user of the library asks us to read in chunks of 10
bytes, then we can't do much to improve efficiency. So I don't think
this is a reason on its own. It's worse to open a single connection and
download 4GB that we don't want than to open 20 connections and download
only the 2000 bytes that we really want.

> I presume the PDF parser will only file_read() a given range of the
> file once.  If this is not the case, we should probably consider
> caching the data we download.  I suppose RIA data should also be
> cached.  (I could imagine someone RIAing a large range, but then only
> reading a small chunk from that range in any one file_read().  Maybe
> this is the solution to the problem of too many requests for small
> ranges.)
> 

RIA is another thing I have been thinking about. From my POV, RIA is
just an async read implemented directly by the library, so an
application using RIA would like to do the following:
 * Open the file
 * Request RIA of, for example, the first 2000 bytes right away
 * Do some more stuff in parallel to the RIA operation
 * Once RIA finishes, read the first 2000 bytes.

Of course it could also ask for the first 10000 bytes but then only read
the first 1000 or so.
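
The flow above could look roughly like this from the application side.
This is only a sketch in Python: fetch_range() and REMOTE are stand-ins
for the real ranged HTTP GET and the remote file, and a thread pool
plays the role of the library's async machinery:

```python
from concurrent.futures import ThreadPoolExecutor

REMOTE = bytes(range(256)) * 64   # stand-in for the remote file (16384 bytes)


def fetch_range(offset: int, size: int) -> bytes:
    # Placeholder for the real ranged HTTP GET (hypothetical helper).
    return REMOTE[offset:offset + size]


with ThreadPoolExecutor(max_workers=1) as pool:
    ria = pool.submit(fetch_range, 0, 10000)   # request RIA right away
    other_work = sum(range(1000))              # do some more stuff meanwhile
    prefetched = ria.result()                  # block until the RIA finishes

first_chunk = prefetched[:1000]                # read only the part we need
```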

Anyway, I would modify the RIA operation API so that:
 * When requesting the RIA operation we provide a
progress-reporting-callback to be called once in a while with the
progress (if any can be known).
 * When requesting the RIA operation we provide a finished-callback to
be called once the RIA operation has finished.
 * When requesting the RIA operation, we get back a unique ID of the RIA
operation in the file, so that we can then cancel that specific RIA
operation.
 * We should be able to request several different RIA operations in the
same file, e.g. requesting several different ranges. If the requested
ranges overlap each other, we should transparently skip the overlap
ourselves so that the same chunks are not requested twice from the
HTTP server.
 * All finished RIA operations should be kept on the heap, I would say.
The HTTP filesystem shouldn't need a working underlying Disk filesystem.
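
A minimal sketch of the points above, in Python and with made-up names
(RiaManager, missing_ranges, and the callback signatures are all
assumptions, not an existing libgnupdf API). It runs synchronously to
keep the example short; the real thing would of course be async:

```python
from typing import Callable, Dict, List, Optional, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def missing_ranges(wanted: Range, covered: List[Range]) -> List[Range]:
    """Return the parts of `wanted` that are not already in `covered`."""
    start, end = wanted
    gaps: List[Range] = []
    for c_start, c_end in sorted(covered):
        if c_end <= start:
            continue              # entirely before what we still need
        if c_start >= end:
            break                 # entirely after the wanted range
        if c_start > start:
            gaps.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        gaps.append((start, end))
    return gaps


class RiaManager:
    """Hypothetical per-file bookkeeping for RIA operations."""

    def __init__(self, fetch: Callable[[int, int], bytes]) -> None:
        self._fetch = fetch                     # fetch(offset, size) -> bytes
        self._next_id = 1
        self._pending: Dict[int, Range] = {}
        self._cache: Dict[Range, bytes] = {}    # finished chunks, kept in memory

    def request(self, offset: int, size: int) -> int:
        """Register a RIA operation and return its unique ID."""
        ria_id = self._next_id
        self._next_id += 1
        self._pending[ria_id] = (offset, offset + size)
        return ria_id

    def cancel(self, ria_id: int) -> bool:
        """Cancel one specific pending operation; False if it is unknown."""
        return self._pending.pop(ria_id, None) is not None

    def run(self, ria_id: int,
            on_progress: Callable[[int, int, int], None],
            on_finished: Callable[[int], None]) -> None:
        wanted: Optional[Range] = self._pending.pop(ria_id, None)
        if wanted is None:
            return                               # cancelled or never requested
        # Only ask the server for bytes that no earlier RIA already fetched.
        gaps = missing_ranges(wanted, list(self._cache))
        for done, (start, end) in enumerate(gaps, 1):
            self._cache[(start, end)] = self._fetch(start, end - start)
            on_progress(ria_id, done, len(gaps))
        on_finished(ria_id)
```

For example, after a RIA of bytes 0-999 has finished, a second RIA of
bytes 500-1499 would only fetch bytes 1000-1499 from the server.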

All in all, this RIA behaviour could also be implemented in the same way
in the Disk filesystem.

Cheers,

-- 
Aleksander
