Re: lynx-dev Why doesn't lynx cache HTML source?


From: Klaus Weide
Subject: Re: lynx-dev Why doesn't lynx cache HTML source?
Date: Wed, 18 Nov 1998 08:14:22 -0600 (CST)

Commenting on an older message:

On Sun, 15 Nov 1998, David Woolley wrote:

> > > A large proportion of web pages these days are uncacheable, often for
> > > misguided commercial reasons.  I strongly suspect that a disproportionate
> > > number of the ones that people will need to view source on or parse in
> > > different ways will fall in this category.

Yeah, probably...  But looking at the SOURCE of completely valid and 
static pages is still useful.  And toggling image links with '*' and so
on.

> > Thats mostly because Lynx say "HTTP/1.0" in it's header and server reply so.

(No, as mentioned elsewhere.)

> > HTTP 1.1 have unique ETag that allow advanced validation for any cached 
> > data.
> > So most benefits from lynx cache - to receive short response like HEAD
> > instead of fetching a complete document (sometimes even a head-like request
> > not needed but this is an obvious check and not for your case).
> 
> Very little of the web server and cache software around supports ETag.

Apache does.  It generates ETag whenever it automatically generates
Last-Modified for a static file, AFAIK.

> But, in any case most cacheability failures are due to apparently deliberate
> attempts to frustrate it (accurate hit counts are much more saleable than
> fast access, it seems) or dynamic content that goes way beyond content
> negotiation (I don't think many people even know it is possible).

"Deliberate" or, maybe more often, "We never thought of it".  See
<http://lynx.browser.org/>: why is there no Last-Modified header?

> > You will help me considerably if you read spec and compare against
> > the comments in HTTP.c - actions on return status (200, 304, etc, etc.)
> 
> Looks like I will have to do this.

[ ... ]

> > If any browser revalidate something once per session it obviously
> > break the spec: there are a special http/1.1 rules for this,
> > for example, Expired or "no-cache" documents should be validated every time
> > we are trying to access them.
> 
> I think that IE4 will honour Expires - in fact we had to remove an
> immediate expires from a dynamic page because it was expired even
> before it could be handed on to an OLE helper (Excel).  I'd have to check
> the status of no-cache.  But most pages do not have these headers (and
> when they start to, will probably have the authoring tools defaults!).
> IE4 does not use heuristics on Last Modified Date and will still cache
> pages without this header and without other cache controlling headers.
> There is a user configuration option with three values: never revalidate,
> validate once per session and always revalidate.  Out of the box it
> is once per session and most users will never change it from that.
> Lynx currently also behaves as once per session!

:)

Well Lynx behaves as "once per session", except that "revalidation" always
consists of a new fetch (no If-Modified-Since).
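
To illustrate the difference between a plain refetch and a real
revalidation: the only change on the request side is which validators are
sent along.  The following is a minimal Python sketch, not Lynx code;
`build_request_headers` and `handle_status` are hypothetical names chosen
for this example.

```python
def build_request_headers(last_modified=None, etag=None):
    """Return extra request headers for revalidating a cached copy.

    With no validators this returns an empty dict, i.e. an unconditional
    fetch, which is what "revalidation" amounts to without
    If-Modified-Since.
    """
    headers = {}
    if last_modified is not None:
        # The server may answer 304 Not Modified if nothing changed.
        headers["If-Modified-Since"] = last_modified
    if etag is not None:
        # HTTP/1.1 entity-tag validator.
        headers["If-None-Match"] = etag
    return headers


def handle_status(status, cached_body, new_body):
    """Pick the body to display: a 304 means the cached copy is still good."""
    if status == 304:
        return cached_body
    return new_body
```

With a 304 reply only headers cross the network; with any other status the
full body is transferred as before.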

> Calculation based revalidation normally requires an extensive set of rules,
> e.g. most people configure external caches to assume that the likely lifetime
> of a .gif is a much larger proportion of its lifetime up to the point of
> fetching than they would for a .html or .htm.  The rules may also favour
> certain sites.

It doesn't have to be that difficult.  Of course one can make very complex 
rules, but a simple "10% of the age when received" heuristic for freshness
lifetime can go a long way.
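
As a rough illustration of how little is needed, the 10% heuristic could be
sketched like this in Python.  The function names and the `percent`
parameter are assumptions made for this example; the sketch also assumes
that the age when received is the difference between the time of receipt
and Last-Modified.

```python
from datetime import datetime, timedelta


def heuristic_freshness_lifetime(received, last_modified, percent=10):
    """Freshness lifetime as a percentage of the document's age when received."""
    age_when_received = received - last_modified
    if age_when_received < timedelta(0):
        # Last-Modified lies in the future: treat the document as stale.
        return timedelta(0)
    return age_when_received * percent / 100


def is_fresh(now, received, last_modified):
    """True while the time since receipt is below the heuristic lifetime."""
    lifetime = heuristic_freshness_lifetime(received, last_modified)
    return now - received < lifetime
```

A document last modified ten days before it was received would be treated
as fresh for one day under this rule.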

> The formal backing for this behaviour is section 13.13. of RFC 2068,
> although there might be semantic arguments about the border between
> caching and history mechanisms (as a user, I would expect the same result
> from using the back button to return to a home page and using a link
> on the subordinate page); in my view, all those wanting unrendered
> caching in Lynx to support the \ command would want the history
> interpretation, to avoid refetching of dynamic content.

I think we all agree that '\' and similar should ideally have "history"
behavior, as opposed to "semantically transparent" behavior.  Although
that is a deliberate choice of one of several possible interpretations.

Regarding the parenthesis: You may expect that, but it would be wrong.
And there is no way for an author to say "this link should have
history behavior".  That is for "normal" links, disregarding any
scripting.

> In fact this section has a warning that using revalidation for history
> pages may actually cause site designers not to use cache control
> information properly on their pages, because to do so might force
> unexpected reloads and unstable content.
> 
> > The rules insist on validating (either by local calculation or with remote)
> > for entry using of cached data, no more nor less.
> > I think we may be a little more strict and ask the remote (server or proxy)
> > for validation when we could do this but too lazy to do our calculations.

We should never make more network requests than necessary, if there is no
other benefit, just because someone was too lazy!

> > Anyway, this is a small overhead and could be easily done
> > when the main code will be implemented (not so easy!).
> 
> Revalidation itself can sometimes be slow (and, as was the main point of the
> article, will very often result in a complete reload of the page). 

Indeed.

> Slowness
> is particularly a problem for GUI browsers, where a large number of GIFs
> may have to be revalidated!  It also makes it impossible to operate the
> browser in offline mode.  IE4 has explicit support for this, but even
> Lynx will satisfy pages from its rendered cache, even though you've
> stopped paying the phone company or ISP for connectivity.

Let's imagine a scale of possible behavior (although that cannot really all
be expressed in one dimension):

     complete                                                 use all
     semantic   <----------------------------------|-------> cached data
   transparency                                    |          forever
                                            Lynx in general

Complete transparency means: whenever a link is followed, a new request
is made.  The other end of the scale would be: ignore all
"no-cache", "expires" etc. directives, keep documents around as long as 
they fit in the cache, and never re-request what we already have. 
No client implements either extreme (by default).

Lynx is currently closer to the right; it is not completely there
because it honors no-cache in responses and (some forms of) Expires,
and resubmits POST and HEAD requests.

There are several "modes" of going to a page:
 1) explicit reload: ^R, 'x'
 2) '*', '\', '[', '"', ^V, etc.
 3) form submissions (POST)
 4) following a normal link, entering an address with 'g' etc.; default
 5) going back in history: either left arrow, or link from History Page

I have ordered them according to the scale above, 1) corresponds to
the left end, 5) to the right end.
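
The ordering of the five modes can be written down as an ordered
enumeration.  This is only a Python sketch to make the scale concrete; the
member names and `is_more_transparent` are hypothetical labels, not
anything that exists in Lynx.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Ways of going to a page, from the left end (1) to the right end (5)."""
    EXPLICIT_RELOAD = 1   # ^R, 'x'
    REPARSE = 2           # '*', '\', '[', '"', ^V, etc.
    FORM_POST = 3         # form submissions (POST)
    NORMAL_LINK = 4       # following a link, 'g', etc.; the default
    HISTORY = 5           # left arrow, or link from the History Page


def is_more_transparent(a, b):
    """True if mode `a` sits further left on the scale than mode `b`."""
    return a < b
```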

I would argue that any change or addition of caching mechanisms should
not move Lynx much to the left or to the right, for any of the modes,
_by default_ -- except for 2), see below.

I don't want a lynx session to act much more semantically transparent
by default (I have a slow link, too), especially for 4), although it would
be more correct to do so (it should follow the rules HTTP sets for caches
more closely).  But it would be nice to be able to configure Lynx to
act more semantically transparent.

I also don't want Lynx to act (much) more relaxed by default.  But it
would be nice to be able to configure Lynx to do even less checking.
(For example, never honor Expires, or ignore it in META tags.)

Mode 2 is different, because it is close to the left by accident and
not by design: we would like it to behave like 5 but cannot since we
throw away the raw bytes.  So now someone wants to implement a cache
for raw bytes of HTML documents to achieve that.  Apart from the
implementation, the major question is: how should this change the behavior
in other modes?

If the answer is: It shouldn't, by default -- then the minimal solution
is simple: just use the new rawdata cache for what it was intended,
that is only mode 2 requests.  It is very tempting to reuse the rawdata
for mode 5 requests -- it seems such a waste not to do it -- but we don't
have to do it.

If the new rawbyte cache never gets used for requests other than mode 2,
then no change is needed in the rules for when to make a new request,
and no If-Modified-Since/ETag implementation is needed to preserve the
current behavior.  IMS/ETag/304 could still be implemented later, but
that is then a separate problem.  (It could also be implemented right
now for the existing rendered-doc cache, [except for the
"language confusion" problem,] which shows that it is separate.)

But it DOES seem wasteful to keep raw data if in most cases we won't
use them.  But:
  A. The implementation doesn't have to be complicated if
     - We never keep the cached raw data around for longer than we
       keep the rendered text, with one exception: during a mode 2
       request when the data is reparsed.
     - We keep cached raw data in memory (not files).  We could 
       simply put each bufferful of data into a HTChunk while it is being
       received.
     - There is no new expiration, validation, or timestamp comparison
       logic, so no new metadata needs to be stored.
  B. It can be greatly restricted what documents get entered into the
     cache in the first place.  We have a choice of
     - Caching everything received.
     - Caching all text/html, maybe with further restrictions based
       on URL, method, etc.
     - Require explicit user action.  Maybe a special "Enter cache" key, 
       meaning "I am going to want this text reparsed, so start caching
       it".
     - When '\' is pressed the first time for the current text, we mark
       it for rawbyte caching.  There could be a confirmation question.
       That means at least two network requests are needed, but after that
       cached data is used.
     When we go to view a document in other than mode 2, the cached
     data can be thrown away.  Or alternatively, whenever there is a new
     network request.
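
Points A and B can be sketched together as a small in-memory structure.
This is an illustrative Python sketch only, assuming the "mark on first
use" policy from B; `RawCache` and its methods are hypothetical names, and
a plain list of buffers stands in for an HTChunk.

```python
class RawCache:
    """Raw bytes held in memory, kept no longer than the rendered text."""

    def __init__(self):
        self._chunks = {}     # url -> list of received buffers
        self._marked = set()  # urls the user asked to cache

    def mark(self, url):
        """User pressed a reparse key for this text: cache it from now on."""
        self._marked.add(url)

    def begin(self, url):
        """A new transfer for `url` starts: forget any older raw data."""
        self._chunks.pop(url, None)

    def append(self, url, buf):
        """Store one bufferful as it is received, for marked documents only."""
        if url in self._marked:
            self._chunks.setdefault(url, []).append(bytes(buf))

    def get(self, url):
        """Return the complete raw data, or None if nothing was cached."""
        parts = self._chunks.get(url)
        return b"".join(parts) if parts is not None else None

    def drop(self, url):
        """The rendered text is gone or another mode was used: discard."""
        self._chunks.pop(url, None)
        self._marked.discard(url)
```

No timestamps or validators are stored here, which matches the last item
under A: the cache has no expiration logic of its own.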

Note that this could be done without significant changes in mainloop().
It would just have to set a "this is a mode 2 request" flag, and
HTuncache_current_document() might have to take care to preserve
existing raw data in this case (and maybe get rid of it otherwise).
Then other lower-level functions could handle the storing to cache
and reading from it. The mainloop() function wouldn't have to know
where the new data comes from.
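
The division of labor described above might look roughly like this.  It is
a Python sketch under simplifying assumptions: a plain dict stands in for
the raw data cache, and `load_document`, `fetch`, and `is_mode2_request`
are hypothetical names, not existing Lynx functions.

```python
def load_document(url, raw_cache, fetch, is_mode2_request):
    """Return raw bytes for `url`; the caller need not know their origin."""
    if is_mode2_request:
        cached = raw_cache.get(url)
        if cached is not None:
            # Reparse from memory without touching the network.
            return cached
    else:
        # Any other mode: get rid of raw data kept for reparsing.
        raw_cache.pop(url, None)
    data = fetch(url)
    if is_mode2_request:
        # Keep the bytes so the next mode 2 request can reuse them.
        raw_cache[url] = data
    return data
```

The caller only sets the flag; whether the bytes come from the network or
from memory is decided entirely inside the lower-level function.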

> Actually, if there is a case for source caching in Lynx, as against an
> external caching proxy, it is that it can relax the revalidation rules.

I agree that that would be at least an important benefit.  It should
be configurable how relaxed we are.

My above ideas only address the specific case of what I called mode 2
requests.  For all other situations, the answer "use an external
caching proxy" would still apply.

    Klaus
